ComScore Uses Hadoop to Tame Its Big Data Flow
Market researcher ComScore tames a torrent of data with open-source analytics.
Mike Brown, the CTO at ComScore, knows a bit about managing big data. Every day 12 terabytes of information rushes into his cluster of 80 servers running the open-source software Hadoop, which sorts and analyzes the data for a host of clients who want to know things like which online vendor sold the most e-cards or how fast Facebook is growing in Brazil. "We ingest 32 billion new rows of a data a day," he says.
The torrent of data is swelling fast enough that Brown plans to be running 200 servers by the end of the year--and without the right data-integration software, he thinks that number could double. Brown has been wading through an ocean of information ever since he joined ComScore as its first software engineer in 1999, shortly after the startup landed its original venture financing. Today the Internet market research firm reports $232 million in revenue a year. "Our growth is pretty darn linear and should continue," Brown says.
ComScore started off on a homegrown grid processing stack and in 2000 added Syncsort's data integration software, the current version of which is DMExpress. "We were up and running in weeks," Brown says. "It literally made our software run 5-10 times faster. You're not just adding storage, but you're adding compute as well."
In 2009, ComScore began migrating to Hadoop, becoming an early adopter of the technology, which has recently begun gaining traction in the enterprise market.
"We decided it was better to leverage the community than invest in building our own," Brown says. "In general, Hadoop is harder to bring into an enterprise when you have mixed operating systems. DMExpress, with their connector, is helping to solve this issue."
That's a typical experience, notes James Kobielus, in a recent report for Forrester Research, where he was an analyst. Hadoop, he wrote, "lacks some critical enterprise data warehouse features, such as real-time integration and robust high availability. The Hadoop market includes many vendors that have focused on these and other deficiencies in the core Hadoop stack. Vendors have, of necessity, either built proprietary extensions to address these requirements or have leveraged various NoSQL tools and open source code to provide the requisite functionality."
In ComScore's case, Brown found that Syncsort's software made the Hadoop migration a piece of cake. "You don't have to change any code, except the push code," he says. "We use DMExpress in [more than] 30 different apps. It's our tool for any situation [where] we have to adjust the data."
"We can store twice as much data on the cluster," he continues, "and we also use it to improve performance. One big problem it solved was the ability to chunk and split the large files we have into files that fit perfectly into the chunks on Hadoop. This enables us to have a higher rate of parallelism on compressed files while reducing our costs for disk on the cluster."
That, Brown says, translates into saving 75 terabytes of data storage a month. That, too, is big data.
While the buzz around big data analysis is at a peak, there is less discussion about how to get the necessary data into the systems in the first place, which can involve the cumbersome task of setting up and maintaining a number of data processing pipelines.
Next-generation endpoint protection vendor SentinelOne has received the same certification that many traditional antivirus platforms seek, meaning it can be considered suitable for meeting certain requirements of industry and governmental regulations.
Smartphone sales increased substantially in the second quarter of 2015, but the rate of growth continued to slow, fueling concerns that the market has started to become saturated, according to a study released today by Juniper Research.
Attackers could exploit a new vulnerability in BIND, the most popular Domain Name System (DNS) server software, to disrupt the Internet for many users.