ComScore Uses Hadoop to Tame Its Big Data Flow
Market researcher ComScore tames a torrent of data with open-source analytics.
Mike Brown, the CTO at ComScore, knows a bit about managing big data. Every day 12 terabytes of information rushes into his cluster of 80 servers running the open-source software Hadoop, which sorts and analyzes the data for a host of clients who want to know things like which online vendor sold the most e-cards or how fast Facebook is growing in Brazil. "We ingest 32 billion new rows of a data a day," he says.
The torrent of data is swelling fast enough that Brown plans to be running 200 servers by the end of the year--and without the right data-integration software, he thinks that number could double. Brown has been wading through an ocean of information ever since he joined ComScore as its first software engineer in 1999, shortly after the startup landed its original venture financing. Today the Internet market research firm reports $232 million in revenue a year. "Our growth is pretty darn linear and should continue," Brown says.
ComScore started off on a homegrown grid processing stack and in 2000 added Syncsort's data integration software, the current version of which is DMExpress. "We were up and running in weeks," Brown says. "It literally made our software run 5-10 times faster. You're not just adding storage, but you're adding compute as well."
In 2009, ComScore began migrating to Hadoop, becoming an early adopter of the technology, which has recently begun gaining traction in the enterprise market.
"We decided it was better to leverage the community than invest in building our own," Brown says. "In general, Hadoop is harder to bring into an enterprise when you have mixed operating systems. DMExpress, with their connector, is helping to solve this issue."
That's a typical experience, notes James Kobielus, in a recent report for Forrester Research, where he was an analyst. Hadoop, he wrote, "lacks some critical enterprise data warehouse features, such as real-time integration and robust high availability. The Hadoop market includes many vendors that have focused on these and other deficiencies in the core Hadoop stack. Vendors have, of necessity, either built proprietary extensions to address these requirements or have leveraged various NoSQL tools and open source code to provide the requisite functionality."
In ComScore's case, Brown found that Syncsort's software made the Hadoop migration a piece of cake. "You don't have to change any code, except the push code," he says. "We use DMExpress in [more than] 30 different apps. It's our tool for any situation [where] we have to adjust the data."
"We can store twice as much data on the cluster," he continues, "and we also use it to improve performance. One big problem it solved was the ability to chunk and split the large files we have into files that fit perfectly into the chunks on Hadoop. This enables us to have a higher rate of parallelism on compressed files while reducing our costs for disk on the cluster."
That, Brown says, translates into saving 75 terabytes of data storage a month. That, too, is big data.
After a profitable fourth quarter in fiscal year 2014 following a corporate restructure, Canadian smartphone maker BlackBerry is reportedly targeted for acquisition by Microsoft, Xiaomi, Lenovo and Huawei.
BlackBerry plans to lay off an unspecified number of staff in its devices unit, as it attempts to make that business profitable, while expanding in other areas.
Fluke Networks has unveiled a comprehensive strategy to align its product portfolio around the needs of the "borderless enterprise".
Captive centers -- in-house IT and business process delivery arms -- accounted for one quarter of the $150 billion global services market last year, according outsourcing consultancy and research firm Everest Group.