What’s next for open-source Spark?
Boston -- A conference focused on a single open source project sounds like the sort of event that would feature a lone keynote speaker addressing maybe 100 interested parties in a lecture hall at a local college. Spark Summit East was very much the opposite.
A total of 1,503 people gathered in a cavernous ballroom at the Hynes Convention Center to watch five keynote speakers lay out the future of Spark, the big data processing engine originally developed at the University of California, Berkeley by Matei Zaharia. Spark underlies huge data-driven applications used by major players like Salesforce, Facebook, IBM and many others, helping organize, analyze, and surface specific grains of sand from beach-sized databases.
Part of the reason that Spark has taken off in such a big way, said Zaharia from the stage, is that Moore’s Law has slowed down considerably of late. While the average data center network connection is about 10 times faster than it was even seven years ago, and the average storage I/O rate has grown by a similar amount, CPUs have remained roughly the same.
Hardware manufacturers are working around the problem by using simpler devices like GPUs and FPGAs, but it can be a lot of work moving applications onto completely different silicon, he noted. Spark’s moving to take advantage of new hardware platforms, according to Zaharia, but it’s also working to maximize performance on existing systems.
“The effort to do this is called Project Tungsten, which began about two years ago, to optimize Spark’s CPU and memory usage using two things – a binary storage format that escapes the [Java virtual machine] and is no longer tied to the limits of that, and runtime code generation,” he said.
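The two ideas Zaharia names can be illustrated with a toy sketch in plain Python. This is not Tungsten's actual row format or code generator; the record layout, the `generate_filter_sum` helper and the sample query are all assumptions made up for illustration. It only shows the general shape: pack rows into one contiguous binary buffer instead of one object per value, then generate and compile a function specialized to a single query at runtime.

```python
import struct

# Toy records: (user_id, score). Stored as Python objects, every row is a
# tuple pointing at separately allocated int and float objects.
rows = [(i, i * 0.5) for i in range(1000)]

# Idea 1: a compact binary layout. Each row becomes 12 fixed-width bytes
# (4-byte int + 8-byte double) inside one contiguous buffer.
ROW = struct.Struct("<id")
buf = bytearray(ROW.size * len(rows))
for n, row in enumerate(rows):
    ROW.pack_into(buf, n * ROW.size, *row)


# Idea 2: runtime code generation. Build source text for one specific query
# ("sum the scores of rows with user_id >= threshold"), compile it once, and
# call the specialized function. (Hypothetical helper, for illustration only.)
def generate_filter_sum(threshold):
    source = (
        "def kernel(buf, row, count):\n"
        "    total = 0.0\n"
        "    for n in range(count):\n"
        "        user_id, score = row.unpack_from(buf, n * row.size)\n"
        f"        if user_id >= {threshold}:\n"
        "            total += score\n"
        "    return total\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["kernel"]


kernel = generate_filter_sum(threshold=990)
result = kernel(buf, ROW, len(rows))

# Cross-check against the plain object-based computation.
expected = sum(score for user_id, score in rows if user_id >= 990)
```

In a real engine the payoff comes from avoiding per-object overhead and garbage collection, and from compiling the generated code down to machine instructions; the Python version only mirrors the structure of the approach.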
Michael Armbrust is an engineer at Databricks, which sells a hosted Spark environment and is one of the chief sponsors of the summit. He traced the genesis of Spark back to Zaharia’s realization that cross-machine complexity – i.e. errors caused by the use of large groups of computers to work on a single problem – was going to be a stumbling block.
Optimization is still a key concept for Spark development, but the way in which that optimization happens is a little different.
“Roll forward to the year 2013, and a lot of people are using Spark, but what we’re finding is that a lot of people are spending their time tuning their computation,” said Armbrust. “You have to make sure you’re minimizing overheads like garbage collection, you want to make sure you’re getting the last inches of performance out of your cores … what you really want is just a high-level language that allows you to quickly and concisely express common computations.”
This, coupled with the fact that 95% or more of Spark users are running SQL datasets, led to the development of Spark SQL, a language that “allow[s] you to just quickly say what you want Spark to figure out, and you leave it up to Spark to figure out exactly the most efficient way to perform that computation.”
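The contrast Armbrust describes, between hand-tuning a computation and simply stating the desired result, can be sketched in a few lines. The example below uses Python's built-in sqlite3 module as a stand-in query engine, not Spark SQL itself, and the `events` table and its data are invented for illustration. Both versions compute the same per-city totals; only the second leaves the execution strategy to the engine.

```python
import sqlite3

# Hypothetical sample data: (city, clicks).
events = [("boston", 3), ("nyc", 5), ("boston", 4), ("sf", 1), ("nyc", 2)]

# Hand-written version: the programmer decides how to scan and aggregate.
manual = {}
for city, clicks in events:
    manual[city] = manual.get(city, 0) + clicks

# Declarative version: state the result wanted and let the engine plan it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (city TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
declarative = dict(
    conn.execute("SELECT city, SUM(clicks) FROM events GROUP BY city")
)
conn.close()
```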
Salesforce Senior Engineering Manager Alexis Roos detailed how his team is putting some of Spark's capabilities to use to broaden the horizons of the company’s flagship Sales Cloud and Salesforce Inbox products.
“Using AI, we can make Salesforce Inbox smarter,” said Roos, before outlining the type of complex connections the system is able to make to ensure that the correct people are identified as hot leads, and to determine which contacts to make and in what order.
“We want to tell users why an email is important, but we don’t want to stop there,” Roos said. “We also want to tell them what they should do about it.”
Few people are working with bigger datasets than Cotton Seed, senior principal engineer at MIT and Harvard’s Broad Institute, which studies genomics using reams of digital information that rival YouTube for sheer scale. Broad – pronounced to rhyme with “road” – generates 17TB of new genome data every day, and manages a total of 45 petabytes of information.
YouTube’s still bigger at 25TB per day and 86 petabytes total, but that will change soon, according to Seed. By 2025, he said, genomics research around the world will be taking in more than 20 exabytes – or 20 billion gigabytes – per year.
“That would be about $400 million a month, just in raw storage costs,” he said, adding that the compute tasks required to analyze that data in, for example, Google’s cloud would result in fees of nearly $6 billion per month, along with an “are you sure you typed that right?” query from the Google Cloud Platform estimation tool.
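Seed's storage figure can be sanity-checked with a quick back-of-the-envelope calculation. The sketch below assumes the $400 million refers to storing one year's worth of data (20 exabytes) for a month and uses decimal units; the per-gigabyte price it produces is implied by the article's numbers, not a rate quoted by Seed or by any cloud provider.

```python
EXABYTE_IN_GB = 10**9  # decimal units: 1 EB = 1 billion GB, as in the article

data_gb = 20 * EXABYTE_IN_GB        # 20 exabytes of genomic data
monthly_storage_cost = 400_000_000  # "$400 million a month"

# Implied unit price in dollars per gigabyte per month (an inference from
# the two figures above, not a published price).
implied_price_per_gb_month = monthly_storage_cost / data_gb

print(f"Implied price: ${implied_price_per_gb_month:.2f} per GB per month")
```

The figures work out to about two cents per gigabyte per month.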
“It’s really gonna require innovations in computing technology and large data to continue to maintain our current pace of innovation in biomedicine,” Seed said, with some understatement.
For the moment, Seed’s team has created Hail, a platform built on Spark and designed to process genetic data more efficiently. It uses a high-level language, à la Spark SQL, to automate certain basic analysis tasks, is highly scalable, and is designed to be easy to use for non-computer scientists, i.e. most people in genomics labs.