Strava to bring more data to users after shifting its data lake to Snowflake

Strava, the hugely popular fitness tracking app based in San Francisco, has recently shifted its massive data lake from Amazon Redshift to cloud upstart Snowflake. Its reason for shifting is a common one among new Snowflake customers: Redshift was creaking under the concurrency demands of growing data science and analytics workloads, and query times were becoming unsustainable.

Scott Carey Jan 22nd 2019

Cathy Tanimura, senior director of analytics and data science at Strava, told Computerworld UK that before the move to Snowflake analysts were having to run queries over their lunch break or even overnight. "I don't see or hear that anymore," she said. "Every once in a while it might be a little bit slow across 1 billion rows, but the productivity of the team and that ability to stay in the flow is a game changer for us."

Tanimura gave the example of an analyst who last year was analysing the activity feed in the Strava mobile app to see what users engaged with and where they were giving kudos (essentially Strava's equivalent of a Facebook like). "It took weeks to make sense of this, kicking off an hour-long query just to browse that data," Tanimura said. "Now, this year, we can have people query this and grab a part of the data and get insight almost immediately."

Strava stores 120TB of data today, including 13 trillion GPS data points, 15 million uploads per week and 1.5 billion analytics points, which allow its analytics team to spot pinch points in the app where users might not be getting the best possible experience.

According to Tanimura the migration was easy: it started in March last year and was complete by June. "It was painless," she said, "and I have been through other migration changes and was expecting all sorts of pain and data accuracy issues and it didn't turn out that way. There was no major retraining needed, it's still SQL in the end," she added.

The company also seamlessly switched its Looker frontend for data visualisation to run on top of Snowflake, allowing more business users to engage with this user data without having to know SQL.

Benefits of Snowflake

By decoupling compute and storage, Snowflake has been able to overcome the kind of concurrency issues Strava faced, and seems to be striking a chord with companies that have been reliant on Redshift.

In a video about the switch, Carlin Eng, a data engineer at Strava, said that Redshift "didn't handle the concurrency very well", and that "we selected Snowflake primarily because it handled that concurrency situation really well. We saw with the separation of compute and storage we were able to spin up independent compute clusters to have all of our users accessing the data and not really contending with each other."

Snowflake promises almost limitless scale and concurrency by spinning up new cloud instances (EC2 on AWS, for example) for each workload, so that each one effectively runs as a standalone data warehouse but all under the same roof, and data science queries never tread on the toes of BI, or vice versa.

Strava isn't the first Snowflake customer to publicly slate Amazon's Redshift database. Last year retailer Not On The High Street criticised Redshift for its lack of scalability, and the year before food delivery company Deliveroo made the change because the old data warehouse "couldn't handle concurrent users".

The benefits tend to focus on freeing up engineering resource and allowing more users to hit the data and start mining it for insight, but cost is also a factor.

That being said, Tanimura stressed that the priority is "growing the platform and our revenue, so we don't necessarily see it as a cost-saving".

Instead, "we have compelling and interesting capabilities to use data to help user acquisition, marketing awareness and to onboard customers, to understand what is successful and help people use the product. So we think more about growing the top line and keep the costs part in a good scalable position."

New features

More than 35 million users rely on Strava to track their cycles, runs and hikes, giving the company a rich data set to work with.

Crucially, this new data infrastructure cuts query times by a major factor, allowing for greater experimentation among the small but growing data science team, who can now spend less time worrying about optimising queries or whether the infrastructure will hold up under pressure.

By smoothing out its data infrastructure, Strava can experiment more with its data to produce features such as its global Heatmap, quantify effort through heart rate, or optimise its Grade Adjusted Pace metric.

For example, users were able to create an end-of-year video based on their app usage for 2018, something that would have been far more cumbersome with the previous data infrastructure, according to Tanimura.

"I am excited about some of the data science projects and how they can help improve the product experience and build new products and give that data back to people to help their training and what they might want to achieve," she concluded.