5 reasons data science belongs in the cloud

Why laptop-hugging data scientists and server-hugging machine learning practitioners should consider letting go

Piero Cinquegrana, InfoWorld | February 21, 2019

In a world inundated with data, data scientists help enterprises generate insights and predictions to enable smarter business decisions. Typically, these data scientists are experts in statistical analysis and mathematical modeling and proficient in programming languages such as R or Python.

However, barring a few large enterprises, most data science is still done on laptops or on-prem servers, resulting in inefficient processes that are prone to errors and delays. Having observed how a few cutting-edge companies are putting data to work, I can tell you that “laptop data science” will soon go the way of the dinosaur. It’s inefficient, it doesn’t lend itself well to collaboration, and it can’t produce the best results.

Here are five good reasons data scientists should get off their laptops or local servers and into the cloud.

Data science is a team sport
Algorithms and machine learning models form one piece of the advanced analytics and machine learning puzzle for enterprises. Data scientists, data engineers, machine learning engineers, data analysts, and citizen data scientists all need to collaborate on these elements to deliver data-driven insights for business decision making.

When data scientists are building models on their laptops, they download data sets created by data engineers onto their machines to build and train machine learning models. Sometimes they’ll use an on-prem server for building and training, but often it’s the laptop. Due to the computing and memory limitations of laptops and on-prem servers, these data scientists have to sample the data set to create a smaller, more manageable data set to work with. While these sample sets can help get a project off the ground, they create problems at later stages in the data science lifecycle: a model trained on a small or skewed sample may not generalize once it meets the full population in production.
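To make that sampling trade-off concrete, here is a minimal Python sketch of the laptop workflow described above; the file name and the “churned” label column are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical laptop workflow: the full extract from the data engineer
# is too large for local RAM, so only a fraction of the rows is kept.
events = pd.read_csv("events_extract.csv")
sample = events.sample(frac=0.05, random_state=42)  # keep 5% to fit in memory

# A rare class can vanish from a uniform sample; stratifying by the label
# preserves class balance but still discards most of the signal.
stratified = (
    events.groupby("churned", group_keys=False)
          .apply(lambda g: g.sample(frac=0.05, random_state=42))
)
```

Even the stratified version only mitigates the problem; the model still never sees 95 percent of the data.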

Data staleness also becomes an issue. With local copies of the data, data scientists may be building predictions based on an inaccurate snapshot of the world. Using larger, more representative samples from a central cloud location would alleviate this concern.

Big data beats smart algorithms
The recent surge of interest in artificial intelligence and machine learning is driven by the ability to quickly process and iterate (train and tune the machine learning model) over large volumes of structured, unstructured, and semi-structured data. In almost all cases, machine learning benefits from being trained on larger, more representative sample sets.

Enterprises can unlock powerful use cases by combining semi-structured interaction data (website interaction logs, event data) and unstructured data (email text, online review text) with structured transaction data (ERP, CRM, order management systems). The key to unlocking business value from machine learning is having large data sets that combine transactional and interaction data. With this increased scale, the data often needs to be processed on the cloud or in large on-premises clusters. Adding a laptop to the mix creates a bottleneck in the entire flow and leads to delays.
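As a toy illustration of that transaction-plus-interaction join, here is a pandas sketch; the customer IDs, event fields, and values are invented for the example.

```python
import pandas as pd

# Semi-structured clickstream events, as they might arrive from a log pipeline.
clickstream = [
    {"customer_id": 1, "event": "page_view", "props": {"page": "/pricing", "ms": 5400}},
    {"customer_id": 2, "event": "click", "props": {"page": "/docs", "ms": 1200}},
]
events = pd.json_normalize(clickstream)  # flattens the nested 'props' into columns

# Structured transaction data, as it might come from a CRM or order system.
orders = pd.DataFrame({"customer_id": [1, 2], "lifetime_value": [2500.0, 310.0]})

# The combined table is what the model actually trains on.
training_set = events.merge(orders, on="customer_id", how="left")
print(training_set.columns.tolist())
# ['customer_id', 'event', 'props.page', 'props.ms', 'lifetime_value']
```

At toy scale this runs anywhere; at enterprise scale, this same join is exactly the step that outgrows a laptop.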

Data science needs flexible infrastructure
These days, data scientists can leverage numerous open source languages and machine learning frameworks such as R, scikit-learn, Spark MLlib, TensorFlow, MXNet, and CNTK. However, managing the infrastructure, configuration, and environments for these frameworks is cumbersome when done on a laptop or on-premises server. The additional overhead of managing infrastructure takes time away from core data science activities.

Much of that overhead goes away in the software-as-a-service model. The cloud’s usage-based pricing model works well for machine learning workloads, which are bursty in nature. The cloud also makes it easier to explore different machine learning frameworks, with cloud vendors offering model hosting and deployment options. In addition, cloud service providers including Amazon Web Services, Microsoft Azure, and Google Cloud offer intelligent capabilities as services. This lowers the barriers to integrating these capabilities into new products or applications.
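To show what “intelligent capabilities as services” looks like in practice, here is a minimal sketch calling AWS Comprehend through boto3; the region and the surrounding credential setup are assumptions, and Azure and Google Cloud expose comparable text-analysis APIs.

```python
import boto3

# Sentiment analysis as a service: one API call, no model training or
# hosting on our side. Assumes AWS credentials are already configured
# in the environment (e.g., via the AWS CLI).
comprehend = boto3.client("comprehend", region_name="us-east-1")

resp = comprehend.detect_sentiment(
    Text="The new dashboard is fantastic, but login is painfully slow.",
    LanguageCode="en",
)
print(resp["Sentiment"], resp["SentimentScore"])
```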

A central repository improves data accuracy and model auditability
A machine learning model’s predictions are only as accurate and representative as the data used to train it. Every modern manifestation of AI and machine learning is made possible by the availability of high-quality data. For instance, apps that provide turn-by-turn directions have been around for decades, but they’re much more accurate today thanks to the far larger volumes of data now available.

It is no surprise, then, that a significant part of AI and machine learning operations revolves around data logistics: collecting, labeling, categorizing, and managing the data sets that reflect the real world we are trying to model. Data logistics is already complicated for an enterprise with many data users; the problem only gets worse when multiple local copies of the data set are scattered among those users.

Further, security and privacy concerns are increasingly taking center stage. Enterprise data processes must comply with data privacy and security regulations. A centralized repository for all data sets not only simplifies the management and governance of data but also ensures data consistency and model auditability.

Faster data science is better for business
All of the above reasons contribute to delayed time to value with laptop-based data science. In a typical workflow, a data scientist working on a laptop or on-prem server first samples the data and downloads data sets manually onto the local system, or connects to a database via an ODBC driver (as in the sketch below). The second step is to install all of the required software tools and packages such as RStudio, Jupyter Notebook, the Anaconda distribution, machine learning libraries, and specific versions of languages such as R, Python, and Java.
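Here is a minimal sketch of that first step; the DSN, credentials, table name, and sample size are hypothetical, and the sampling syntax shown is SQL Server’s.

```python
import pandas as pd
import pyodbc

# Connect through a locally configured ODBC data source and pull a
# sample down to the laptop. Connection details are placeholders.
conn = pyodbc.connect("DSN=warehouse;UID=analyst;PWD=example")

# Sampling syntax varies by database; TABLESAMPLE is one common form.
sample = pd.read_sql("SELECT * FROM transactions TABLESAMPLE (5 PERCENT)", conn)
conn.close()
```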

When the model is ready to be deployed to production, the data scientist hands it off to a machine learning engineer. The machine learning engineer must then either convert the code to a production language—such as Java, Scala, or C++—or at least optimize the code and integrate with the rest of the application. Code optimization would consist of rewriting any data query into an ETL job, profiling the code to find any bottlenecks, and adding logging, fault-tolerance, and other production-level capabilities.
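As a rough illustration of that hardening step, here is a sketch of the same scoring call a notebook might make, wrapped with logging and simple retry-based fault tolerance; model and features are stand-ins for the real artifacts.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scoring")

def score_with_retries(model, features, attempts=3, backoff_s=1.0):
    """Score a batch, logging failures and retrying with linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            preds = model.predict(features)
            log.info("scored %d rows on attempt %d", len(preds), attempt)
            return preds
        except Exception:
            log.exception("scoring failed (attempt %d/%d)", attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)  # wait longer after each failure
```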

Each of these steps presents a bottleneck that can result in delays. For instance, inconsistencies in software or package versions between development and production environments can cause deployment issues. And code built in a Windows or Mac environment can easily break when deployed to Linux.
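The operating-system mismatch is often as mundane as a file path; a hypothetical two-line example:

```python
from pathlib import Path

# Works on the data scientist's Windows laptop, fails on a Linux server.
train_path = "C:\\Users\\alice\\data\\train.csv"

# Portable version: build the path relative to the script and let
# pathlib handle the separators on any OS.
train_path = Path(__file__).resolve().parent / "data" / "train.csv"
```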

All of the above issues with running data science on laptops result in loss of business value. Data science involves resource-intensive tasks in data preparation, model building, and model validation. Data scientists will typically iterate hundreds of times—trying different features, algorithms, and model specifications—before they find the right model for the business problem they are trying to address. These iterations can take a significant amount of time. Imposing bottlenecks around infrastructure and environment management, deployment, and collaboration can further delay time-to-value for enterprises.

Data scientists who rely on laptops or local servers are making an unwise trade-off between ease of getting started and ease of scaling and productionizing machine learning models. While working on a laptop or a local server gets the data science team up and running faster, cloud platforms provide greater long-term advantages including virtually unlimited compute and storage, easier collaboration, easier infrastructure management and data governance, and most importantly, faster time to production.

The fastest and most cost-effective way to get started with data science and machine learning in the cloud is to use a cloud-based data science and machine learning platform. Laptops, at least for this use case, have a limited future.

Piero Cinquegrana is data science senior product manager at Qubole, a cloud-based big data platform, where his responsibilities include optimizing the platform for data scientists and managing Qubole’s deep learning cluster for GPU acceleration and distributed training.