How do you evaluate and select the right Data Science and Machine Learning platform?

Automated feature engineering and AI-powered data preparation are the key differentiators for a code-free or code-first approach to data science.

Sachin Andhare
4 min read · Mar 19, 2021

Innovation, data, and analytics leaders looking for the best data science and machine learning platform have a hard nut to crack! Selecting a data science and machine learning (DSML) platform can be jarring, given how fragmented the market is and how every vendor claims to be the ideal enterprise AI platform. The challenge is even more complex for organizations that are new to machine learning or that come from a traditional BI background without predictive analytics experience. The same goes for application developers and software architects searching for cloud AI services to leverage AI and ML through APIs. Which technical features do they need to consider? Which platform capabilities are most important? Gartner recently published its Magic Quadrant report for DSML platforms, evaluating over 20 platform vendors, from AWS SageMaker and Microsoft Azure ML to H2O. It’s great to see dotData mentioned in the report. In case you don’t have access to Gartner reports or are pressed for time, here are a couple of things that can help you narrow down your list:

Who will use and benefit from the DSML platform?

Before starting a data science project, the stakeholders should brainstorm to identify relevant use cases, develop requirements, and prioritize the impact and value to the business. The process is heavily dependent on the available resources, the data architecture of the company, and the skillset of the intended users. To make the best possible choice, AI and business leaders should seek answers to these fundamental questions:

  1. Who will be the primary user of the ML platform? The Data Science team, application developers, or the BI and analytics team?
  2. What is the skill level and data science expertise of the primary users? Are they expert data scientists with several years of experience, or are they just starting out?
  3. Which programming language is most used and preferred by the intended users — Python, Scala, R, or something else?

The rationale for selecting a particular DSML platform will depend on the audience. If the intended users are experienced data scientists whose primary environment is Python, you need a platform that offers a significant amount of customization and flexibility. Experienced data scientists generally prefer to build, test, and tweak models manually. Even so, they will have an affinity for a platform that automatically discovers and generates new features, helping them build accurate models faster and explore a broader feature space.

Code-Free or Code-First: What degree of automation will accelerate the data science workflow?

Another important criterion is the choice between a no-code (or low-code) and a code-first approach to data science. Traditional (code-first) DSML platforms require data science teams to generate features manually, a very time-consuming process that demands a lot of domain knowledge. Once the features are built, AutoML platforms can accelerate the work by selecting the algorithms and building ML models automatically. As an analytics and data science leader, you need to decide how much of this process to automate.
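To make "generating features manually" concrete, here is a minimal sketch of what a data scientist might hand-write in a code-first workflow. The toy `transactions` table, the `build_customer_features` helper, and the three feature names are all hypothetical and chosen only for illustration; real projects typically involve many more tables and candidate features, which is the work automated feature engineering aims to reduce.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: a "transactions" table keyed by customer_id.
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 5.0},
]


def build_customer_features(rows):
    """Aggregate transaction rows into one hand-crafted feature row per customer."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["customer_id"]].append(row["amount"])
    return {
        customer_id: {
            "txn_count": len(values),    # how often the customer buys
            "total_spend": sum(values),  # overall value of the customer
            "avg_spend": mean(values),   # typical basket size
        }
        for customer_id, values in amounts.items()
    }


features = build_customer_features(transactions)
print(features)
```

Each new feature idea (recency, spend per category, joins to other tables) means more code like this, written and validated by hand.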

Data Science workflow and role of automation

On the other hand, a no-code environment means using visual tools and drag-and-drop functionality. The BI and analytics team or less experienced data scientists will prefer an enterprise platform with AutoML 2.0 capabilities such as end-to-end data science automation, including data preparation, automated feature engineering, ML, and one-click model deployment.

Here is a quick rundown of five significant attributes to think through while evaluating the DSML platforms:

  1. Data Ingestion and Preparation: How much manipulation of data must be performed before it is ready for ingestion by the DSML platform? Can you upload data to the platform without having to write additional SQL code?
  2. Feature Engineering Automation: How much manual work is involved in feature engineering? Does the platform support automated feature engineering? Can the AI engine automatically explore all available database entity relationships, then discover and evaluate features based on the available columns and relationships?
  3. Machine Learning: Does the system support automated machine learning with state-of-the-art ML libraries and frameworks such as scikit-learn, XGBoost, LightGBM, TensorFlow, and PyTorch? Can users run an automated hyperparameter search over ML algorithms?
  4. ML Operationalization: How easy is it to deploy ML models in a production environment? Can you monitor models, discover model drift, and quickly retrain models if production data changes over time?
  5. Platform Integration, Ease of Use, and Deployment Flexibility: Can all steps of the data science process be executed seamlessly within a single platform without the need for moving between systems and applications?
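One simple way to turn the five attributes above into a shortlist is a weighted scoring matrix. The sketch below is only an illustration: the weights, the 1-to-5 ratings, and the names "Platform A" and "Platform B" are made-up assumptions, not figures from Gartner, Forrester, or any vendor. Adjust the weights to reflect who will use the platform and what matters most to your team.

```python
# Hypothetical weights for the five attributes (they sum to 1.0 here,
# but the function normalizes them, so any positive weights work).
WEIGHTS = {
    "data_ingestion": 0.20,
    "feature_engineering": 0.30,
    "machine_learning": 0.20,
    "ml_operationalization": 0.20,
    "platform_integration": 0.10,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Return the weighted average of 1-5 ratings across all attributes."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * ratings[name] for name in weights) / total_weight


# Made-up ratings (1 = poor, 5 = excellent) for two fictional candidates.
candidates = {
    "Platform A": {
        "data_ingestion": 4,
        "feature_engineering": 2,
        "machine_learning": 5,
        "ml_operationalization": 3,
        "platform_integration": 4,
    },
    "Platform B": {
        "data_ingestion": 3,
        "feature_engineering": 5,
        "machine_learning": 3,
        "ml_operationalization": 4,
        "platform_integration": 4,
    },
}

# Rank candidates from highest to lowest weighted score.
ranked = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name]),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {weighted_score(candidates[name]):.2f}")
```

A team of experienced data scientists might put more weight on customization, while a BI team might weight ease of use and automation more heavily; the ranking can change accordingly.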

Making the right choice from a crowded field in the DSML platform market can be challenging. Forrester Research has published a report highlighting nine automation-focused machine learning solutions. The report underscored the importance of feature engineering and explainability as critical differentiating factors for leaders in the automated ML space. To learn more about automation-focused machine learning solutions, the Forrester Wave report is a great resource. To learn more about end-to-end data science platforms and AI automation, check out www.dotdata.com.

Sachin Andhare

All things about technology, product & marketing. Head of Product Marketing at dotData.