Enterprise applications are increasingly adopting machine learning as a strategic capability, and running deep machine learning analytics across multiple problem statements is becoming common. A variety of machine learning solutions, packages, and platforms exist in the market. One of the first challenges teams face is choosing the right platform or package for their solution.
Based on my limited experience with different machine learning solutions, I wrote this blog to list the points (features, in machine learning terms) to consider when choosing an ML platform, and to list the pros and cons of each solution in the market.
Let’s look at the feature set to be weighed before deciding on an ML solution.
| High Level Feature | Feature Set | Comments |
| --- | --- | --- |
| Data Storage | High Storage Volume Need | Ability to store huge volumes of data and meet ever-growing storage needs |
| | High Availability | High availability of data during partial failures |
| Data Exploration | Visualizing summary tables and patterns in input data | Ability to find patterns in input data |
| Data Preparation / Cleansing | Feature Extraction | Manipulating the raw data to extract the features needed for algorithm execution |
| | Distributed Execution | Ability to perform data manipulation in a distributed way; required when you have a huge volume of data and need to reduce the time to complete |
| Development | Supported Languages | Scripting language support for development |
| | Ease of Development | How easy is it to develop and execute scripts on the platform? |
| | General Purpose Programming | Whether the platform can also cover general-purpose programming needs |
| Model | Algorithms Supported | Availability of different algorithm implementation packages on the platform |
| | Distributed Execution in Model Creation | Model creation is a time-consuming operation, so the ability to create the model in a distributed way saves a lot of time |
| | Deep Learning Support | Support for deep learning algorithms |
| | GPU Support | GPU execution support |
| | Flexibility to Tune Model | Level at which the model parameters can be tuned |
| | Model Examination Flexibility | Ability to examine the model, which helps to deep dive into what is happening behind it |
| | Ease in Switching between Models | Ability to switch between different models to find a suitable choice |
| Data Visualization | Visualize and Plot the Results | Availability of different charts to visualize the output |
| Productionizing | Ease of deploying the model for production use in a web environment; running in large-scale deployments | Ability to deploy the model on the web and scale to handle huge volumes of data |
| Support | Official / Community Support with Active Development | Availability of commercial support for the platform / solution; active community development |
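One way to put the feature table above to work is a simple weighted scoring matrix. The sketch below is only an illustration: the feature keys mirror the high-level features listed above, while the weights, the candidate names `platform_a` and `platform_b`, and every score are hypothetical placeholders that a team would replace with its own assessment.

```python
# Hypothetical importance weights (1 = nice to have, 5 = critical)
# for some of the high-level features from the table above.
WEIGHTS = {
    "data_storage": 3,
    "data_preparation": 4,
    "development": 3,
    "model": 5,
    "productionizing": 4,
    "support": 2,
}

# Hypothetical 0-5 scores per candidate; placeholders, not measurements.
SCORES = {
    "platform_a": {
        "data_storage": 2, "data_preparation": 3, "development": 5,
        "model": 5, "productionizing": 2, "support": 4,
    },
    "platform_b": {
        "data_storage": 5, "data_preparation": 5, "development": 3,
        "model": 3, "productionizing": 4, "support": 3,
    },
}


def weighted_score(scores, weights):
    """Return the weight-normalised score (same 0-5 scale) for one platform."""
    total_weight = sum(weights.values())
    # A feature missing from `scores` counts as 0.
    return sum(w * scores.get(feature, 0) for feature, w in weights.items()) / total_weight


def rank_platforms(all_scores, weights):
    """Return (platform, score) pairs sorted from best to worst."""
    ranked = [(name, weighted_score(s, weights)) for name, s in all_scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_platforms(SCORES, WEIGHTS):
        print(f"{name}: {score:.2f}")
```

Changing the weights changes the ranking, which is the point: a team that cares most about model tuning will land on a different choice than a team that cares most about productionizing.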
Now let’s look at the different machine learning solutions and platforms available in the market and where they stand with respect to these feature requirements.
| Solution | Language | Pros | Cons |
| --- | --- | --- | --- |
| RStudio | R | Thousands of packages for different solutions; easy to develop; deep model examination and tuning | Time-consuming execution due to its single-threaded nature; not easy to productionize for a web environment |
| | Scala, Python, R | Scalable machine learning library; distributed execution using platforms like YARN, Mesos, etc.; faster execution; supports multiple languages like Scala, Python, R | New to the market; does not have an exhaustive list of algorithm implementations; requires knowledge of the Hadoop ecosystem |
| H2O | Scala, Python, R | Easy integration with platforms like Spark (through Sparkling Water) and R; connects to data from HDFS, S3, NoSQL databases, etc. | Compatibility issues between H2O and Spark with Sparkling Water; no support for Scala in H2O notebooks |
| | Python, C++ | Flexible architecture that can be deployed to run on CPU / GPU; effective utilization of the underlying hardware; stronger in deep learning implementations | Comparatively steeper learning curve; generally meant for neural-network-based implementations |
| Matlab | Matlab | Advanced toolbox with a wide variety of algorithm implementations; algorithms can be packaged as Java or .NET components for deployment | Requires learning the Matlab language; expensive product |
| Anaconda | Python | Good collection of algorithm implementations; easy to learn and develop; integration with PySpark; good for local usage and trials | Enterprise license cost; advanced features are licensed and expensive |
| Turi | Python | SFrame concept aims for distributed machine learning execution; can read and process data from HDFS, S3, etc.; simplified machine learning execution | Commercially licensed product |
| IBM Watson | PaaS for ML | PaaS platform for machine learning on IBM Bluemix; easy integration with social and cloud services; end-to-end solution development with limited knowledge; easy to deploy | Limited control in model creation and tuning; limited control over the underlying infrastructure |
| Azure ML | PaaS for ML | PaaS platform for machine learning on Microsoft Azure; workflow-based ML solution on Azure; easy to develop ML solutions on the Azure cloud | Limited control in model creation and tuning; limited control over the underlying infrastructure |
| AWS ML | PaaS for ML | PaaS platform for machine learning on AWS; easy to develop ML solutions on the AWS cloud | Limited control in model creation and tuning; limited control over the underlying infrastructure |
Packaged machine learning solutions such as RStudio, H2O, Anaconda, and Turi are improving their support for accessing and storing data on distributed storage, and are adding capabilities for distributed multi-thread, multi-core, and multi-node execution of time-consuming tasks such as data preparation, feature extraction, and model creation.
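To make the idea of splitting a time-consuming task across workers concrete, here is a minimal sketch using only the Python standard library. It is not how any of the products above implement distribution. The `extract_features` function and its two toy features are hypothetical, and a thread pool is used only to keep the example portable; CPU-bound work in CPython would normally need a process pool or a cluster framework to see a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_features(record):
    """Hypothetical feature extraction: word and character counts of a text record."""
    return {"n_words": len(record.split()), "n_chars": len(record)}


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_chunk(chunk):
    """Extract features for every record in one chunk."""
    return [extract_features(record) for record in chunk]


def parallel_extract(records, chunk_size=2, workers=4):
    """Split `records` into chunks, process the chunks concurrently, and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the input chunks.
        chunk_results = pool.map(process_chunk, chunked(records, chunk_size))
        return [row for chunk in chunk_results for row in chunk]


if __name__ == "__main__":
    sample = ["a b c", "hello world", "x"]
    print(parallel_extract(sample))
```

The same split, process, and merge shape is what distributed platforms apply across cores and nodes, with the added work of moving data and handling failures.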
Machine learning PaaS solutions such as IBM Watson, Azure ML, and AWS ML benefit from their cloud background: they abstract away the overhead of packaging and aim for easy deployment and scalability. These PaaS solutions offer limited fine-tuning of models and algorithms, and are trying to improve in that space.