Machine learning is a diverse field that covers a wide territory and has impacted many verticals. It can tackle tasks in language and image processing, anomaly detection, credit scoring, sentiment analysis, and forecasting, alongside dozens of other downstream tasks. A proficient developer in this line of work has to be able to draw, borrow, and steal from many adjacent fields such as mathematics, statistics, programming, and, most importantly, common sense. I for one have drawn tremendous benefits from the myriad of tools available to break down complex tasks into smaller, more manageable components. It turns out that developing and training a model takes only a small fraction of a project's duration; the bulk of the time and resources is spent on data acquisition, preparation, hyperparameter tuning, optimization, and model deployment. I have been able to build a systematic knowledge base that has helped my team tackle some common yet tough challenges. The following is an attempt to identify some of them:
- Building an efficient and reliable end-to-end deep learning pipeline can be very challenging. Fortunately, there is a myriad of ‘workflow management’ tools that can dramatically lessen the difficulty of this task: Jenkins, Airflow, and Kubeflow, to name a few. While each one has strengths and weaknesses, my favorite is Airflow. There are many online tutorials for Airflow, but my favorite is the video series on YouTube by Tuan Vu.
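To give a flavor of what an Airflow pipeline looks like, here is a minimal sketch of a two-task DAG. The task names and bodies are placeholders, and the import path assumes Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def extract():
    # Pull raw data from your source of choice (placeholder).
    ...


def train():
    # Fit the model on the prepared data (placeholder).
    ...


# Each task becomes a node in the DAG; Airflow handles scheduling and retries.
with DAG(
    dag_id="ml_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)

    extract_task >> train_task  # train runs only after extract succeeds
```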
- It does not take much for any serious ML practitioner to recognize the importance of feature engineering. Time spent on analyzing and transforming feature columns in a dataset will produce dramatic improvements in the outcome. While feature engineering is critical, it can be complex and time-consuming. I am absolutely impressed by the capabilities of a package called Automunge for feature engineering and transformation. The tool can tackle complex numerical and categorical transformations, custom infills, ‘feature importance analysis’, oversampling, and much more. Truly remarkable.
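Automunge automates much of this work. As a point of reference, here is the kind of numerical/categorical preprocessing you would otherwise wire up by hand; this sketch uses scikit-learn rather than Automunge itself, and the column names are made up:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns: 'age' and 'income' are numeric, 'segment' is categorical.
numeric_features = ["age", "income"]
categorical_features = ["segment"]

# Impute missing values, then scale numerics and one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

X_train = pd.DataFrame({
    "age": [25, 40, None],
    "income": [50_000, 82_000, 61_000],
    "segment": ["a", "b", "a"],
})
X_transformed = preprocess.fit_transform(X_train)
```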
- If you are a heavy user of Jupyter notebooks and want to scale your reliance on them, check out papermill. It lets you parameterize notebooks and execute them through a Python API as well as a CLI. papermill can also store executed notebooks in a number of locations, including AWS S3, Azure Blob Storage, and Azure Data Lake Storage. Last but not least, papermill supports powerful features needed to write unit tests.
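As a quick illustration, parameterizing and executing a notebook with papermill takes a single call; the notebook names, bucket, and parameters below are hypothetical:

```python
import papermill as pm

# Execute a notebook with injected parameters; the executed copy (including
# cell outputs) can be written directly to S3.
pm.execute_notebook(
    "train_model.ipynb",                            # input notebook (hypothetical)
    "s3://my-bucket/runs/train_model_lr001.ipynb",  # output location (hypothetical)
    parameters={"learning_rate": 0.01, "epochs": 20},
)
```

The same run can be kicked off from the CLI, e.g. `papermill train_model.ipynb out.ipynb -p learning_rate 0.01 -p epochs 20`.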
- Although I have not used it myself, I have heard rave reviews about Deequ. Deequ can be viewed as a tool for testing large datasets. It is an open-source tool developed by Amazon that generates data quality metrics for large datasets destined for production, in accordance with the quality constraints set by the user. Effective use of this tool eliminates the need to hand-write code for these checks and balances. The tool is implemented on Apache Spark and is designed to scale to large datasets.
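Based on the documentation, a verification run looks roughly like the sketch below. It uses PyDeequ (the Python wrapper), assumes an existing SparkSession `spark` and DataFrame `df`, and the column names and thresholds are hypothetical; treat it as an approximation rather than the authoritative API:

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

# 'spark' is an existing SparkSession and 'df' a Spark DataFrame (assumed).
check = (
    Check(spark, CheckLevel.Error, "production data quality")
    .hasSize(lambda rows: rows >= 1_000_000)  # expect at least 1M rows
    .isComplete("customer_id")                # no nulls allowed
    .isUnique("customer_id")                  # primary-key style constraint
    .isNonNegative("order_total")             # sanity check on amounts
)

result = VerificationSuite(spark).onData(df).addCheck(check).run()
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```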
- If you are involved in language processing, you will find the site “The Super Duper NLP Repo” invaluable. If you can’t find something there, the chances are that you don’t need it.
- I have used Docker for a long time and frankly I can’t imagine how things were done before its availability. I recently went through the following videos (registration required) and learned quite a bit, especially when it comes to nuances:
  - How to Get Started with Docker
  - Simplify All the Things with Docker Compose
  - Hands on Helm
  - Build and Deploy Multi Container Applications to AWS
- They say pandas are powerful, proven, fast, and user-friendly. I am in agreement with the first three. That should explain why I keep the following cheat sheets handy at all times.
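For what it is worth, the operations that send me back to those cheat sheets most often are groupby aggregations and pivots; a toy example with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "product": ["a", "a", "b", "b"],
    "sales": [100, 250, 75, 310],
})

# Aggregate sales per region, then reshape into a region-by-product table.
per_region = df.groupby("region")["sales"].agg(["sum", "mean"])
wide = df.pivot_table(index="region", columns="product", values="sales", aggfunc="sum")
```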
- Like feature engineering, hyperparameter tuning is a critical and resource-intensive phase in the development of a machine learning pipeline. There are many approaches to hyperparameter optimization, such as grid, random, manual, and automated search (using Bayesian optimization). If you choose to pursue the automated path, I recommend evaluating Ax (Adaptive Experimentation Platform). This package was developed by Facebook and is quite mature and easy to use.
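Ax's managed loop makes the automated path fairly painless. The sketch below follows the pattern from Ax's tutorials; `run_training` is a hypothetical stand-in for your own training routine:

```python
from ax.service.managed_loop import optimize


def train_evaluate(params):
    # Train with the proposed hyperparameters and report the metric to optimize.
    accuracy = run_training(lr=params["lr"], momentum=params["momentum"])  # hypothetical
    return {"accuracy": (accuracy, 0.0)}  # (mean, standard error)


best_parameters, values, experiment, model = optimize(
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-5, 1e-1], "log_scale": True},
        {"name": "momentum", "type": "range", "bounds": [0.0, 1.0]},
    ],
    evaluation_function=train_evaluate,
    objective_name="accuracy",
    total_trials=20,
)
```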
- If your deep learning model works the very first time, be assured that you are doing something wrong. I have found the following list useful should you be looking for ways to troubleshoot your model:
  - 37 Reasons why your Neural Network is not working
- I have been an early user of TensorFlow and have not shied away from saying that TensorFlow is “Google’s Revenge on Humanity”. I am pleased to report that my view of this framework changed completely (for the better) once I started using version 2.0. Very, very powerful and much easier to use.
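To illustrate what I mean by easier to use, the canonical TF 2.x quickstart (MNIST with tf.keras) fits in a dozen lines:

```python
import tensorflow as tf

# Load and scale MNIST, define a small network, then train and evaluate.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
```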
- If you have had difficulty finding an appropriate dataset for running experiments, I recommend checking out Google’s Dataset Search engine.
- Developing, training, and tuning a machine learning model are just the beginning. A rigorous testing regime is required before models can be deployed in the real world.
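The sketch below shows the flavor of checks I have in mind, written as plain assertions you might run in CI before promoting a model; `model`, the validation arrays, and the thresholds are all hypothetical:

```python
import numpy as np


def check_accuracy_above_baseline(model, X_val, y_val, baseline=0.85):
    # Guard against silent regressions: refuse to promote a weaker model.
    accuracy = (model.predict(X_val) == y_val).mean()
    assert accuracy >= baseline, f"validation accuracy dropped to {accuracy:.3f}"


def check_invariance_to_irrelevant_feature(model, X_val, column=-1):
    # Predictions should not change when a feature the model must ignore is zeroed.
    perturbed = X_val.copy()
    perturbed[:, column] = 0
    assert np.array_equal(model.predict(X_val), model.predict(perturbed))
```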
- If you are new to the field of language processing, you will find tremendous value in learning about open-source tools such as spaCy, NLTK, Flair, and StanfordNLP.
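As a taste of how little code these libraries require, here is spaCy doing tokenization, part-of-speech tagging, and named entity recognition in a few lines (the small English model must be downloaded first):

```python
import spacy

# Install the model once with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc[:5]:
    print(token.text, token.pos_, token.dep_)   # token, part of speech, dependency

for ent in doc.ents:
    print(ent.text, ent.label_)                 # named entities, e.g. Apple -> ORG
```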
- I would like to extend a big shout-out to Chris Fregly. Chris was a principal at PipelineAI and recently joined the AWS machine learning team. I have never met anyone with such a commanding knowledge of the building blocks needed to build an end-to-end AI pipeline. He is intimately familiar with a plethora of tools and frameworks and presents his technical knowledge with infectious enthusiasm. He hosts a free monthly workshop that is super informative.
- Call it bad karma or bad luck, but I have rarely had the pleasure of working with a nice, symmetric, balanced dataset. I have found the following resources beneficial when I had to contend with an imbalanced dataset (a short sketch follows this list):
  - Step-By-Step Framework for Imbalanced Classification Projects
  - Understanding Algorithms for Imbalanced Classification
  - The Impact of Imbalanced Training Data for Convolutional Neural Networks
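For the record, the quickest remedy I reach for is oversampling the minority class with SMOTE from imbalanced-learn; the dataset below is synthetic and the 95/5 split is just for illustration:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Build a synthetic 95/5 imbalanced dataset, then synthesize minority samples.
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_resampled))
```

Class weights in the loss function are a reasonable alternative when oversampling is too expensive.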
- Weight normalization is a crucial technique when it comes to training deep networks (a minimal sketch follows this list). It has many benefits, such as:
  - Enables faster training by allowing higher learning rates
  - Has a regularization effect
  - Eases weight initialization
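Here is a minimal sketch using PyTorch's built-in wrapper; the layer sizes are arbitrary, and TensorFlow users can find a comparable layer wrapper in TensorFlow Addons:

```python
import torch
import torch.nn as nn

# Wrap a layer with weight normalization: the weight tensor is reparameterized
# into a direction ('weight_v') and a magnitude ('weight_g'), decoupling the
# norm of the weights from their direction during optimization.
layer = nn.utils.weight_norm(nn.Linear(in_features=64, out_features=32))

x = torch.randn(8, 64)
out = layer(x)

print(sorted(name for name, _ in layer.named_parameters()))
# ['bias', 'weight_g', 'weight_v']
```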
- If you are like me and spend more than you would like on cloud-based GPUs, check out Google’s Colab Pro. For less than $10/mo. you get access to GPUs/TPUs and a decent amount of memory to help with your prototyping.
- If ‘model explainability’ is a paramount requirement for your project and you are considering using SHAP or LIME, you will find the following summaries beneficial (a quick SHAP sketch follows this list):
  - SHAP and LIME Python Libraries: Part 1 – Great Explainers, with Pros and Cons to Both
  - SHAP and LIME Python Libraries: Part 2 – Using SHAP and LIME
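As a quick SHAP sketch, the tree explainer pairs naturally with gradient-boosted models; the dataset and model below are just convenient stand-ins:

```python
import shap
import xgboost
from sklearn.datasets import load_breast_cancer

# Fit a simple tree model, then explain its predictions with SHAP values.
data = load_breast_cancer()
model = xgboost.XGBClassifier().fit(data.data, data.target)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)

# Global view: which features drive predictions across the whole dataset.
shap.summary_plot(shap_values, data.data, feature_names=data.feature_names)
```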