A machine learning solution can be broadly divided into three parts. A typical ML exercise involves experimenting with and iterating on all three parts together, or on one of them at a time, before arriving at a solution.
1. Pre-Processing: Preparation of the data for modeling. You are the best judge of what needs to be done, but here are some considerations:
- What is my expected output, and what are the nature and size of the data at my disposal? Is the output binary (0 or 1), a cluster assignment, or a probability? This dictates the choice of algorithm and method (supervised or unsupervised), and thereby the data preparation for that particular algorithm.
- Which variables are independent (the features) and which is the dependent variable (the target)?
- Are the variables categorical or continuous? Can I transform a variable or represent it in a different way (encoding it, for example) to make it more palatable to my needs?
- The training dataset and testing dataset split.
- Class imbalance: check whether the target classes appear in comparable proportions.
- Data Cleaning – removal of NaNs, duplicates, erroneous entries and the like (a combined sketch of these preparation steps follows this list).
- Take into consideration stemming, lemmatization, stop-word removal, similarity measures (cosine similarity, edit distance, etc.), L1 and L2 regularization, and normalization, among other such techniques (a cosine-similarity sketch follows this list).
- Feature selection and dimensionality reduction – PCA (Principal Component Analysis), SVD, or other statistical/mathematical measures (a PCA sketch follows this list). Academic papers (Google Scholar is a good resource), research reports, past experiments, and solid human domain knowledge all help exclude false indicators.
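To make the preparation steps above concrete, here is a minimal sketch using pandas and scikit-learn. The file name (`data.csv`) and the column names (`label` as a binary target, `city` as a categorical feature) are hypothetical, chosen purely for illustration.

```python
# A minimal pre-processing sketch; "data.csv", "label" and "city" are
# hypothetical names used only for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")

# Data cleaning: drop duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# Encode the categorical variable as one-hot (dummy) columns.
df = pd.get_dummies(df, columns=["city"])

# Check for class imbalance before deciding on resampling or class weights.
print(df["label"].value_counts(normalize=True))

# Train/test split, stratified so both sets keep the class proportions.
X = df.drop(columns=["label"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```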
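For the text-similarity point above, here is a small sketch of cosine similarity over TF-IDF vectors with English stop-words removed; the example documents are made up.

```python
# Cosine similarity between documents represented as TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices rallied sharply today",
]

# stop_words="english" drops common function words before vectorizing.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity: values near 1 indicate very similar documents.
print(cosine_similarity(tfidf))
```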
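And a minimal PCA sketch, reusing the hypothetical `X_train` from the first sketch. Standardizing first matters because PCA is sensitive to feature scale; keeping components that explain 95% of the variance is one common convention, not a rule.

```python
# Dimensionality reduction with PCA on the standardized feature matrix.
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_train_reduced = pca.fit_transform(X_train)

# explained_variance_ratio_ shows how much variance each component retains.
print(pca.named_steps["pca"].explained_variance_ratio_)
```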
2. Modeling:
- The choice of algorithm is important. What is my desired outcome, and how do I want to interpret it? What are the nature, size and class of the data at my disposal? The analysis done in the pre-processing stage is also relevant to modeling. These two questions should help you narrow down the search.
- Do look at academic papers (Google Scholar), business reports, or other examples from colleagues, online sources, Kaggle, etc., and find similar work done before. Pay careful attention to the nature of their data, their output, the kind of data cleaning they did, the size of the data, the choice of algorithm, the evaluation criteria, the results, and the interpretation of those results. If there is a close match, or an analogous one (the same type of data from a different source, for example), it might make sense to replicate the algorithm and the entire methodology, or to adapt the algorithm to your methodology.
- Overfitting, underfitting and the different types of cross-validation (see the sketch after this list).
- Training, testing and development (validation) split.
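As a concrete illustration of cross-validation, here is a minimal sketch using scikit-learn's stratified k-fold, assuming the `X_train`/`y_train` split from the pre-processing sketch; the logistic-regression model and F1 scoring are placeholder choices. A large gap between training and validation scores is a quick sign of overfitting.

```python
# 5-fold stratified cross-validation with a placeholder model and metric.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Mean and spread of the validation scores across the five folds.
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```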
3. Evaluation:
- The choice of evaluation metric is as important as the score on that particular metric. Each algorithm/method typically has some evaluation metrics closely associated with it, mostly through practice. One school of thought argues that the area under the precision/recall curve is best used when there is class imbalance, and that the area under the ROC curve works better when one needs to capture the relevance of positives and negatives equally. There are different schools of thought; you will have your own. There are other evaluation metrics besides these two – I picked them as examples because they are widely used (a short sketch comparing the two follows this list).
- The question is: what are we trying to evaluate, and what are we trying to interpret? Which metric will help me evaluate it with the least bias?
- Record and measure results, then iterate with different algorithm choices and/or different pre-processing tweaks.
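To make the comparison concrete, here is a minimal sketch computing both metrics with scikit-learn, assuming the fitted classifier and the held-out `X_test`/`y_test` from the earlier sketches. Average precision summarizes the precision/recall curve.

```python
# Comparing area under the ROC curve with area under the P/R curve.
from sklearn.metrics import average_precision_score, roc_auc_score

model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# ROC AUC weighs positives and negatives symmetrically; average precision
# focuses on the positive class and is more informative under imbalance.
print("ROC AUC:", roc_auc_score(y_test, proba))
print("Average precision:", average_precision_score(y_test, proba))
```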
Rules of thumb:
- From an ML standpoint, it is not just important to score high on evaluation metrics but also to ensure there are no biases in the choice of evaluation metric, dataset formation, class examples, overfitting, etc. Only then can the system scale and provide consistent results. Eventually, with more data, more experimentation and more learning, the system will score high, as long as the experimentation/modeling framework is set up the right way.
- The training dataset needs to be a good representation of the testing dataset that will eventually be used.
- Recording your choices and assumptions will help in the long run.
- Like all experiments, one only learns from implementation, from recording and measuring results, and from feedback-based iteration.