This article was written by Matthew Mayo.
Scikit-learn is the de facto standard machine learning library in the Python ecosystem. As described on its official website, Scikit-learn is:
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable – BSD license
This tutorial demonstrates several machine learning classifiers. It is inspired by, references, or incorporates techniques from the following excellent works:
- Randal Olson’s An Example Machine Learning Notebook
- Analytics Vidhya’s Common Machine Learning Algorithms Cheat Sheet
- Scikit-learn’s official Cross-validation Documentation
- Scikit-learn’s official Iris Dataset Documentation
- Likely also the various tutorials referenced in this KDnuggets Python Machine Learning article I recently wrote
We will use the well-known Iris and Digits datasets to build models with the following machine learning classification algorithms:
- Logistic Regression
- Decision Tree
- Support Vector Machine
- Naive Bayes
- k-nearest Neighbors
- Random Forests
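As a rough sketch of what building these six models can look like in Scikit-learn, the snippet below fits one estimator per algorithm on the Iris data. The specific estimator classes (for example `GaussianNB` for Naive Bayes) and hyperparameters such as `max_iter=1000` and `n_neighbors=5` are assumptions chosen for illustration, not necessarily the settings used in the tutorial itself.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Iris: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# One estimator per algorithm in the list above.
# Hyperparameters are illustrative assumptions.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "k-nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

fitted = {}
for name, clf in classifiers.items():
    # Every Scikit-learn classifier shares the same fit/predict/score interface.
    fitted[name] = clf.fit(X, y)
    # Accuracy on the training data itself: an optimistic figure,
    # not an estimate of performance on unseen data.
    print(f"{name}: training accuracy {clf.score(X, y):.3f}")
```

Because all estimators expose the same interface, swapping one algorithm for another only changes the constructor call.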
We also use different strategies for evaluating models:
- Separate testing and training datasets
- k-fold Cross-validation
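The two evaluation strategies can be sketched side by side on the Digits data. This is a minimal example under stated assumptions: the k-nearest neighbors classifier, the 25% test split, the stratified split, and `cv=10` are all illustrative choices rather than the tutorial's exact configuration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Digits: 1797 samples of 8x8 images flattened to 64 features, 10 classes.
X, y = load_digits(return_X_y=True)

# Strategy 1: separate training and testing datasets.
# Hold out 25% of the samples and never show them to the model during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
holdout_accuracy = clf.score(X_test, y_test)
print(f"Hold-out accuracy: {holdout_accuracy:.3f}")

# Strategy 2: k-fold cross-validation (k = 10 here).
# The data is split into 10 folds; each fold serves once as the test set
# while a fresh copy of the model is trained on the other 9.
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)
print(f"10-fold accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

A single hold-out split is cheaper, while cross-validation gives one score per fold and so also shows how much the estimate varies with the choice of split.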
Some simple data investigation methods and tools will be used as well, including:
- Plotting data with Matplotlib
- Building and inspecting data via Pandas dataframes
- Constructing and operating on multi-dimensional arrays and matrices with NumPy
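For a feel of how these three tools fit together, here is a small sketch on the Iris data. The column name `species`, the choice of petal length and width for the scatter plot, and the standardization step are assumptions made for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# NumPy: the feature matrix is a 2-D array of shape (150, 4).
features = np.asarray(iris.data)
col_means = features.mean(axis=0)
# Standardize each column to zero mean and unit variance via broadcasting.
standardized = (features - col_means) / features.std(axis=0)

# Pandas: wrap the array in a dataframe with named columns and a label column.
df = pd.DataFrame(features, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]
print(df.describe())                  # summary statistics per feature
print(df["species"].value_counts())   # class balance
n_missing = int(df.isna().sum().sum())
print(f"Missing values: {n_missing}")

# Matplotlib: scatter plot of petal length vs. petal width, one color per class.
fig, ax = plt.subplots()
for code, name in enumerate(iris.target_names):
    subset = features[iris.target == code]
    ax.scatter(subset[:, 2], subset[:, 3], label=name)
ax.set_xlabel(iris.feature_names[2])
ax.set_ylabel(iris.feature_names[3])
ax.legend()
```

In an interactive session, `plt.show()` displays the figure; `fig.savefig(...)` writes it to disk instead.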
This tutorial is brief and to the point. Please alert me if you find inaccuracies. If you find it useful, please feel free to share it far and wide.
To read the tutorial, with the demonstration, click here.