In this article, I summarize the components of any data science / machine learning / statistical project, as well as the cross-dependencies between these components. This will give you a general idea of what a data science or other analytic project is about.
Components
1. Problem
This is the top, fundamental component. I have listed 24 potential problems in my article 24 uses of statistical modeling. It can be anything: building a market segmentation or a recommendation system, association rule discovery for fraud detection, or simulations to predict extreme events such as floods.
2. Data
Data comes in many shapes: transactional (credit card transactions), real-time, sensor data (IoT), unstructured data (tweets), big data, images or videos, and so on. Typically, raw data needs to be identified or even built and put into databases (NoSQL or traditional), then cleaned and aggregated, guided by EDA (exploratory data analysis). The process can include selecting and defining metrics.
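As a minimal sketch of the cleaning and aggregation step, assume a handful of hypothetical credit card transactions held in memory. The field names, the sample records, and the chosen metric (average amount per customer) are illustrative assumptions, not taken from a real data set.

```python
from collections import defaultdict

# Hypothetical raw transactions; field names and values are illustrative only.
RAW = [
    {"customer": "A", "amount": "12.50"},
    {"customer": "A", "amount": "7.25"},
    {"customer": "B", "amount": ""},        # missing amount
    {"customer": "B", "amount": "30.00"},
    {"customer": None, "amount": "5.00"},   # missing customer id
    {"customer": "C", "amount": "n/a"},     # unparseable amount
]

def clean(records):
    """Drop records with no customer id or with an amount that is not a number."""
    cleaned = []
    for rec in records:
        if not rec.get("customer"):
            continue
        try:
            amount = float(rec["amount"])
        except (TypeError, ValueError):
            continue
        cleaned.append({"customer": rec["customer"], "amount": amount})
    return cleaned

def aggregate(records):
    """Compute count, total, and average amount per customer."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for rec in records:
        entry = totals[rec["customer"]]
        entry["count"] += 1
        entry["total"] += rec["amount"]
    return {
        customer: {**stats, "average": stats["total"] / stats["count"]}
        for customer, stats in totals.items()
    }

summary = aggregate(clean(RAW))
```

In a real project the same two steps would usually run inside a database or a library such as Pandas; the point here is only that cleaning rules and metric definitions are explicit decisions.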
3. Algorithms
Also called techniques. Examples include decision trees, indexation algorithms, Bayesian networks, and support vector machines. A rather big list can be found here.
4. Models
By models, I mean testing algorithms, then selecting, fine-tuning, and combining the best ones using techniques such as model fitting, model blending, data reduction, and feature selection, and assessing the yield of each model over the baseline. It also includes calibrating or normalizing data, imputation techniques for missing data, outlier processing, cross-validation, over-fitting avoidance, robustness testing, boosting, and maintenance. Criteria that make a model desirable include robustness or stability, scalability, simplicity, speed, portability, adaptability (to changes in the data), and accuracy (sometimes measured using R-squared, though I recommend this alternative instead).
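To make cross-validation and yield over the baseline concrete, here is a minimal sketch using only the standard library. It fits a one-variable least-squares line and reports the average out-of-sample R-squared over k folds. The function names, the fold assignment, and the toy data are assumptions chosen for illustration, and the sketch uses plain R-squared rather than the alternative metric mentioned above.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot; 0 means no better than predicting the mean."""
    y_bar = mean(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

def k_fold_r_squared(xs, ys, k=3):
    """Average out-of-sample R-squared; observation i goes to fold i % k."""
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        intercept, slope = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        predictions = [intercept + slope * xs[i] for i in test_idx]
        scores.append(r_squared([ys[i] for i in test_idx], predictions))
    return mean(scores)

# Toy data: a straight line with a small alternating perturbation.
xs = list(range(12))
ys = [2 * x + 1 + (-1) ** x * 0.5 for x in xs]
score = k_fold_r_squared(xs, ys, k=3)
```

Because each fold is scored on observations the line was not fitted on, a score that drops sharply compared with the in-sample fit is a hint of over-fitting.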
5. Programming
There is almost always some code involved, even if you use a black-box solution. Typically, data scientists use Python, R or Java, and SQL. However, I’ve completed some projects that did not involve real coding, but instead relied on machine-to-machine communications via APIs. Automation of code production (and of data science in general) is a hot topic, as evidenced by the publication of articles such as The Automated Statistician, and my own work to design simple, robust black-box solutions.
6. Environments
Some call it packages. It can be anything from a bare Unix box accessed remotely, combined with scripting languages and data science libraries such as Pandas (Python), to something more structured such as Hadoop. Or it can be an integrated database system from Teradata, Pivotal or other vendors, or a package like SPSS, SAS, RapidMiner or MATLAB, or, typically, a combination of these.
7. Presentation
By presentation, I mean presenting the results. Not all data science projects run continuously in the background, for instance to automatically buy stocks or predict the weather. Some are just ad-hoc analyses that need to be presented to decision makers, using Excel, Tableau and other tools. In some cases, the data scientist must work with business analysts to create dashboards, or to design alarm systems, with results from analysis e-mailed to selected people based on priority rules.
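The priority rules behind such an alarm system can be sketched in a few lines. Everything here is hypothetical: the severity score, the thresholds, and the e-mail addresses are placeholders, and actually sending the message is left out.

```python
# Hypothetical priority rules: (minimum severity score, recipients).
RULES = [
    (0.9, ["cro@example.com", "analyst@example.com"]),
    (0.5, ["analyst@example.com"]),
]

def recipients_for(score, rules=RULES):
    """Return the recipients of the highest-threshold rule the score meets."""
    for threshold, recipients in sorted(rules, key=lambda rule: rule[0], reverse=True):
        if score >= threshold:
            return recipients
    return []  # below every threshold: nobody is alerted
```

For instance, a score of 0.95 would reach both addresses, a score of 0.6 only the analyst, and a score of 0.1 nobody.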
Cross-Dependencies
These components interact as follows. I invite you to create a nice graph from the dependencies table below. The first relationship reads as “the problem impacts or dictates the data”.
Problem -> Data
Problem -> Algorithms
Algorithms -> Models
Algorithms -> Programming
Algorithms -> Environment
Data -> Environment
Environment -> Data
Data -> Algorithms
Data -> Problem
Problem -> Presentation
Models -> Presentation
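One way to turn the table above into a graph is sketched below. The edges are copied from the table; the function names are my own, and the output is Graphviz DOT text, which is only one of several possible formats.

```python
from collections import defaultdict

# Edges copied from the dependencies table ("A -> B": A impacts or dictates B).
EDGES = [
    ("Problem", "Data"),
    ("Problem", "Algorithms"),
    ("Algorithms", "Models"),
    ("Algorithms", "Programming"),
    ("Algorithms", "Environment"),
    ("Data", "Environment"),
    ("Environment", "Data"),
    ("Data", "Algorithms"),
    ("Data", "Problem"),
    ("Problem", "Presentation"),
    ("Models", "Presentation"),
]

def build_graph(edges):
    """Adjacency list: component -> components it impacts, in table order."""
    graph = defaultdict(list)
    for source, target in edges:
        graph[source].append(target)
        graph.setdefault(target, [])  # keep components with no outgoing edges
    return dict(graph)

def mutual_dependencies(edges):
    """Pairs of components that impact each other in both directions."""
    edge_set = set(edges)
    return sorted({tuple(sorted((a, b))) for a, b in edge_set if (b, a) in edge_set})

def to_dot(edges):
    """Render the edges as Graphviz DOT text."""
    lines = ["digraph components {"]
    lines += [f'    "{source}" -> "{target}";' for source, target in edges]
    lines.append("}")
    return "\n".join(lines)

graph = build_graph(EDGES)
loops = mutual_dependencies(EDGES)
dot_text = to_dot(EDGES)
```

Saving dot_text to a file and running Graphviz on it (for example dot -Tpng) draws the picture. The mutual_dependencies helper highlights the feedback loops in the table, where data both depends on and reshapes the problem and the environment.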
Also read the lifecycle of data science projects (see also this article).