This article was written by Claudia Perlich. Claudia is Chief Scientist at Dstillery and an Adjunct Professor at NYU. She is also a Data Scientist at Quora.
First off, let me state what I think is NOT the problem: the fact that data scientists spend 80% of their time on data preparation. That is their JOB! If you are not good at data preparation, you are NOT a good data scientist. It is not a janitor problem, as Steve Lohr provocatively suggested. The validity of any analysis rests almost completely on the preparation. The algorithm you end up using is close to irrelevant. Complaining about data preparation is like being a farmer who complains about having to do anything but harvesting and wants somebody else to deal with the pesky watering, fertilizing, weeding, etc.
That being said, data preparation can be made difficult by the process of raw data collection. Designing a system that collects data in a form that is useful and easily digestible by data science is a high art. Giving DS full transparency into exactly how the data flows into the system is another. It involves processes that consider sampling, data annotation, matching, etc. It does not include things like replacing missing values and excessive normalization. Creating an effective data environment for DS needs to involve DS and cannot be entirely owned by engineering. DS is often NOT able to spec such system requirements in sufficient detail to allow for a clean handover.
But in the bigger picture, there are more important things to consider. By far the biggest issue I see is data science solving irrelevant problems. This is a huge waste of time and energy. The reason is typically that whoever has the problem lacks the data science understanding to even express the issue, so data scientists end up solving whatever they understood the problem to be, ultimately creating a solution that is not really helpful (and often far too complicated)…