Countless hours of online courses hadn’t prepared me for the challenges of my first full-time position as a data scientist. Yes, I learned Python well enough to land the job, but the reality of developing a data science project exceeded my expectations. Now it’s time for me to pinpoint a few misconceptions and issues that aren’t voiced often enough.
Do not take data for granted
The Kaggle times, when tremendous community effort was unearthing secrets and patterns of almost every data set, are over once you turn pro. You will find your data fragmented, skewed and screwed, simply missing, or abundant but noisy – just to list a few plausible scenarios. Your newbie energy will keep you from getting discouraged, yet patching the gaps may consume more time and resources than you have available. Although it’s often said that companies sit on stockpiles of data, that doesn’t mean the data is accessible for data science research. You can easily find yourself constrained by licenses, corporate agreements, confidentiality matters and technical issues, such as parsing or streaming. If that happens, turn to conversations with experts. Their thorough understanding of the field you were appointed to investigate will guide you through the confusion and speed up your research. There is one more vital reason to stay in touch with the experts’ panel. It’s been said many times that data science projects fail due to miscommunication with clients: either you get their expectations wrong, or they imagine your solution to be different than it is. This is by all means true.
My takeaway here is to bridge gaps in the data with the expertise of people close to the problem, and to foster cooperation with the data engineering team. After all, they deliver the data fuel for your model rocket.
Do not underestimate the power of Maths
Back in the day, it was so much fun to import this or that from scikit-learn and fit my models. What I quickly experienced at work is the cost of computation, especially if you work on big data, meaning you’re out of RAM right after loading a data set. The currency of that cost is either real money spent on the cloud or execution time on in-house infrastructure. It may also happen, as in my case, that your environment requires you to switch from Python to PySpark. Regardless of the industry, business objectives are always the same: if your solution is to run in production, it has to be fast and cheap. Otherwise, you will keep circling the infinite R&D loop. That’s why I turned to statistics and probability, investigating how I could blend pure maths into my algorithms. Working closely with experts gave me the vital context of the industry our team was assigned to. Splitting complex problems into really narrow cases, separated by well-defined thresholds, made even the standard deviation applicable. Although it may not sound data scientific at all, relatively simple maths can deliver lean solutions that work lightning fast on excessively large volumes of data.
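To make that concrete, here is a minimal PySpark sketch of the kind of threshold-based case splitting I mean: flag records that fall more than three standard deviations from their segment’s mean. The input path, the `segment` and `value` columns, and the 3-sigma cutoff are hypothetical placeholders, not the actual project setup.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("threshold-sketch").getOrCreate()

# Hypothetical input: one row per event, with a categorical "segment"
# column and a numeric "value" column.
events = spark.read.parquet("events.parquet")

# A single aggregation pass computes per-segment mean and standard deviation.
stats = events.groupBy("segment").agg(
    F.mean("value").alias("mu"),
    F.stddev("value").alias("sigma"),
)

# Joining the small stats table back is cheap; anything outside
# mu +/- 3*sigma gets flagged for a closer look.
flagged = (
    events.join(stats, on="segment")
          .withColumn("is_outlier",
                      F.abs(F.col("value") - F.col("mu")) > 3 * F.col("sigma"))
)
flagged.filter("is_outlier").show()
```

Two passes over the data and one small join – no model training, no hyperparameters, and it scales to whatever volume the cluster can read.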
Git matters
I wasn’t any different from the numerous junior data scientists entering the job market with a belief that Jupyter Notebook is the fundamental tool of our work. I simply couldn’t be more wrong. Just as the name suggests, a ‘notebook’ is for keeping notes, full stop. Jupyter won’t facilitate teamwork, won’t enable code version control and won’t lead you to production. My conclusion regarding Jupyter Notebook is that although it’s great for quick exploration and verification of your ideas, it cripples the overall performance of the data science team. What has become of fundamental importance instead is keeping your code repository thriving. Daily commits and working on branches will improve the transparency of your project, facilitate testing and deployment, and make it easier to take over tasks from fellow data scientists. Before I started as a data scientist I was on a 3-month front-end web app internship. One year later, it’s really striking how much typical app development has in common with developing data science projects.
Having articulated my thoughts above, let me conclude with one productivity hack that was unthinkable during my data science discovery phase. Disconnect from Jupyter Notebook and say hello to the Python IDE of your choice. You won’t lose the Jupyter experience, as both Visual Studio Code and PyCharm support notebooks. What you gain, however, is the instant ability to turn your code into proper .py files. This is what you commit at the end of the day and schedule for testing in the development environment. Tracking the changes and development of your algorithms is a robust component of quality assurance and an indicator of your performance. This is how you keep things organized. Ultimately, running a data science project is very much like app development. At least this is what I’ve observed in my rookie year as a data scientist.
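For illustration, here is a minimal sketch of what ‘turning notebook code into a proper .py file’ can look like; the module name, the scoring function and the sample data are hypothetical, not taken from any real project. The point is simply that a plain module with a small entry-point guard can be imported by tests, reviewed in a pull request and scheduled in a pipeline, none of which a notebook cell gives you for free.

```python
"""score.py -- exploration code promoted from a notebook into a module."""

import pandas as pd


def add_z_score(df: pd.DataFrame, column: str = "value") -> pd.DataFrame:
    """Return a copy of df with a z-score column for the given column.

    Kept as a pure function so it can be imported by tests and pipelines.
    """
    out = df.copy()
    out["z"] = (out[column] - out[column].mean()) / out[column].std()
    return out


if __name__ == "__main__":
    # Quick manual check, replacing the scratch cell at the bottom of
    # the old notebook; the real entry point lives in the pipeline.
    sample = pd.DataFrame({"value": [1.0, 2.0, 3.0, 100.0]})
    print(add_z_score(sample))
```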