What is the best way to get started with Statistics for Programmers/Data Science?
I am often asked this question: What’s the best way to get started with Statistics for Programmers?
I have used the following approach in my teaching.
Comments are welcome.
Firstly, the interest in Statistics for Programmers is a fairly recent phenomenon.
This interest is based on the uptake of Data Science – a hot profession now.
Here’s how most people approach the problem:
They pick up an old high school statistics textbook, either their own from their younger days or a standard text.
These books are often decades old.
They start at page one and work linearly through a few pages.
They quickly realize why they disliked stats earlier.
And that sentiment has not changed with the passage of time.
But here is a different approach:
For Data Science, you do not need to master Statistics per se.
You need to understand Statistical models.
A model is defined as a combination of predictive algorithms (based on Statistics) and Data.
Data science is based on creating models that improve with experience (training).
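To make "a model is an algorithm plus data" concrete, here is a minimal sketch using only the Python standard library. The algorithm is ordinary least squares for a straight line; the data is synthetic, generated from a made-up rule (y = 2x + 1 plus noise) chosen purely for illustration. The function names fit_line and make_data are hypothetical helpers, not part of any library.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def make_data(n, rng):
    """Synthetic data (an assumption for this sketch): y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
xs, ys = make_data(1000, rng)

# Fit the same algorithm on growing amounts of data and compare the estimates.
estimates = {}
for n in (5, 50, 1000):
    estimates[n] = fit_line(xs[:n], ys[:n])
    slope, intercept = estimates[n]
    print(f"n={n:4d}  slope={slope:.3f}  intercept={intercept:.3f}")
```

The algorithm never changes between runs; only the amount of data does. With more training data the estimated slope and intercept should generally settle closer to the values used to generate the data, which is the sense in which a model "improves with training".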
In contrast to the textbook route, an engineering-led approach starts with problems.
I recommend three sources which I am using (if you have others, please let me know at ajit.jaokar at futuretext.com and I shall link them and refer back to you).
Start with Understanding the problem
See these two links by @Brandon Rohrer (@Microsoft Data Science) –
Which algorithm family can answer my question and
Which questions can Data Science answer.
See also this post by Dr Vincent Granville @DataScienceCtrl
on 24 uses of Statistical modelling Part 1 and 2
These posts give you an idea of the problems that can be solved using data science and stats (without going into the math itself initially).
Then read Allen Downey’s books
Allen Downey writes excellent books, and they are all free under Creative Commons licenses. You can download them from Green Tea Press, which has an excellent ethos. I especially recommend Think Stats, Think Bayes and Think Complexity (in that order).
To support the author, I would also encourage you to buy these books, especially Think Stats.
You can follow him on Twitter @allendowney
Once you have reached this stage, start with code and small datasets.
I prefer the UCI datasets and the Python scikit-learn library. We could also use the REPL approach or notebooks, e.g. Spark Notebook.
In any case, these are small sections of code, run in a controlled environment, that show you how the stats are implemented (libraries and APIs like scikit-learn are relatively easy to understand if you come from a programming background).
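As an illustration of what such a small section of code might look like, here is a sketch that assumes scikit-learn is installed. It uses the Iris data, a classic UCI dataset that ships with scikit-learn, so nothing needs to be downloaded. The choice of logistic regression, the 70/30 split and the random seed are my own assumptions for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the bundled copy of the Iris dataset (150 flowers, 4 measurements each).
iris = load_iris()
X, y = iris.data, iris.target

# Step 1: look at the data with basic descriptive statistics.
for name, column in zip(iris.feature_names, X.T):
    print(f"{name:20s} mean={column.mean():.2f}  std={column.std():.2f}")

# Step 2: hold back part of the data so the model is scored on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 3: fit a statistical model (here, logistic regression) and score it.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"accuracy on held-out data: {accuracy:.2f}")
```

Each step is a few lines that can be run one at a time in a REPL or a notebook cell, which makes it easy to inspect the data and the fitted model before moving on.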
We use this approach in our teaching on the Data Science for IoT course.
Any comments are welcome on how the teaching of Statistics can be improved.
Image source: Scatter plots – wikipedia