Guest blog post by Martijn Theuwissen, co-founder at DataCamp. Other Python resources can be found here.
Python is widely used for data analysis and you might have considered learning it yourself (if not, or if you’re still looking for that bit of extra motivation to get started, see why you should be learning Python below). Of course, learning on your own can be a challenge and some guidance is always helpful. Guidance to learn Python for working with data is exactly what this article will provide you with.
We will discuss the steps you should take to learn Python, accompanied by some essential resources, such as the free Python for Data Analysis courses and tutorials from DataCamp, as well as reading and learning materials.
Step 0: Reasons to Learn Python
Why learn Python as a data analytics tool?
- It’s a Popular Data Analysis Tool: Firstly, by itself Python is one of the most popular tools for data analysis. With 35% of data scientists using Python, it is ahead of SQL and SAS, and behind only R.
- General Purpose Programming: Although there are other very popular and capable tools for analyzing data (e.g. R, SAS), Python is the only true general-purpose programming language among them. Check out this infographic for a more thorough comparison.
- Popular Programming Language: In addition, Python is one of the most popular programming languages when compared with other general-purpose languages (e.g. Java, C++, PHP).
- If that’s not enough, Python is also the language of choice for teaching computer science in top U.S. universities.
As a side note: we don’t recommend that you only learn Python and forget about the rest. However, learning Python is one of the best things you can do for your career. There are good reasons why Python is being adopted so widely by computer scientists, and why it’s a data analysis tool of choice for so many, the main one being the ease of learning and using Python. Nonetheless, it can be challenging to set a learning path, so that’s what we will do now.
Step 1: Setting up your Python Environment for Data Analysis
Setting up your Python environment for performing data analysis is relatively simple. The most convenient way to go about this is to download the free Anaconda package from Continuum Analytics, as it contains the core Python language, as well as all of the essential libraries including NumPy, Pandas, SciPy, Matplotlib, and IPython. With the graphical installer, installing Python is as easy as installing any other computer program.
After installing, you will get a launcher containing a number of programs. The most important one is the IPython notebook, which is also called the Jupyter notebook. Once you launch the notebook, a terminal window opens and a notebook opens in your browser. Don’t get confused here! You don’t need an internet connection to create or use the notebooks. The browser is simply used instead of a separate program and serves as the environment where you can code.
However, you are not limited to using the browser-based Jupyter notebooks. If you prefer an IDE, a great option for data analysis is Rodeo from Yhat. If you are familiar with RStudio for R, Rodeo is something very similar for Python. Be sure to try out both alternatives, as, ultimately, the Python environment you use will depend on your personal preference.
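Once the installation is done, a quick way to confirm that the libraries mentioned above are available is to try importing them. The snippet below is only a minimal sketch: the helper name installed_versions is made up for this example, and the package list simply mirrors the libraries named in this step.

```python
import importlib

# Import names of the libraries mentioned in this step.
packages = ["numpy", "pandas", "scipy", "matplotlib", "IPython"]


def installed_versions(names):
    """Return {name: version string, or None if the import fails}.

    Hypothetical helper written for this example, not part of any library.
    """
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Most scientific packages expose __version__; fall back if not.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in installed_versions(packages).items():
    print(f"{name}: {version if version else 'not installed'}")
```

You can paste this into a notebook cell or run it as a script. Any package reported as not installed can usually be added through Anaconda.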
Step 2: Learning the Basics and Fundamentals
Now you are ready to begin learning to code with Python. There are a couple of good ways to go about this. Given your interest in learning Python for data analysis, your best option is the Introduction for Python for Data Science from DataCamp. This free course consists of video tutorials and interactive in-browser exercises and is a great way to learn by doing, as opposed to simply reading concepts and looking at examples. You wouldn’t begin learning how to paint by reading a book about it. You would pick up a brush and start painting. That’s the way we would suggest you start learning Python! In addition to the introductory course, DataCamp offers an Intermediate Python for Data Science course which takes you even further.
Another quite useful resource is the Python course from Codecademy. While this course is not about data, but rather programming with Python, it is a great way to both practice with Python syntax and gain exposure to programming concepts that will be useful to you when working with data.
Step 3: Python Packages for Data Analysis
Python is a general purpose language and is often used for things other than data analysis and data science. What makes Python extremely useful for working with data, however, are the libraries that give users the necessary functionality. Below are the major Python libraries that are used for working with data. You should take some time to familiarize yourself with the basic purposes of these packages.
- NumPy and SciPy – fundamental scientific computing.
- Pandas – data manipulation and analysis.
- Matplotlib – plotting and visualization.
- Scikit-learn – machine learning and data mining.
- StatsModels – statistical modeling, testing, and analysis.
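To give a feel for what "fundamental scientific computing" means in practice, here is a small sketch using NumPy only. The height and weight values are invented for illustration. The point is that arithmetic applies to whole arrays at once, without writing loops.

```python
import numpy as np

# Invented example measurements for four people.
heights_cm = np.array([170.0, 182.0, 165.0, 176.0])
weights_kg = np.array([65.0, 80.0, 54.0, 72.0])

# Element-wise arithmetic on whole arrays: body mass index per person.
bmi = weights_kg / (heights_cm / 100) ** 2
print(bmi.round(1))

# Built-in aggregations.
print(bmi.mean(), bmi.std())

# Boolean masks select the elements that satisfy a condition.
tall = heights_cm > 175
print(weights_kg[tall])
```

Pandas, Matplotlib, Scikit-learn, and StatsModels all work with NumPy arrays, which is why it is worth getting comfortable with them early.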
Step 4: Get Data to Learn With. (Loading Data)
The best way to learn and get comfortable with Python, or any other new programming language, is to take a sample dataset to work with, experiment, and try the new skills and techniques you pick up along the way.
The StatsModels library contains some preloaded datasets that you could use. Otherwise, you can load a dataset from the web or from a CSV file. To do so, you can follow sample code from available examples or from forums like Stack Overflow. Always have your dataset available and treat it as a toy that you can play with and learn from.
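As a rough illustration of loading a CSV file, the sketch below uses the Pandas function read_csv. To keep the example self-contained, the CSV content is an invented toy table held in a string. With a real dataset you would pass a file path or URL to read_csv instead.

```python
import io

import pandas as pd

# Invented toy data standing in for a CSV file on disk or on the web.
csv_text = """city,year,sales
Amsterdam,2015,120
Amsterdam,2016,135
Boston,2015,98
Boston,2016,110
"""

# read_csv accepts a path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head())      # first rows of the table
print(toy.dtypes)      # column types inferred by Pandas
print(toy.describe())  # summary statistics for the numeric columns
```

Looking at the first rows, the inferred types, and the summary statistics is a sensible first check on any new dataset.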
Step 5: Manipulating Data
One of the most hands-on skills of working with data is data manipulation. Data doesn’t always come clean and analysis-ready. Before we can analyse data, we often need to transform, format, and clean it. Pandas and NumPy are the go-to tools for that in Python, so start learning how to use them with your sample dataset.
- Get started with a short introduction: 10 Minutes to Pandas
- Follow an introductory tutorial: Pandas Notebook Lesson
- Go back to the DataCamp courses and re-apply what you learned to your toy dataset: DataCamp Python for Data Science
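The kinds of cleaning and transformation described in this step might look like the following sketch. The table is invented and deliberately messy, with inconsistent city names and a missing value. Filling the gap with the column mean is just one possible choice, not a general recommendation.

```python
import numpy as np
import pandas as pd

# Invented, deliberately messy toy data.
raw = pd.DataFrame({
    "city": ["Amsterdam", "amsterdam ", "Boston", "Boston"],
    "year": [2015, 2016, 2015, 2016],
    "sales": [120.0, np.nan, 98.0, 110.0],
})

clean = raw.copy()

# Formatting: strip stray whitespace and normalise capitalisation.
clean["city"] = clean["city"].str.strip().str.title()

# Cleaning: fill the missing value with the column mean (one option of many).
clean["sales"] = clean["sales"].fillna(clean["sales"].mean())

# Transformation: derive a new column with NumPy.
clean["log_sales"] = np.log(clean["sales"])

# Aggregation: average sales per city.
per_city = clean.groupby("city")["sales"].mean()
print(clean)
print(per_city)
```

Working on a copy keeps the raw data intact, so you can always compare the cleaned table with the original.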
Step 6: Visualizing Data
Another essential skill in data analysis is data visualization. Visuals are extremely important for both exploratory data analysis and the communication of your results. Matplotlib is the most commonly used library for this in Python.
- Get inspired by viewing some plots and graphs: Matplotlib Gallery
- Take a look at some sample code: Matplotlib Examples
- Review the Matplotlib chapter on DataCamp: DataCamp Python for Data Science
- Come up with some visualizations for your toy dataset.
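A first visualization of a toy dataset could be as simple as the line plot sketched below. The numbers are invented. The Agg backend is chosen only so that the example also runs without a display. In a Jupyter notebook you would normally leave that line out and let the figure appear inline.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; usually not needed in a notebook
import matplotlib.pyplot as plt

# Invented toy data.
years = [2012, 2013, 2014, 2015, 2016]
sales = [90, 104, 98, 120, 135]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, sales, marker="o", label="sales")
ax.set_xlabel("Year")
ax.set_ylabel("Sales")
ax.set_title("Toy dataset: sales per year")
ax.legend()

# Write the figure as PNG into an in-memory buffer; passing a filename
# such as "toy_sales.png" to savefig would write it to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Labelled axes and a title cost only a few lines and make the plot much easier to read when you come back to it later.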
Step 7: Data Analytics
Of course analyzing data is not just about formatting and making plots and graphs. The analytics begins with statistical modeling, machine learning algorithms, data mining techniques, inferences and so on. Python is a fantastic tool for analyzing data because it has libraries such as Scikit-learn and StatsModels which contain the implementations of the models and algorithms that you might need for your analysis. Of course, as Python is a general purpose programming language, you are also free to program your own methods when you become an advanced user, though make sure you are not replicating what already exists.
- Begin by considering a familiar technique that you will be able to follow (e.g. Linear Regression, K-nearest neighbors, Time Series) and find an example of its implementation in Python.
- Try performing a simple analysis on your toy dataset.
- You can look at examples of Scikit-learn and StatsModels methods that you might not know, just to appreciate the possibilities.
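As one example of a familiar technique, the sketch below fits a linear regression with Scikit-learn. The data is invented and follows y = 2x + 1 exactly, so that the fitted slope and intercept are easy to check by eye. Real data will of course be noisier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data on an exact line: y = 2x + 1.
# Scikit-learn expects X as a 2-D array of shape (n_samples, n_features).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
print("slope:", slope, "intercept:", intercept)

# Predict the response for a new, unseen value of x.
prediction = model.predict(np.array([[6.0]]))[0]
print("prediction for x=6:", prediction)
```

Most Scikit-learn models follow the same fit and predict pattern, so swapping in another estimator, such as a K-nearest neighbors model, changes very little of this code.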
Step 8: Reporting
Communicating your analysis is a key soft skill in data science. Of course, communication begins with good use of language and style. However, an equally important aspect of communicating your analysis is preparing legible reports. Luckily you have a handy tool for that in the form of the previously mentioned Jupyter Notebooks (see step 1).
While you can use the Jupyter Notebooks as a place to code and do your analysis, you can also embed text, formatted formulas, and even images and video if you like. What’s more, you have the option of exporting your work in various formats, including PDF, HTML, and Markdown.
- Gain a better familiarity with Jupyter through a tutorial: Jupyter Notebooks Getting Started
- If you are familiar with LaTeX learn how to use it with Jupyter: LaTeX in iPython Notebooks
- If you are not familiar with LaTeX but you want to use mathematical notation in your reports, read about it: LaTeX for Mathematics Wiki
- Read some high-quality reports: Data Science Projects from Berkeley
Step 9: Mastering Python
After learning the basics of Python and exploring the main tools and libraries with your sample dataset, you should proceed to taking some courses, either on Python itself or taught with Python, to begin mastering the language. Apparently, after 10,000 hours you can become an expert in anything, so don’t wait and get started! Some online courses and sources of projects we would recommend include:
- Data Management and Processing: Python for Everybody by University of Michigan
- Data Analysis: Data Analysis and Interpretation Specialization
- Data Science Fundamentals: Intro to Data Science on Udacity
- Kaggle Competitions
- DrivenData.org Competitions