This article was written by Manish Saraswat.
Source for picture: click here
This article is meant to help R users enhance their set of skills and learn Python for data science (from scratch). After all, R and Python are the most important programming languages a data scientist must know.
Python is a supremely powerful and a multi-purpose programming language. It has grown phenomenally in the last few years. It is used for web development, game development, and now data analysis / machine learning. Data analysis and machine learning is a relatively new branch in python.
For a beginner in data science, learning python for data analysis can be really painful. Why ? You try Googling “learn python,” and you’ll get tons of tutorials only meant for learning python for web development. How can you find a way then ?
In this tutorial, we’ll be exploring the basics of python for performing data manipulation tasks. Alongside, we’ll also look how you do it in R. This parallel comparison will help you relate the set of tasks you do in R to how you do it in python! And in the end, we’ll take up a data set and practice our newly acquired python skills.
Note: This article is best suited for people who have a basic knowledge of R language.
Table of Contents:
- Why learn Python (even if you already know R)
- Understanding Data Types and Structures in Python vs. R
- Writing Code in Python vs. R
- Practicing Python on a Data Set
1. Why learn Python (even if you already know R)
No doubt, R is tremendously great at what it does. In fact, it was originally designed for doing statistical computing and manipulations. Its incredible community support allows a beginner to learn R quickly.
But, python is catching up fast. Established companies and startups have embraced python at a much larger scale compared to R.
According to indeed.com (from Jan 2016 to November 2016), the number of job postings seeking “machine learning python” increased much faster (approx. 123%) than “machine learning in R” jobs. Do you know why ? It is because
- Python supports the entire spectrum of machine learning in a much better way.
- Python not only supports model building but also supports model deployment.
- The support of various powerful deep learning libraries such as keras, convnet, theano, and tensorflow is more for python than R.
- You don’t need to juggle between several packages to locate a function in python unlike you do in R. Python has relatively fewer libraries, with each having all the functions a data scientist would need.
2. Understanding Data Types and Structures in Python vs. R:
These programming languages understand the complexity of a data set based on its variables and data types. Yes! Let’s say you have a data set with one million rows and 50 columns. How would these programming languages understand the data ?
Basically, both R and Python have pre-defined data types. The dependent and independent variables get classified among these data types. And, based on the data type, the interpreter allots memory for use. Python supports the following data types:
- Numbers- It stores numeric values. These numeric values can be stored in 4 types: integer, long, float, and complex. Let’s understand them.
- Integer – It refers to whole numbers such as 10,13,91,102, etc. It is the same as R’s
integer
type. - Long – It refers to long integers which are represented in octa and hexadecimal. In R, you use
bit64
package to read hexadecimal values. - Float – It refers to decimal values such as 1.23, 9.89, etc. It is the same as R’s
numeric
type. - Complex – It refers to complex numbers such as 2 + 3i, 5i, etc. However, this data type is rarely found in data.
- Integer – It refers to whole numbers such as 10,13,91,102, etc. It is the same as R’s
- Boolean- It stores two values (True and False). In R, it can be stored as a
factor
type or acharacter
type. There exists a tiny difference between Boolean values in R and python. In R, Boolean are stored as TRUE and FALSE. In python, they are stored as True and False. There’s a difference in the letter case. - Strings- It stores text (character) data such as “elephant,” “lotus,” etc. It is the same as R’s
character
type. - Lists- It is the same as R’s list data type. It is capable of storing values of multiple variable types such as string, integer, Boolean, etc.
- Tuples- There is nothing like tuples in R. Think of tuples as an R vector whose values can’t be changed; i.e., it is immutable.
- Dictionary- It provides a two dimensional structure which supports key : value pair. In simple words, think of a key as a column name, and pair as column values.
Since R is a statistical computing language, all the functions to manipulate data and reading variables are available inherently. On the other hand, python hails all the data analysis / manipulation / visualization functions from external libraries. Python has several libraries for data manipulation and machine learning. The most important ones are:
- Numpy- It is used for doing numerical computing in python. It provides access to numerous mathematical function such as linear algebra, statistics etc. It is largely used to create arrays. In R, think of an array as a list. It consists of one class (numeric or string or boolean) or multiple classes also. It can be unidimensional or multidimensional.
- Scipy- It is used for doing scientific computing in python.
- Matplotlib- It is used for doing data visualization in python. For R, we use the famous ggplot2 library.
- Pandas- It is the powerhouse for doing data manipulation tasks. In R, we use packages like dplyr, data.table etc.
- Scikit Learn- It is the powerhouse for implementing machine learning algorithms. In fact, it’s the best part about doing machine learning in python. It contains all the functions you would require for model building.
In a way, python for a data scientist is largely about mastering the libraries stated above. However, there are many more advanced libraries which people have started using. Therefore, for practical purposes you should remember the following things:
- Array- This is similar to R’s
list
. It can be multidimensional. It can contain data of the same or multiple classes. In case of multiple classes, the coercion effect takes place. - List- This is also similar to R’s List.
- Data Frame- It’s a two-dimensional structure comprising several lists. R has a built-in function
data.frame
and python uses theDataframe
function from the pandas library. - Matrix- It’s a two (or multi) dimensional structure comprising all values of the same class (or multiple class). Think of a matrix as a 2D-version of a vector. In R, we use the
matrix
function. In python, we use thenumpy.column_stack
function.
Until here, I hope you’ve understood the basics of data types and data structures in R and Python. Now, let’s start working with them!
To read the full original article (and to learn writing code in Python vs. R and practice Python on a data set) click here. For more Python vs. R related articles on DSC click here.
DSC Resources
- Services: Hire a Data Scientist | Search DSC | Classifieds | Find a Job
- Contributors: Post a Blog | Ask a Question
- Follow us: @DataScienceCtrl | @AnalyticBridge
Popular Articles