This is a complete tutorial on data wrangling (data manipulation) with R. It covers dplyr, one of the most powerful R packages for data wrangling. The package was written by Hadley Wickham, a prominent R developer who has also written many other useful packages such as ggplot2 and tidyr. dplyr is one of the most popular R packages to date. This post includes several examples and tips on how to use the dplyr package for cleaning and transforming data.
dplyr vs. Base R Functions
dplyr functions are often faster than their base R equivalents because they were written in a computationally efficient manner. Their syntax is also more consistent, and they are designed to work with data frames rather than individual vectors.
dplyr Function | Description | Equivalent SQL |
---|---|---|
select() | Selecting columns (variables) | SELECT |
filter() | Filter (subset) rows. | WHERE |
group_by() | Group the data | GROUP BY |
summarise() | Summarise (or aggregate) data | – |
arrange() | Sort the data | ORDER BY |
inner_join(), left_join(), etc. | Joining data frames (tables) | JOIN |
mutate() | Creating New Variables | COLUMN ALIAS |
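To see how several of the verbs in the table fit together, here is a minimal sketch on a small made-up data frame. The data frame `sales` and its columns `region` and `amount` are assumptions invented for this illustration; they are not part of the tutorial's dataset.

```r
library(dplyr)

# Hypothetical data, invented for this illustration
sales <- data.frame(
  region = c("East", "East", "West", "West", "West"),
  amount = c(10, 20, 5, 15, 40)
)

# WHERE amount > 5
kept <- filter(sales, amount > 5)

# GROUP BY region
grouped <- group_by(kept, region)

# Aggregate: one row per region with the summed amount
totals <- summarise(grouped, total = sum(amount))

# ORDER BY total, largest first
result <- arrange(totals, desc(total))

print(result)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is why the output of one step can be passed straight into the next.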
Example 1 : Selecting Random N Rows
The sample_n function selects random rows from a data frame (or table). The second parameter of the function tells R the number of rows to select.
sample_n(mydata,3)
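The call above can be tried on a small stand-in data frame. This is only a sketch: the contents of `mydata` below are invented, with column names assumed from the examples in this tutorial, and `set.seed()` is added so that the random draw is reproducible.

```r
library(dplyr)

# Hypothetical stand-in for mydata (values invented for illustration)
mydata <- data.frame(
  Index = c("A", "A", "C", "F", "G", "H", "I", "K", "M", "N"),
  State = c("Alabama", "Alaska", "California", "Florida", "Georgia",
            "Hawaii", "Idaho", "Kansas", "Maine", "Nevada"),
  Y2007 = c(110, 120, 130, 140, 150, 160, 170, 180, 190, 200),
  Y2008 = c(115, 125, 135, 145, 155, 165, 175, 185, 195, 205)
)

set.seed(123)                  # fix the random number generator state
picked <- sample_n(mydata, 3)  # draw 3 rows at random, without replacement

print(picked)
```

In dplyr 1.0.0 and later, `slice_sample(mydata, n = 3)` is the recommended replacement for `sample_n()`, although `sample_n()` still works.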
Example 2 : Selecting Random Fraction of Rows
The sample_frac function returns a random fraction of the rows. In the example below, it returns a random 10% of the rows.
sample_frac(mydata,0.1)
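As a quick check of how the fraction translates into a row count, here is a sketch on an invented 20-row data frame. The columns `id` and `value` are assumptions made up for this illustration.

```r
library(dplyr)

# Hypothetical 20-row data frame (values invented for illustration)
mydata <- data.frame(
  id    = 1:20,
  value = seq(10, 200, by = 10)
)

set.seed(42)                        # make the draw reproducible
picked <- sample_frac(mydata, 0.1)  # 10% of 20 rows

print(nrow(picked))
```

With 20 rows, a fraction of 0.1 yields 2 rows. Which two rows are drawn depends on the seed.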
Example 3 : Selecting Variables (or Columns)
Suppose you are asked to select only a few variables. The code below selects the variable “Index” and all columns from “State” to “Y2008”.
mydata2 = select(mydata, Index, State:Y2008)
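The `State:Y2008` range selects by column position, so it picks up every column that sits between the two names. A minimal sketch, assuming a stand-in `mydata` whose values are invented for illustration:

```r
library(dplyr)

# Hypothetical stand-in for mydata (values invented for illustration)
mydata <- data.frame(
  Index = c("A", "C", "F"),
  State = c("Alabama", "California", "Florida"),
  Y2006 = c(100, 200, 300),
  Y2007 = c(110, 210, 310),
  Y2008 = c(120, 220, 320),
  Y2009 = c(130, 230, 330)
)

# Keep Index, plus every column from State through Y2008 in column order
mydata2 <- select(mydata, Index, State:Y2008)

print(names(mydata2))
```

Here `Y2009` is left out because it comes after `Y2008` in the data frame.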
Example 4 : Dropping Variables
The minus sign before a variable tells R to drop the variable.
mydata = select(mydata, -Index, -State)
The above code can also be written as:
mydata = select(mydata, -c(Index,State))
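The two forms can be compared side by side. This sketch uses an invented stand-in for `mydata` and assigns each result to a new name (`dropped1` and `dropped2`, both hypothetical), because overwriting `mydata` with the first call would remove the columns that the second call tries to drop.

```r
library(dplyr)

# Hypothetical stand-in for mydata (values invented for illustration)
mydata <- data.frame(
  Index = c("A", "C", "F"),
  State = c("Alabama", "California", "Florida"),
  Y2007 = c(110, 210, 310),
  Y2008 = c(120, 220, 320)
)

# Drop the columns one at a time
dropped1 <- select(mydata, -Index, -State)

# Drop the same columns as a single vector
dropped2 <- select(mydata, -c(Index, State))

print(names(dropped1))
print(identical(dropped1, dropped2))
```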