Data Wrangling

Two paradigms

R

library(oilabs)
fdims <- bdims[bdims$sex == "f", ]
mean(fdims$hgt)

## [1] 164.8723

R with dplyr

library(dplyr)
fdims <- filter(bdims, sex == "f")
summarize(fdims, mean = mean(hgt))

##       mean
## 1 164.8723

R Subsetting

fdims <- bdims[bdims$sex == "f", ]

Square brackets are used to subset dataframes.
[rows, columns]

R Summaries

mean(fdims$hgt)

## [1] 164.8723

Many commands operate on a vector (column) of data extracted from a dataframe using the $.

dplyr Subsetting

fdims <- filter(bdims, sex == "f")

dplyr contains separate functions to filter the rows and select the columns.
The dataframe is one argument and the column names are another (no $).

fdims <- bdims %>% filter(sex == "f")

Can also chain commands using the pipe.

dplyr Summaries

summarize(fdims, mean = mean(hgt))

##       mean
## 1 164.8723

Any numerical summary that you want to apply to a column of a dataframe is specified within summarize().

fdims %>% summarize(mean = mean(hgt))

##       mean
## 1 164.8723

This, too, can be put into a chain.

Two paradigms

R

fdims <- bdims[bdims$sex == "f", ]
mean(fdims$hgt)

## [1] 164.8723

R with dplyr

library(dplyr)
fdims <- filter(bdims, sex == "f")
summarize(fdims, mean = mean(hgt))

##       mean
## 1 164.8723

Why dplyr?

Data sets are often of high volume (lots of rows) and high variety (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also often need to perform operations to organize and clean our data.

This is all possible in base R, but with dplyr, it is simple, readible, and fast.

The Seven Verbs (commands)

filter
arrange
select
distinct
mutate
summarize
sample_n

filter()

Allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame, the following arguments are the filters that you'd like to apply.

For all flights on January 1st:

library(pnwflights14)
filter(flights, month == 1, day == 1)

## Source: local data frame [419 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2014     1     1        1        96      235        70      AS  N508AS
## 2   2014     1     1        4        -6      738       -23      US  N195UW
## 3   2014     1     1        8        13      548        -4      UA  N37422
## 4   2014     1     1       28        -2      800       -23      US  N547UW
## 5   2014     1     1       34        44      325        43      AS  N762AS
## 6   2014     1     1       37        82      747        88      DL  N806DN
## 7   2014     1     1      346       227      936       219      UA  N14219
## 8   2014     1     1      526        -4     1148        15      UA  N813UA
## 9   2014     1     1      527         7      917        24      UA  N75433
## 10  2014     1     1      536         1     1334        -6      UA  N574UA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

Constructing filters

Filters are constructed of logical operators: <, >, <=, >=, ==, != (and some others).

Adding them one by one to filter() is akin to saying "this AND that". To say "this OR that OR both", use |.

filter(flights, month == 1 | month == 2)

## Source: local data frame [23,044 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2014     1     1        1        96      235        70      AS  N508AS
## 2   2014     1     1        4        -6      738       -23      US  N195UW
## 3   2014     1     1        8        13      548        -4      UA  N37422
## 4   2014     1     1       28        -2      800       -23      US  N547UW
## 5   2014     1     1       34        44      325        43      AS  N762AS
## 6   2014     1     1       37        82      747        88      DL  N806DN
## 7   2014     1     1      346       227      936       219      UA  N14219
## 8   2014     1     1      526        -4     1148        15      UA  N813UA
## 9   2014     1     1      527         7      917        24      UA  N75433
## 10  2014     1     1      536         1     1334        -6      UA  N574UA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

Exercise 1

Construct filters to isolate:

Flights that left on St. Patrick's Day.
Flights that were destined for Chicago's primary airport.
Flights that were destined for Chicago's primary airport and were operated by United Airlines.
Flights with flight times more than 2000 miles or that were in the air more than 5 hours.

Flights that left on St. Patrick's Day.
Flights that were destined for Chicago's primary airport.
Flights that were destined for Chicago's primary airport and were operated by United Airlines.
Flights with flight times more than 2000 miles or that were in the air more than 5 hours.

filter(flights, month == 3, day == 17)
filter(flights, dest == "ORD")
filter(flights, dest == "ORD", carrier == "UA")
filter(flights, distance > 2000 | air_time > 5 * 60)

arrange()

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)

Use desc() to sort in descending order.

arrange(flights, desc(arr_delay))

select()

Often you work with large datasets with many columns where only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset:

select(flights, year, month, day)

You can exclude columns using - and specify a range using :.

select(flights, -(year:day))

distinct()

A common use of select() is to find out which values a set of variables takes. This is particularly useful in conjunction with the distinct() verb which only returns the unique values in a table.

What do the following data correspond to?

distinct(select(flights, origin, dest))

## Source: local data frame [115 x 2]
## 
##    origin  dest
##     (chr) (chr)
## 1     PDX   ANC
## 2     SEA   CLT
## 3     PDX   IAH
## 4     PDX   CLT
## 5     SEA   ANC
## 6     SEA   DTW
## 7     SEA   ORD
## 8     SEA   DEN
## 9     SEA   EWR
## 10    PDX   DEN
## ..    ...   ...

mutate()

As well as selecting from the set of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of mutate():

mutate(flights, gain = arr_delay - dep_delay)

## Source: local data frame [162,049 x 17]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2014     1     1        1        96      235        70      AS  N508AS
## 2   2014     1     1        4        -6      738       -23      US  N195UW
## 3   2014     1     1        8        13      548        -4      UA  N37422
## 4   2014     1     1       28        -2      800       -23      US  N547UW
## 5   2014     1     1       34        44      325        43      AS  N762AS
## 6   2014     1     1       37        82      747        88      DL  N806DN
## 7   2014     1     1      346       227      936       219      UA  N14219
## 8   2014     1     1      526        -4     1148        15      UA  N813UA
## 9   2014     1     1      527         7      917        24      UA  N75433
## 10  2014     1     1      536         1     1334        -6      UA  N574UA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl), gain (dbl)

summarize() and sample_n()

summarize() collapses a data frame to a single row. It's not very useful yet. sample_n() provides you with a random sample of the rows.

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

## Source: local data frame [1 x 1]
## 
##      delay
##      (dbl)
## 1 6.133859

sample_n(flights, 10)

## Source: local data frame [10 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2014     6    13     2051         1     2330        -4      AS  N442AS
## 2   2014     5     3     1224        -6     1409       -11      AS  N534AS
## 3   2014     5    17     1258         8     1521         6      WN  N352SW
## 4   2014     4    23     1329        24     1909        20      AS  N307AS
## 5   2014     6    12     1307        -8     1406        -5      OO  N291SW
## 6   2014     7     9     1046         6     1411        -9      WN  N261WN
## 7   2014     4    25      652        -8      742        -8      OO  N219AG
## 8   2014     4    20     1307        -8     1534       -12      OO  N814SK
## 9   2014     2    26      631        -9      844        -2      OO  N216AG
## 10  2014     1    18      657        -3     1810        NA      WN  N204WN
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

Exercise 2

Mutate the data so that it includes a new variable that contains the average speed traveled by the plane for each flight. Select that new variable and save it, along with tailnum, as a new dataframe object called speed.

flights <- mutate(flights, speed = distance / (air_time / 60))
speed <- select(flights, tailnum, speed)

Aggregation aka group_by()

These verbs become really powerful when you combine them with the idea of "group by", repeating the operation individually on groups of observations within the dataset. The group_by() function describes how to break a dataset down into groups of rows.

You can then use the resulting object in exactly the same functions as above; they'll automatically work "by group" when the input is a grouped.

group_by()

We can find the fastest airplanes in the bunch:

by_tailnum <- group_by(speed, tailnum)
avg_speed <- summarize(by_tailnum, 
                       count = n(), 
                       avg_speed = mean(speed, na.rm = TRUE))
arrange(avg_speed, desc(avg_speed))

## Source: local data frame [3,023 x 3]
## 
##    tailnum count avg_speed
##      (chr) (int)     (dbl)
## 1   N306DQ     1  597.7982
## 2   N860NW     1  589.7297
## 3   N821NW     1  574.2105
## 4   N831MH     1  574.0969
## 5   N839MH     1  571.5789
## 6   N5FPAA     1  569.4774
## 7   N675NW     1  569.2174
## 8   N5DWAA     1  567.5000
## 9   N807NW     2  566.7639
## 10  N5DNAA     2  564.8443
## ..     ...   ...       ...

Chaining

Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the %>% operator. You start the pipeline with a dataframe and then pass it to each function in turn.

speed %>%
  group_by(tailnum) %>% 
  summarize(count = n(), avg_speed = mean(speed, na.rm = TRUE)) %>%
  arrange(desc(avg_speed))

## Source: local data frame [3,023 x 3]
## 
##    tailnum count avg_speed
##      (chr) (int)     (dbl)
## 1   N306DQ     1  597.7982
## 2   N860NW     1  589.7297
## 3   N821NW     1  574.2105
## 4   N831MH     1  574.0969
## 5   N839MH     1  571.5789
## 6   N5FPAA     1  569.4774
## 7   N675NW     1  569.2174
## 8   N5DWAA     1  567.5000
## 9   N807NW     2  566.7639
## 10  N5DNAA     2  564.8443
## ..     ...   ...       ...

Exercise 3

Form a chain that creates a dataframe containing only carrier and their average departure delay time. Which carriers have the highest and lowest mean delays?

flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))

## Source: local data frame [11 x 2]
## 
##    carrier avg_delay
##      (chr)     (dbl)
## 1       WN 13.329137
## 2       AA 10.597632
## 3       F9 10.152473
## 4       UA  9.795162
## 5       B6  8.462857
## 6       VX  7.852158
## 7       DL  4.820384
## 8       OO  4.435644
## 9       AS  2.783932
## 10      US  2.734850
## 11      HA  2.580439

Pipe into a plot

library(ggplot2)
flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay)) %>%
  qplot(x = carrier, y = avg_delay, data = ., geom = "point")

The `.`

When forming a chain, the %>% operator will default to putting the output of one function into the first argument of the next. You can change where this argument "lands" by using .

flights %>%
  filter(origin == "PDX") %>%
  qplot(x = carrier, data = ., geom = "bar")

Two paradigms

R

R with dplyr

R Subsetting

R Summaries

dplyr Subsetting

dplyr Summaries

Two paradigms

R

R with dplyr

Why dplyr?

The Seven Verbs (commands)

filter()

Constructing filters

Exercise 1

arrange()

select()

distinct()

mutate()

summarize() and sample_n()

Exercise 2

Aggregation aka group_by()

group_by()

Chaining

Exercise 3

Pipe into a plot

The .

The `.`