R
library(oilabs) fdims <- bdims[bdims$sex == "f", ] mean(fdims$hgt)
## [1] 164.8723
R with dplyr
library(dplyr) fdims <- filter(bdims, sex == "f") summarize(fdims, mean = mean(hgt))
## mean ## 1 164.8723
library(oilabs) fdims <- bdims[bdims$sex == "f", ] mean(fdims$hgt)
## [1] 164.8723
library(dplyr) fdims <- filter(bdims, sex == "f") summarize(fdims, mean = mean(hgt))
## mean ## 1 164.8723
fdims <- bdims[bdims$sex == "f", ]
[rows, columns]
mean(fdims$hgt)
## [1] 164.8723
$
.fdims <- filter(bdims, sex == "f")
dplyr
contains separate functions to filter the rows and select the columns.$
).fdims <- bdims %>% filter(sex == "f")
summarize(fdims, mean = mean(hgt))
## mean ## 1 164.8723
summarize()
.fdims %>% summarize(mean = mean(hgt))
## mean ## 1 164.8723
fdims <- bdims[bdims$sex == "f", ] mean(fdims$hgt)
## [1] 164.8723
library(dplyr) fdims <- filter(bdims, sex == "f") summarize(fdims, mean = mean(hgt))
## mean ## 1 164.8723
Data sets are often of high volume (lots of rows) and high variety (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also often need to perform operations to organize and clean our data.
This is all possible in base R, but with dplyr
, it is simple, readible, and fast.
Allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame, the following arguments are the filters that you'd like to apply.
For all flights on January 1st:
library(pnwflights14) filter(flights, month == 1, day == 1)
## Source: local data frame [419 x 16] ## ## year month day dep_time dep_delay arr_time arr_delay carrier tailnum ## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr) ## 1 2014 1 1 1 96 235 70 AS N508AS ## 2 2014 1 1 4 -6 738 -23 US N195UW ## 3 2014 1 1 8 13 548 -4 UA N37422 ## 4 2014 1 1 28 -2 800 -23 US N547UW ## 5 2014 1 1 34 44 325 43 AS N762AS ## 6 2014 1 1 37 82 747 88 DL N806DN ## 7 2014 1 1 346 227 936 219 UA N14219 ## 8 2014 1 1 526 -4 1148 15 UA N813UA ## 9 2014 1 1 527 7 917 24 UA N75433 ## 10 2014 1 1 536 1 1334 -6 UA N574UA ## .. ... ... ... ... ... ... ... ... ... ## Variables not shown: flight (int), origin (chr), dest (chr), air_time ## (dbl), distance (dbl), hour (dbl), minute (dbl)
Filters are constructed of logical operators: <
, >
, <=
, >=
, ==
, !=
(and some others).
Adding them one by one to filter()
is akin to saying "this AND that". To say "this OR that OR both", use |.
filter(flights, month == 1 | month == 2)
## Source: local data frame [23,044 x 16] ## ## year month day dep_time dep_delay arr_time arr_delay carrier tailnum ## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr) ## 1 2014 1 1 1 96 235 70 AS N508AS ## 2 2014 1 1 4 -6 738 -23 US N195UW ## 3 2014 1 1 8 13 548 -4 UA N37422 ## 4 2014 1 1 28 -2 800 -23 US N547UW ## 5 2014 1 1 34 44 325 43 AS N762AS ## 6 2014 1 1 37 82 747 88 DL N806DN ## 7 2014 1 1 346 227 936 219 UA N14219 ## 8 2014 1 1 526 -4 1148 15 UA N813UA ## 9 2014 1 1 527 7 917 24 UA N75433 ## 10 2014 1 1 536 1 1334 -6 UA N574UA ## .. ... ... ... ... ... ... ... ... ... ## Variables not shown: flight (int), origin (chr), dest (chr), air_time ## (dbl), distance (dbl), hour (dbl), minute (dbl)
Construct filters to isolate:
filter(flights, month == 3, day == 17) filter(flights, dest == "ORD") filter(flights, dest == "ORD", carrier == "UA") filter(flights, distance > 2000 | air_time > 5 * 60)
arrange()
works similarly to filter()
except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
Use desc()
to sort in descending order.
arrange(flights, desc(arr_delay))
Often you work with large datasets with many columns where only a few are actually of interest to you. select()
allows you to rapidly zoom in on a useful subset:
select(flights, year, month, day)
You can exclude columns using -
and specify a range using :
.
select(flights, -(year:day))
A common use of select()
is to find out which values a set of variables takes. This is particularly useful in conjunction with the distinct()
verb which only returns the unique values in a table.
What do the following data correspond to?
distinct(select(flights, origin, dest))
## Source: local data frame [115 x 2] ## ## origin dest ## (chr) (chr) ## 1 PDX ANC ## 2 SEA CLT ## 3 PDX IAH ## 4 PDX CLT ## 5 SEA ANC ## 6 SEA DTW ## 7 SEA ORD ## 8 SEA DEN ## 9 SEA EWR ## 10 PDX DEN ## .. ... ...
As well as selecting from the set of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of mutate()
:
mutate(flights, gain = arr_delay - dep_delay)
## Source: local data frame [162,049 x 17] ## ## year month day dep_time dep_delay arr_time arr_delay carrier tailnum ## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr) ## 1 2014 1 1 1 96 235 70 AS N508AS ## 2 2014 1 1 4 -6 738 -23 US N195UW ## 3 2014 1 1 8 13 548 -4 UA N37422 ## 4 2014 1 1 28 -2 800 -23 US N547UW ## 5 2014 1 1 34 44 325 43 AS N762AS ## 6 2014 1 1 37 82 747 88 DL N806DN ## 7 2014 1 1 346 227 936 219 UA N14219 ## 8 2014 1 1 526 -4 1148 15 UA N813UA ## 9 2014 1 1 527 7 917 24 UA N75433 ## 10 2014 1 1 536 1 1334 -6 UA N574UA ## .. ... ... ... ... ... ... ... ... ... ## Variables not shown: flight (int), origin (chr), dest (chr), air_time ## (dbl), distance (dbl), hour (dbl), minute (dbl), gain (dbl)
summarize()
collapses a data frame to a single row. It's not very useful yet. sample_n()
provides you with a random sample of the rows.
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
## Source: local data frame [1 x 1] ## ## delay ## (dbl) ## 1 6.133859
sample_n(flights, 10)
## Source: local data frame [10 x 16] ## ## year month day dep_time dep_delay arr_time arr_delay carrier tailnum ## (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr) ## 1 2014 6 13 2051 1 2330 -4 AS N442AS ## 2 2014 5 3 1224 -6 1409 -11 AS N534AS ## 3 2014 5 17 1258 8 1521 6 WN N352SW ## 4 2014 4 23 1329 24 1909 20 AS N307AS ## 5 2014 6 12 1307 -8 1406 -5 OO N291SW ## 6 2014 7 9 1046 6 1411 -9 WN N261WN ## 7 2014 4 25 652 -8 742 -8 OO N219AG ## 8 2014 4 20 1307 -8 1534 -12 OO N814SK ## 9 2014 2 26 631 -9 844 -2 OO N216AG ## 10 2014 1 18 657 -3 1810 NA WN N204WN ## Variables not shown: flight (int), origin (chr), dest (chr), air_time ## (dbl), distance (dbl), hour (dbl), minute (dbl)
Mutate the data so that it includes a new variable that contains the average speed traveled by the plane for each flight. Select that new variable and save it, along with tailnum, as a new dataframe object called speed
.
flights <- mutate(flights, speed = distance / (air_time / 60)) speed <- select(flights, tailnum, speed)
These verbs become really powerful when you combine them with the idea of "group by", repeating the operation individually on groups of observations within the dataset. The group_by()
function describes how to break a dataset down into groups of rows.
You can then use the resulting object in exactly the same functions as above; they'll automatically work "by group" when the input is a grouped.
We can find the fastest airplanes in the bunch:
by_tailnum <- group_by(speed, tailnum) avg_speed <- summarize(by_tailnum, count = n(), avg_speed = mean(speed, na.rm = TRUE)) arrange(avg_speed, desc(avg_speed))
## Source: local data frame [3,023 x 3] ## ## tailnum count avg_speed ## (chr) (int) (dbl) ## 1 N306DQ 1 597.7982 ## 2 N860NW 1 589.7297 ## 3 N821NW 1 574.2105 ## 4 N831MH 1 574.0969 ## 5 N839MH 1 571.5789 ## 6 N5FPAA 1 569.4774 ## 7 N675NW 1 569.2174 ## 8 N5DWAA 1 567.5000 ## 9 N807NW 2 566.7639 ## 10 N5DNAA 2 564.8443 ## .. ... ... ...
Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the %>%
operator. You start the pipeline with a dataframe and then pass it to each function in turn.
speed %>% group_by(tailnum) %>% summarize(count = n(), avg_speed = mean(speed, na.rm = TRUE)) %>% arrange(desc(avg_speed))
## Source: local data frame [3,023 x 3] ## ## tailnum count avg_speed ## (chr) (int) (dbl) ## 1 N306DQ 1 597.7982 ## 2 N860NW 1 589.7297 ## 3 N821NW 1 574.2105 ## 4 N831MH 1 574.0969 ## 5 N839MH 1 571.5789 ## 6 N5FPAA 1 569.4774 ## 7 N675NW 1 569.2174 ## 8 N5DWAA 1 567.5000 ## 9 N807NW 2 566.7639 ## 10 N5DNAA 2 564.8443 ## .. ... ... ...
Form a chain that creates a dataframe containing only carrier and their average departure delay time. Which carriers have the highest and lowest mean delays?
flights %>% group_by(carrier) %>% summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% arrange(desc(avg_delay))
## Source: local data frame [11 x 2] ## ## carrier avg_delay ## (chr) (dbl) ## 1 WN 13.329137 ## 2 AA 10.597632 ## 3 F9 10.152473 ## 4 UA 9.795162 ## 5 B6 8.462857 ## 6 VX 7.852158 ## 7 DL 4.820384 ## 8 OO 4.435644 ## 9 AS 2.783932 ## 10 US 2.734850 ## 11 HA 2.580439
library(ggplot2) flights %>% group_by(carrier) %>% summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% arrange(desc(avg_delay)) %>% qplot(x = carrier, y = avg_delay, data = ., geom = "point")
.
When forming a chain, the %>%
operator will default to putting the output of one function into the first argument of the next. You can change where this argument "lands" by using .
flights %>% filter(origin == "PDX") %>% qplot(x = carrier, data = ., geom = "bar")