Exercise 1

filter(flights, month == 3, day == 17)
filter(flights, dest == "ORD")
filter(flights, dest == "ORD", carrier == "UA")
filter(flights, distance > 2000 | air_time > 5 * 60)

Exercise 2

# air_time is in minutes, so dividing by 60 gives speed in miles per hour
flights2 <- mutate(flights, speed = distance / (air_time / 60))
speed <- select(flights2, tailnum, speed)

Exercise 3

flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
## Source: local data frame [11 x 2]
## 
##    carrier avg_delay
##      (chr)     (dbl)
## 1       WN 13.329137
## 2       AA 10.597632
## 3       F9 10.152473
## 4       UA  9.795162
## 5       B6  8.462857
## 6       VX  7.852158
## 7       DL  4.820384
## 8       OO  4.435644
## 9       AS  2.783932
## 10      US  2.734850
## 11      HA  2.580439

Southwest (WN) has the highest mean departure delay, at about 13.3 minutes, and Hawaiian Airlines (HA) has the lowest, at about 2.6 minutes.


On your own

For the following questions, it may be helpful to sketch out what you want the final data set to look like (what are the rows? what are the columns?), then work backwards to figure out how to create that dataset from the original data.

delay <- flights %>%
  group_by(tailnum) %>%
  summarize(count = n(),
            delay = mean(arr_delay, na.rm = TRUE),
            dist = mean(distance)) %>%
  filter(count > 20, dist < 2000)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

Planes whose flights average just under 1000 miles tend to have the highest mean arrival delays. Planes that make longer flights (around 1500 miles on average) have lower mean arrival delays. This is somewhat surprising. It may suggest that after a departure delay, a plane can make up more of the lost time in the air if it tends to fly longer routes. Other explanations are possible, though, including carrier as a confounder. Southwest makes primarily regional flights, and we saw earlier that it has the highest mean departure delay, so its planes might make up the majority of the planes averaging under 1000 miles.

flights %>%
  # this is one possible measure of overall delay
  mutate(tot_delay = dep_delay + arr_delay) %>%
  group_by(origin) %>%
  summarize(avg_tot_delay = mean(tot_delay, na.rm = TRUE))
## Source: local data frame [2 x 2]
## 
##   origin avg_tot_delay
##    (chr)         (dbl)
## 1    PDX      7.388994
## 2    SEA      8.820187

According to my measure of overall delay, the 2014 data suggest that PDX had a lower mean overall delay, at 7.39 minutes versus 8.82 minutes for SEA, so I’d fly out of PDX.
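One detail of this measure is worth checking: the sum dep_delay + arr_delay is NA whenever either delay is missing, so na.rm = TRUE drops every flight that lacks one of the two values. A minimal base R sketch with made-up delay values (not taken from the flights data) shows the effect:

```r
# Made-up delays for four hypothetical flights; NA marks a missing value.
dep_delay <- c(5, NA, -2, 10)
arr_delay <- c(3, 12, NA, 20)

# The sum is NA wherever either component is NA.
tot_delay <- dep_delay + arr_delay

# Only flights with both delays recorded contribute to the mean.
n_used <- sum(!is.na(tot_delay))
avg_tot_delay <- mean(tot_delay, na.rm = TRUE)
```

Here only two of the four flights contribute to the average, so it may be worth counting the non-missing rows alongside the mean when comparing airports.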

flights %>%
  filter(origin == "PDX", dest == "JFK") %>%
  mutate(date = as.Date(paste(month, day, year, sep = "-" ), "%m-%d-%Y")) %>%
  mutate(day_of_week = weekdays(date)) %>%
  group_by(day_of_week) %>%
  summarize(count = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(delay)
## Source: local data frame [7 x 3]
## 
##   day_of_week count    delay
##         (chr) (int)    (dbl)
## 1      Monday   137 2.157895
## 2    Saturday    90 2.443182
## 3      Friday   137 4.216418
## 4   Wednesday   109 5.250000
## 5    Thursday   136 6.389313
## 6      Sunday   136 6.567164
## 7     Tuesday   109 7.669811

If you define delay as the mean arrival delay, then you’d be best served by traveling on a Monday or a Saturday.
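The date step in the pipeline above can be checked on a single value. This is a minimal base R sketch using March 17, 2014 as an arbitrary example date. Note that weekdays() returns names in the current locale, so English day names are only expected in an English locale, while format() with "%u" gives the ISO weekday number (1 is Monday) regardless of locale.

```r
# Build a Date from separate month, day, and year values, as the pipeline does.
month <- 3; day <- 17; year <- 2014   # arbitrary example date
date <- as.Date(paste(month, day, year, sep = "-"), "%m-%d-%Y")

day_name <- weekdays(date)         # weekday name, spelled per the current locale
day_number <- format(date, "%u")   # ISO weekday number as a string; "1" is Monday
```

Grouping by the weekday number instead of the name would also make it easy to list the days in calendar order rather than alphabetically.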

flights %>%
  filter(origin == "PDX", dest == "JFK") %>%
  group_by(carrier) %>%
  summarize(count = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(delay)
## Source: local data frame [2 x 3]
## 
##   carrier count    delay
##     (chr) (int)    (dbl)
## 1      DL   546 3.431262
## 2      B6   308 7.729373

There are only two carriers that fly between PDX and JFK: Delta and JetBlue. I’d recommend taking Delta in general, as their average arrival delay across all days is less than half of that of JetBlue. You can also filter on the day of the week being Monday or Tuesday, and the recommendation remains the same.

Challenge: Remake two plots that show up in the slides - the points plot of average delays by carrier and the bar chart of total count of flights by carrier out of PDX - but display the carriers in descending order.

There are many ways to do this. Here are two possibilities.

flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  # reorder() sorts levels in ascending order, so negate the delay for descending
  ggplot(aes(x = reorder(carrier, -avg_delay), y = avg_delay)) +
  geom_point()

flights %>%
  filter(origin == "PDX") %>%
  group_by(carrier) %>%
  summarize(count = n()) %>%
  # geom_col() takes bar heights directly from the precomputed counts
  ggplot(aes(x = reorder(carrier, -count), y = count)) +
  geom_col()
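The descending order in both plots comes from how reorder() arranges factor levels: ascending by a score computed for each level. A minimal base R sketch with made-up values for three carriers (deliberately unsorted, and not the values from the flights data) shows that negating the score yields descending order:

```r
# Made-up average delays for three carriers, deliberately not sorted.
carrier <- c("AS", "WN", "DL")
avg_delay <- c(2.8, 13.3, 4.8)

# reorder() sorts levels by ascending score, so negating gives descending delay.
ordered_carrier <- reorder(carrier, -avg_delay)
levels(ordered_carrier)   # "WN" "DL" "AS"
```

Because the levels are derived from the scores themselves, this approach does not depend on the rows being sorted beforehand.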