filter(flights, month == 3, day == 17)
filter(flights, dest == "ORD")
filter(flights, dest == "ORD", carrier == "UA")
filter(flights, distance > 2000 | air_time > 5 * 60)
flights2 <- mutate(flights, speed = distance/(air_time/60))
speed <- select(flights2, tailnum, speed)
flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
## Source: local data frame [11 x 2]
##
##    carrier avg_delay
##      (chr)     (dbl)
## 1       WN 13.329137
## 2       AA 10.597632
## 3       F9 10.152473
## 4       UA  9.795162
## 5       B6  8.462857
## 6       VX  7.852158
## 7       DL  4.820384
## 8       OO  4.435644
## 9       AS  2.783932
## 10      US  2.734850
## 11      HA  2.580439
Southwest (WN) has the highest mean departure delay, at about 13.3 minutes, and Hawaiian Airlines (HA) has the lowest, at about 2.6 minutes.
For the following questions, it may be helpful to sketch out what you want the final data set to look like (what are the rows? what are the columns?), then work backwards to figure out how to create that dataset from the original data.
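As a small illustration of working backwards, suppose the target table has one row per carrier, with columns for the number of flights and the mean departure delay. The sketch below builds that table from a tiny made-up data frame; `toy_flights`, `by_carrier`, and all values are invented for illustration and are not taken from the 2014 data.

```r
library(dplyr)

# Hypothetical toy data standing in for `flights`; the values are made up.
toy_flights <- data.frame(
  carrier   = c("AS", "AS", "WN", "WN", "WN"),
  dep_delay = c(5, NA, 20, 10, 0),
  stringsAsFactors = FALSE
)

# Target: one row per carrier, with columns n_flights and avg_delay.
#   rows are carriers      -> group_by(carrier)
#   columns are summaries  -> summarize(...)
by_carrier <- toy_flights %>%
  group_by(carrier) %>%
  summarize(n_flights = n(),
            avg_delay = mean(dep_delay, na.rm = TRUE))

by_carrier
```

Deciding on the rows first usually tells you what to group by, and deciding on the columns tells you what to compute inside `summarize()`.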
delay <- flights %>%
  group_by(tailnum) %>%
  summarize(count = n(),
            delay = mean(arr_delay, na.rm = TRUE),
            dist = mean(distance)) %>%
  filter(count > 20, dist < 2000)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()
Planes whose flights average just under 1000 miles tend to have the highest mean arrival delays. Planes making longer flights (around 1500 miles) have lower mean arrival delays. This is somewhat surprising. It may suggest that after a departure delay, a plane can make up more of the lost time if it tends to fly longer routes. Other explanations are possible, though, including carrier as a confounder. Southwest makes primarily regional flights, and we saw earlier that it has the highest mean departure delay, so its planes may account for most of the planes averaging under 1000 miles.
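One way to probe the confounding idea is to check whether carrier and distance move together. The sketch below does this on a small invented data frame (`toy_flights`, its two carriers, and all numbers are assumptions chosen to make the pattern visible, not results from the real data).

```r
library(dplyr)

# Hypothetical per-flight data; values invented for illustration.
toy_flights <- data.frame(
  carrier   = c("WN", "WN", "WN", "DL", "DL", "DL"),
  distance  = c(500, 700, 900, 1500, 1800, 2100),
  arr_delay = c(11, 13, 12, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Pooled over all flights, delay appears to fall as distance rises ...
overall_cor <- cor(toy_flights$distance, toy_flights$arr_delay)

# ... but a per-carrier summary shows that the short flights and the
# high delays both belong to the same carrier.
carrier_summary <- toy_flights %>%
  group_by(carrier) %>%
  summarize(mean_dist  = mean(distance),
            mean_delay = mean(arr_delay, na.rm = TRUE))

overall_cor
carrier_summary
```

Running the same `group_by(carrier)` summary on `flights` would show whether the pattern in the plot could be driven by which carriers fly which distances.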
flights %>%
  # this is one possible measure of overall delay
  mutate(tot_delay = dep_delay + arr_delay) %>%
  group_by(origin) %>%
  summarize(avg_tot_delay = mean(tot_delay, na.rm = TRUE))
## Source: local data frame [2 x 2]
##
##   origin avg_tot_delay
##    (chr)         (dbl)
## 1    PDX      7.388994
## 2    SEA      8.820187
According to my measure of overall delay, the 2014 data suggest that PDX had a lower mean overall delay, at 7.39 minutes versus 8.82 minutes for SEA, so I’d fly out of PDX.
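One detail of this measure is worth seeing on a small example: adding two columns gives `NA` whenever either value is `NA`, so `na.rm = TRUE` silently drops any flight missing a departure or an arrival delay. The sketch below uses an invented data frame (`toy_flights` and `n_used` are hypothetical names, and the values are made up) to count how many flights actually enter each mean.

```r
library(dplyr)

# Hypothetical toy data; the values are made up.
toy_flights <- data.frame(
  origin    = c("PDX", "PDX", "SEA", "SEA"),
  dep_delay = c(10, 5, 0, NA),
  arr_delay = c(20, NA, 4, NA),
  stringsAsFactors = FALSE
)

tot_by_origin <- toy_flights %>%
  mutate(tot_delay = dep_delay + arr_delay) %>%   # NA if either delay is NA
  group_by(origin) %>%
  summarize(n_flights     = n(),
            n_used        = sum(!is.na(tot_delay)),  # flights entering the mean
            avg_tot_delay = mean(tot_delay, na.rm = TRUE))

tot_by_origin
```

Comparing `n_flights` with `n_used` on the real data would show how many flights each airport's average leaves out.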
flights %>%
  filter(origin == "PDX", dest == "JFK") %>%
  mutate(date = as.Date(paste(month, day, year, sep = "-"), "%m-%d-%Y")) %>%
  mutate(day_of_week = weekdays(date)) %>%
  group_by(day_of_week) %>%
  summarize(count = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(delay)
## Source: local data frame [7 x 3]
##
##   day_of_week count    delay
##         (chr) (int)    (dbl)
## 1      Monday   137 2.157895
## 2    Saturday    90 2.443182
## 3      Friday   137 4.216418
## 4   Wednesday   109 5.250000
## 5    Thursday   136 6.389313
## 6      Sunday   136 6.567164
## 7     Tuesday   109 7.669811
If you define delay as the arrival delay, then you’d be best served by traveling on a Monday or a Saturday.
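The day-of-week step can be sketched on a few made-up dates. Because `weekdays()` returns names in the current locale, this variant derives the ISO weekday number with `format(date, "%u")` (Monday is 1) and attaches English labels by hand. The data frame `toy_flights` and the column `day_num` are hypothetical, and this is only one possible approach.

```r
library(dplyr)

# Hypothetical toy data: three dates in 2014.
toy_flights <- data.frame(year = 2014, month = c(3, 3, 12), day = c(17, 18, 6))

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

toy_dates <- toy_flights %>%
  mutate(date        = as.Date(ISOdate(year, month, day)),
         # ISO weekday number, 1 (Monday) through 7 (Sunday); locale-independent
         day_num     = as.integer(format(date, "%u")),
         # factor levels keep the days in calendar order when printed or plotted
         day_of_week = factor(day_num, levels = 1:7, labels = day_names))

toy_dates
```

Storing the day as a factor with levels in calendar order also means that `group_by(day_of_week)` returns the days Monday through Sunday rather than alphabetically.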
flights %>%
  filter(origin == "PDX", dest == "JFK") %>%
  group_by(carrier) %>%
  summarize(count = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(delay)
## Source: local data frame [2 x 3]
##
##   carrier count    delay
##     (chr) (int)    (dbl)
## 1      DL   546 3.431262
## 2      B6   308 7.729373
Only two carriers fly between PDX and JFK: Delta and JetBlue. I’d recommend taking Delta in general, as its average arrival delay across all days is less than half that of JetBlue. You can also filter on the day of the week being Monday or Tuesday, and the recommendation remains the same.
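The Monday-or-Tuesday check can be sketched with `%in%` inside `filter()`. The data frame below is invented for illustration (`toy_flights`, `mon_tue`, and all values are assumptions) and already carries a `day_of_week` column like the one computed earlier.

```r
library(dplyr)

# Hypothetical toy data; the values are made up.
toy_flights <- data.frame(
  carrier     = c("DL", "DL", "B6", "B6", "DL", "B6"),
  day_of_week = c("Monday", "Tuesday", "Monday", "Tuesday", "Friday", "Friday"),
  arr_delay   = c(2, 4, 6, 10, 30, -5),
  stringsAsFactors = FALSE
)

mon_tue <- toy_flights %>%
  # keep only flights on the two days of interest
  filter(day_of_week %in% c("Monday", "Tuesday")) %>%
  group_by(carrier) %>%
  summarize(count = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(delay)

mon_tue
```

On the real data, the same `filter()` line would go after the `mutate()` step that creates `day_of_week`.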
Challenge: Remake two plots that show up in the slides - the points plot of average delays by carrier and the bar chart of total count of flights by carrier out of PDX - but display the carriers in descending order.
There are many ways to do this. Here are two possibilities.
flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  # reorder() sorts carriers by -avg_delay, i.e. from highest to lowest delay
  qplot(x = reorder(carrier, -avg_delay), y = avg_delay, data = ., geom = "point")
flights %>%
  filter(origin == "PDX") %>%
  group_by(carrier) %>%
  summarize(count = n()) %>%
  # reorder() by -count puts the busiest carrier first;
  # stat = "identity" plots the precomputed counts as bar heights
  ggplot(aes(x = reorder(carrier, -count), y = count)) +
  geom_bar(stat = "identity")
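To see why reordering the factor changes the axis order, it can help to look at the factor levels directly. `reorder()` sorts the levels by increasing value of its second argument, so negating that value gives descending order. The sketch below uses an invented summary table (`toy_counts` and its counts are hypothetical).

```r
# Hypothetical summary table; the counts are invented for illustration.
toy_counts <- data.frame(
  carrier = c("AS", "DL", "WN"),
  count   = c(50, 10, 30),
  stringsAsFactors = FALSE
)

# Default ordering: levels sorted by increasing count
ascending <- reorder(toy_counts$carrier, toy_counts$count)

# Negated counts: levels sorted from largest to smallest count
descending <- reorder(toy_counts$carrier, -toy_counts$count)

levels(ascending)
levels(descending)
```

ggplot2 draws a discrete axis in level order, which is why wrapping `carrier` in `reorder()` is enough to change the order of the points or bars.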