Homework #2

  • Aim for production quality plots here: title, axis labels, no overplotting.
  • You can set figure options in the R chunk header (fig.height, fig.width, fig.align).
  • Questions 7 and 8 are best answered with 2 plots, but one could suffice.
  • Make sure your document compiles correctly on the last question. Don't forget to submit your images on Moodle!
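
As a rough illustration of the first two points, here is a minimal sketch of a labeled scatterplot with reduced overplotting. The built-in mtcars data is a stand-in (not the homework data), and the chunk options in the comment are only example values.

```r
# In an R Markdown chunk header, figure options might look like:
#   {r, fig.height = 4, fig.width = 6, fig.align = "center"}
# mtcars is a stand-in dataset, not the homework data.

# Semi-transparent points make overlapping observations easier to see.
point_col <- adjustcolor("black", alpha.f = 0.5)

plot(mpg ~ wt, data = mtcars,
     pch = 19, col = point_col,
     main = "Fuel economy versus weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")
```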

Numerical descriptors

Two (numerical) Variables

  1. Shape: linear, quadratic
  2. Direction: positive or negative slope or curvature
  3. Strength: how tightly clustered?

A measure of the strength of a linear relationship: r, the correlation coefficient (cor()).
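
A minimal sketch of cor() in use, assuming the built-in mtcars data as a stand-in for the course data. It also shows one thing to watch for: with the default settings, a single missing value makes the result NA.

```r
# mtcars is a stand-in dataset chosen for illustration.
r <- cor(mtcars$wt, mtcars$mpg)   # Pearson correlation by default
r

# r is unchanged if the two variables are swapped.
all.equal(r, cor(mtcars$mpg, mtcars$wt))

# Missing values make cor() return NA unless you say how to handle them.
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 5, 8, 11)
cor(x, y)                        # NA
cor(x, y, use = "complete.obs")  # uses only the 4 complete pairs
```

Here r is strongly negative: heavier cars tend to get fewer miles per gallon.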

Graphical descriptors

Scatterplot (base)

plot(x = nc$fage, y = nc$mage)

plot(mage ~ fage, data = nc)

Scatterplot (ggplot2)

library(ggplot2)
ggplot(nc, aes(x = fage, y = mage)) + geom_point()

ggplot2 sandbox

library(mosaic)  # provides mplot(); interactive, intended for use in RStudio
mplot(nc)
mplot(mtcars, default = "histogram")

Two univariate distributions

par(mfrow = c(2,1))
plot(density(d$activity), xlim = c(1000000, 1500000), main = "")
plot(density(d$homework), xlim = c(1000000, 1500000), main = "")

One bivariate distribution: scatterplot

ggplot(data = d, aes(x = activity, y = homework)) +
  geom_jitter() +
  theme(legend.position = "none") +
  labs(title = "")

Contour plot

Image plot / heat map

Perspective plot
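
Contour, image, and perspective plots are three views of the same estimated bivariate density. The sketch below is one way to produce them, not necessarily how the slides were made: it assumes the built-in faithful data as a stand-in and uses kde2d() from the MASS package to estimate the density on a grid.

```r
library(MASS)  # ships with standard R installations; provides kde2d()

# faithful is a stand-in dataset: eruption length and waiting time.
# kde2d() returns a list with grid coordinates x, y and density matrix z.
dens <- kde2d(faithful$eruptions, faithful$waiting, n = 50)

par(mfrow = c(1, 3))
contour(dens, main = "Contour plot",
        xlab = "Eruption length (min)", ylab = "Waiting time (min)")
image(dens, main = "Image plot / heat map",
      xlab = "Eruption length (min)", ylab = "Waiting time (min)")
persp(dens, theta = 30, phi = 30, main = "Perspective plot",
      xlab = "Eruption length", ylab = "Waiting time", zlab = "Density")
par(mfrow = c(1, 1))
```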

Getting beyond 2 dimensions

Fisher's Irises: Is there a relationship between sepal length and sepal width of irises? I.e., if you have measurements on one, can you predict what the other will be?

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_jitter() +
  theme(legend.position = "none") +
  labs(title = "Iris Data")

No, but what if you incorporate a third variable: species?

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  facet_wrap(~Species, ncol = 4) +
  theme(legend.position = "none") +
  labs(title = "")

You can add a third (categorical) variable by:

  1. Color: mapping the category of that observation to a discrete color scale.

  2. Faceting: split the data and draw a separate scatterplot for each category (preferable when the data are dense).

Key idea: Once you get into more than 2 dimensions, you can describe the structure jointly or conditionally, that is, controlling for certain variables.
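
To make the joint versus conditional distinction concrete, here is a small sketch using the built-in iris data. It computes the sepal correlation once for all flowers and once per species, then draws the color-coded scatterplot in base graphics. The variable names r_joint and r_by_species are chosen for this illustration.

```r
# Joint description: one correlation for all 150 flowers.
r_joint <- cor(iris$Sepal.Length, iris$Sepal.Width)

# Conditional description: one correlation per species.
r_by_species <- sapply(split(iris, iris$Species),
                       function(s) cor(s$Sepal.Length, s$Sepal.Width))
r_joint
r_by_species

# Color: species mapped to a discrete color scale.
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = as.integer(iris$Species), pch = 19,
     main = "Iris sepals by species",
     xlab = "Sepal length (cm)", ylab = "Sepal width (cm)")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```

The pooled correlation is slightly negative, while the correlation within each species is positive, so conditioning on species changes the answer.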

You can add a third (numerical) variable by:

  1. Color: map the value of the variable for each observation to a continuous color scale.

  2. Size: map the variable to the size of that point.

  3. Euclidean dimension: add a third dimension!
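
The first two options can be sketched in ggplot2 by adding aesthetics to aes(). The example assumes the built-in mtcars data as a stand-in; the choice of hp for color and disp for size is arbitrary.

```r
library(ggplot2)

# mtcars is a stand-in dataset chosen for illustration.
# hp is mapped to a continuous color scale, disp to point size.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp, size = disp)) +
  geom_point(alpha = 0.7) +
  labs(title = "Fuel economy versus weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Horsepower", size = "Displacement (cu. in.)")
p
```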

3D scatterplot
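
One way to draw a 3D scatterplot with base graphics only is to let persp() set up the axes and then project the points with trans3d(). This is a sketch under that assumption, using the built-in mtcars data as a stand-in; add-on packages offer more convenient tools.

```r
# persp() draws the axes box with a flat floor at the minimum mpg
# and returns the 4 x 4 viewing transformation matrix.
pmat <- persp(x = range(mtcars$wt), y = range(mtcars$hp),
              z = matrix(min(mtcars$mpg), nrow = 2, ncol = 2),
              zlim = range(mtcars$mpg),
              theta = 35, phi = 20, border = "grey",
              ticktype = "detailed",
              xlab = "Weight", ylab = "Horsepower", zlab = "MPG",
              main = "3D scatterplot of mtcars")

# Project the 3D points onto the 2D plotting surface and draw them.
pts <- trans3d(mtcars$wt, mtcars$hp, mtcars$mpg, pmat)
points(pts, pch = 19)
```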

3D density plot

Visualizations in yet higher dimensions

http://www.gapminder.org/world

  1. How many variables/columns/dimensions are displayed?

  2. Is there any structure that is persistent over an entire variable?

  3. How would you characterize the relationship between fertility and life expectancy during the 1950s and 1960s when comparing the industrialized western nations to the rest of the world?

High dimension, high art