Dataframe: nc

download.file("http://www.openintro.org/stat/data/nc.RData",
              destfile = "nc.RData")
load("nc.RData")
  • What are the dimensions?
  • What mode of data is in each column? (Hint: try head().)

Dataframe: nc

class(nc)
## [1] "data.frame"
dim(nc)
## [1] 1000   13
length(nc)
## [1] 13
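
A small sketch may help explain why length() returns the number of columns rather than the number of rows. The data frame toy below is made up for illustration and is not part of nc; the point is only that a data frame is stored as a list of equal-length columns.

```r
# Hypothetical toy data frame (not the nc data), used only for illustration
toy <- data.frame(
  mage   = c(25, 31, 19, 40),
  gender = c("female", "male", "female", "male"),
  weight = c(7.1, 6.5, 8.0, 7.4),
  stringsAsFactors = FALSE
)

dims     <- dim(toy)            # rows first, then columns
n_rows   <- nrow(toy)
n_cols   <- ncol(toy)
n_len    <- length(toy)         # a data frame is a list of columns, so this equals ncol(toy)
col_type <- sapply(toy, class)  # class of each column

print(dims)
print(col_type)
```

The same calls should work unchanged on nc once it has been loaded.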

Dataframes

  1. What attributes() does this dataframe have?
  2. What happens when you convert this dataframe into a matrix?

Subsetting columns

$ is an operator used to access a column (vector) from a dataframe.

head(nc$fage)
## [1] NA NA 19 21 NA NA
length(nc$fage)
## [1] 1000

The resulting column can be subset just like any other vector.

nc$fage[3]
## [1] 19
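
As a minimal sketch of how $ relates to the bracket operators, the example below uses a made-up data frame named toy (an assumption for illustration, not the nc data). It compares $, double brackets, and single brackets on the same column.

```r
# Hypothetical toy data frame standing in for nc
toy <- data.frame(fage = c(NA, 30, 19, 21), mage = c(25, 31, 19, 40))

by_dollar <- toy$fage        # the column as a plain vector
by_name   <- toy[["fage"]]   # same column, selected by name with double brackets
by_single <- toy["fage"]     # single brackets keep a one-column data frame

third <- toy$fage[3]         # index the extracted vector like any other vector
```

Double brackets are handy when the column name is stored in a variable, since $ only accepts a literal name.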

Subsetting as a matrix

You can also subset the whole dataframe like a matrix. What do you think the following commands do?

nc[1:10, 1:2]
nc[1, ]
nc[nc$gender == "female", "premie"]

Note that the last command could also be written using vector subsetting.

nc$premie[nc$gender == "female"]
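
To check that the two forms agree, here is a sketch on a small made-up data frame. The name toy and its values are assumptions for illustration; only the column names mirror nc.

```r
# Hypothetical toy data frame with a few nc-style columns
toy <- data.frame(
  gender = c("female", "male", "female", "male", "female"),
  premie = c("full term", "premie", "premie", "full term", "full term"),
  weeks  = c(39, 35, 36, 40, 41),
  stringsAsFactors = FALSE
)

first_rows <- toy[1:2, 1:2]   # rows 1-2, columns 1-2: still a data frame
one_row    <- toy[1, ]        # an empty column slot means "all columns"

# Rows where gender is "female", column "premie" only
matrix_style <- toy[toy$gender == "female", "premie"]
vector_style <- toy$premie[toy$gender == "female"]

print(identical(matrix_style, vector_style))
```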

Two ways to subset

  1. Subset by index: specify inside the square brackets exactly which elements you want.

    nc$gender[1:10]
  2. Subset by logical: specify a condition inside the brackets, using logical operators, that evaluates to a TRUE/FALSE vector of the same length as the vector being subset.

    nc$gender[nc$premie == "premie"]

    Note that the vector that you're using to specify the condition can be different from the one you're subsetting, but it has to be of the same length.
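
The two approaches can be sketched side by side on a made-up data frame (toy and its values are assumptions, not the nc data). The example also shows one thing to watch for: a missing value in the condition produces an NA in the result, which which() avoids.

```r
# Hypothetical toy data frame with one missing value in visits
toy <- data.frame(
  gender = c("female", "male", "female", "male"),
  visits = c(10, NA, 4, 12),
  stringsAsFactors = FALSE
)

by_index <- toy$gender[c(1, 3)]       # subset by index: positions 1 and 3

cond <- toy$visits >= 10              # logical vector, one value per row
by_logical <- toy$gender[cond]        # an NA in the condition yields an NA element

by_which <- toy$gender[which(cond)]   # which() keeps only the TRUE positions
```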

Functions on vectors

Many statistical functions take a vector as an argument (mean(), sum(), median(), sd(), max(), min(), etc.). Their results can be used inside a subsetting condition.

nc[nc$mage == max(nc$mage), ]
##      fage mage     mature weeks    premie visits     marital gained weight
## 1000   45   50 mature mom    39 full term     14 not married     23   7.13
##      lowbirthweight gender     habit whitemom
## 1000        not low female nonsmoker    white
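
The same pattern needs some care when a column contains missing values, as fage does in nc. The sketch below uses a made-up data frame (toy is an assumption for illustration) to show that max() returns NA unless na.rm = TRUE is set.

```r
# Hypothetical toy data frame; fage has a missing value, mage does not
toy <- data.frame(
  fage = c(NA, 30, 45, 21),
  mage = c(25, 31, 19, 40)
)

oldest_mom <- toy[toy$mage == max(toy$mage), ]   # works because mage has no NAs

max_plain <- max(toy$fage)                 # NA, because fage contains a missing value
max_clean <- max(toy$fage, na.rm = TRUE)   # drop NAs before taking the maximum

# which() discards the NA comparison so only the matching row is returned
oldest_dad <- toy[which(toy$fage == max_clean), ]
```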

Fix the following subsetting errors using data from nc.

nc[nc$visits = 4, ]
nc[-1:4, ]
nc[nc$visits <= 5]
nc[nc$visits == 4 | 6, ]