Overview

R is a powerful and open source statistical computing and graphical language and environment. The goal of this tutorial is to give a concise introduction to R and is aimed at Civil and Environmental Engineering undergraduate and graduate students. This tutorial is by no means comprehensive. There are several very good detailed sources for R. Some of the popular R books are:

There are also several free resources available. I highly recommend reading the ``Introduction to R’’ pdf document which comes with the R installation. There are also several youtube videos and courses on MOOCs on using R for data analysis.

Note

If the figures in this tutorial appear too small or too big, try a different browser. If the formatting appears odd on multiple browsers, send me an email and I will do my best to fix the issue.

Installing R and R Studio

This section provides instructions for installing R as well as RStudio.

Installing R

  • Go to the webpage: http://cran.cnr.berkeley.edu/
  • There are three links on the webpage:
    • Download R for Linux
    • Download R for MacOSX
    • Download R for Windows.
  • The rest of the instructions assume that you have a windows operating system
  • Click on Download R for Windows link.
  • You will reach R for windows webpage. Click on ``install R for the first time’’ link
  • In the next page click on Download R 3.1.3 for Windows link. Note that 3.1.3 corresponds to the R version number. This will get updated frequently as the software is being developed.
  • Save the ``R-3.1.3-win.exe’’ file to any directory on your computer
  • Double click the file to start the installation and use the default settings for all installation related settings.
  • The Setup Language is English. Choose default options for the rest of the installation steps.
  • At the end of the installation you should see an icon with a big R on your desktop. If you have a 64 bit operation system you will see two icons as shown below:
  • If you double click on the icon you should get a RGui startup screen as shown below. Close the RGui and proceed with the installation of RStudio.

Installing RStudio

  • RStudio is a very good Integrated Development Environment (IDE) for R.
  • Go to: http://www.rstudio.com/products/rstudio/download/
  • Download the appropriate installer for your operating system.
  • Double click on the installation file and follow the default settings.
  • Once the installation is complete, open RStudio and you should get the following window:

Basic R commands

From now on we will be using RStudio to run the R commands. Open RStudio.

Your first R commands

R can be used as a calculator. Run the following commands in the console window. Type the following expression and press enter.

> 10 + 20
[1] 30
> 4^2
[1] 16
> 3 + (18/5)
[1] 6.6

Type x <- 20 and press enter. Then type x and press enter. You should get the following results.

> x <- 20
> x
[1] 20

This is an example of an R expression. <- is the assignment operator which assigns the value 20 to the R object named x. When you type x and press enter, the value stored in x is displayed.

Setting your working directory

The working directory corresponds to the folder on the computer which R looks for by default when you give the command to read a file or save plots and results. You can obtain the current working directory using the getwd() command.

> getwd()
[1] "C:/Users/avinash/Dropbox/Teaching/Tutorial/R_Tutorial New"

If I ask R to save a plot, it will save to the directory displayed above by default. You can change the working directory using setwd("Path of Folder"). I always recommend students to change the working directory to something which is convenient for them. For example if you want to store all R input and output files in the folder C:\Courses\CE410\RStuff. You can do so by executing the following command

> setwd("C:/Courses/CE410/RStuff")

Note that R uses forward slashes.

Console versus R Scripts

You can run R commands either through the console as we have done until now or using R scripts. For the purpose of this class, I want all of you to get comfortable using R scripts. You will be submitting the final R scripts in your assignments.

In RStudio you can create a R script by clicking File, New File, and selecting R script. Then give the new script a descriptive name by clicking File and then Save as. I am naming this new R Script FirstRScript.R. The .R extension signifies that it is an R script. Save the file in your working directory.

Then type the earlier commands as shown in the figure below.

Note that R does not execute a command in an R script when you press enter. In the R script # symbol helps you write comments. Anything to the right of the # symbol is not executed. When you write longer scripts, this is really useful as sometimes you may forget why you ran a particular command.

In order to execute the commands, select a line or a group of lines and press the Run button. You will notice that the commands on the selected lines will be executed and the results are shown in the console.

Using a script file is convenient when you have to run the same set of commands repeatedly. For example, if you are working for the Department of Transportation and let us say you need to analyze traffic volume data every month. If you create a script file containing the various statistical analysis procedures you want to conduct every month, then all you will need to do is execute the script file every month with a new set of input data.

Expressions

All standard mathematical operators +, -, / , *, ^ and functions log, exp, sin, sqrt work in R.

> x <- 20
> x
[1] 20
> y <- x + 5
> y
[1] 25
> z <- sqrt(x)
> z
[1] 4.472136
> u <- x*y
> u
[1] 500
> v <- y^z
> v
[1] 1785574
> w <- log(u + v)
> w
[1] 14.39553

NOTE:

  • R is case sensitive. Therefore u is different from U.
  • R works on objects. Everything created and manipulated in R is an object. In the boave example, x, y, z, u, v, w are all objects.
  • The command ls() displayes all objects in the workspace.
  • The command rm("object name 1") removes the particular object from the workspace.
  • The command rm(list=ls()) removes all the objects from the workspace.

Exercise:

  • Assign the value 7 to an R object r.
  • Calculate the area of a circle with radius equal to 7 and assign the value to a symbol A.
  • Evaluate the command A > 100.
  • Evaluate the command A == 100.

Atomic Data Types in R

R has the following basic data types:

  • character
  • integer
  • numeric
  • logical
  • complex

A character object is used to store strings. In the following example x is a character object which stores the string ``Avi’’. A string object is specified by quotes (single and doubt quotes work)

> x <- "Avi"
> y <- 'nash'
> x
[1] "Avi"
> y
[1] "nash"

Integer objects are used to store integers. We will focus on numeric objects which can store both integers and real numbers. In the example shown below, x is a numeric object.

> x <- 2.5

A logical object can take only true or false values. More on that later. The class() and typeof() are useful commands to get more information about the objects. Note that double is a type of numeric data type.

> x <- "Avi"
> class(x)
[1] "character"
> typeof(x)
[1] "character"
> y <- 2.5
> class(y)
[1] "numeric"
> typeof(y)
[1] "double"

Data Structures in R

R has several data structures:

  • Vectors
  • Lists
  • Matrices
  • Data Frames

Vector

The simplest data structure in R is a vector. A vector is a collection of objects of the same type or class. Vectors can be created by the c() function.

> x <- c(2, 3, 8.2, 12)
> x
[1]  2.0  3.0  8.2 12.0
> x[2]
[1] 3
> x[3]
[1] 8.2
> class(x)
[1] "numeric"
> typeof(x)
[1] "double"

In the above expression the vector x is assigned the values within the c() function. Note that x[2] prints the second element and x[3] the third element and so on. The class() and typeof() are useful commands to get more information about vector objects also.

Sequence of numbers can be generated using the colon operator. In the example below the vector x contains 1,2,3,...,20.

> x <- 1:20
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Another way to generate a sequence of numbers is using seq() function. In the example below, the vector x is assigned a sequence of numbers from 1 to 5 at intervals of 0.25. The length() returns the number of elements in the vector.

> x <- seq(1, 5, 0.25)
> x
 [1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25
[15] 4.50 4.75 5.00
> length(x)
[1] 17

The rep() command can be used to create a vector by repeating any R object. For example in the command below, the x vector contains 5 repeated 10 times.

> x <- rep(5, 10)
> x
 [1] 5 5 5 5 5 5 5 5 5 5

Here the sequence 2, 3, 4 is repeated 3 times.

> x <- rep(2:5,3)
> x
 [1] 2 3 4 5 2 3 4 5 2 3 4 5

We can perform arithmetic operations on vectors. The commans x*2 multiplies every element in the vector x by 2. The command log(x) takes the logarithm of each element of x.

> x <- 5:10
> x
[1]  5  6  7  8  9 10
> y <- x*2
> y
[1] 10 12 14 16 18 20
> z <- c(100, 300, 500, 700, 900, 1000)
> z*x
[1]   500  1800  3500  5600  8100 10000
> z/x
[1]  20.00000  50.00000  71.42857  87.50000 100.00000 100.00000

For numeric vectors, the average, sample standard deviation, median can be calculated as follows. The summary() command provides useful summary statistics. The order() lists the indices in ascending order. Note the default order is ascending.

> x <- c(23, 35, 47,21, 38)
> mean(x)
[1] 32.8
> var(x)
[1] 117.2
> sd(x)
[1] 10.82589
> median(x)
[1] 35
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   21.0    23.0    35.0    32.8    38.0    47.0 
> order(x)
[1] 4 1 2 5 3
> order(x, decreasing=TRUE)
[1] 3 5 2 1 4
> x[order(x)]
[1] 21 23 35 38 47
> x[order(x, decreasing=TRUE)]
[1] 47 38 35 23 21

Until now we have looked at vectors which were a collection of numbers. R has five basic classes of objects: numeric, integer, character, logical, and complex. Vectors can contain only one type class of object. The following are examples of character and logical vectors.

> x <- c("John", "David", "Melissa", "Barbara", "Janice")
> y <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
> typeof(x)
[1] "character"
> typeof(y)
[1] "logical"

Exercise:

  • Create a sequence from 70 to 150 at intervals of 5 and assign it to symbol x
  • Evaluate x[2:5]
  • Evaluate x[c(1,9)]
  • Evaluate x[x > 120]
  • Evaluate x > 120

Matrices and Lists

R has matrices which are two-dimensional vectors. We will not be covering matrices in detail in this tutorials.

A matrix with specified dimensions can be created as shown below. Matrices are filled column wise. the dim command provides the matrix dimension

> x <- matrix(c(2,4,6,8,100,120), nrow = 2, ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    2    6  100
[2,]    4    8  120
> dim(x)
[1] 2 3
> x[2,1] # Accesses the matrix element in the second row and first column
[1] 4

cbind and rbind can be used to combine matrices by columns and rows.

> a <- c(2, 4, 6)
> b <- c(10, 12, 16)
> x <- cbind(a,b)
> x
     a  b
[1,] 2 10
[2,] 4 12
[3,] 6 16
> class(x)
[1] "matrix"
> dim(x)
[1] 3 2
> y <- rbind(a,b)
> y
  [,1] [,2] [,3]
a    2    4    6
b   10   12   16
> dim(y)
[1] 2 3

Lists are collection of objects. The difference between vectors and lists is that vectors are collection of objects of the same type. There are no such restrictions with lists.

> nam <- c("Avi", "Lincoln")
> sc <- c(25, 100, 22, 30)
> testlist <- list(names=nam, scores = sc)
> testlist
$names
[1] "Avi"     "Lincoln"

$scores
[1]  25 100  22  30

Lists can be accessed using the [[]] brackets. The different ways of accessing the lists are shown below.

> testlist[[2]]
[1]  25 100  22  30
> testlist[["scores"]]
[1]  25 100  22  30
> testlist$scores
[1]  25 100  22  30
> testlist$scores[2]
[1] 100

Note that there is a difference in the output class if the list elements are access using [] and [[]] brackets. When the list elements are accessed using [], the output is another list. Study the outputs carefully for the following piece of code.

> x <- testlist[[2]]
> y <- testlist[2]
> x
[1]  25 100  22  30
> y
$scores
[1]  25 100  22  30
> class(x)
[1] "numeric"
> class(y)
[1] "list"
> x[1]
[1] 25
> y[1]
$scores
[1]  25 100  22  30

Exercise:

  • Create three vectors - character vector name which contains “Jim”, “Bob”,“Sam”; numeric vector age which contains 30, 22, and 75; numeric vector scores which contains 91, 86, and 92
  • Create a list called newlist which from the three vectors
  • Study the output of str(newlist)

Factors

R uses factors to store categorical data. In the example below grade is a character vector which stores the grades of six students

> grade <- c("B", "A" , "C" , "A" , "D" , "B")
> grade
[1] "B" "A" "C" "A" "D" "B"

The grade vector does not capture the order implied, i.e, grade A is obviously better than B. In order to capture these levels, R uses factors

> grade <- factor(grade)
> grade
[1] B A C A D B
Levels: A B C D

In addition to printing the vector the levels are also displayed. The default values of levels are chosen in alphabetical order. In this case there is no issue as the grades are in alphabetical order.

> grade <- c("Poor" , "Excellent" , "Average" , "Average" , "Good" , "Excellent", "Average")
> grade <- factor(grade)
> grade
[1] Poor      Excellent Average   Average   Good      Excellent Average  
Levels: Average Excellent Good Poor

Now the levels are displayed as Average Excellent Good Poor which is not what we want. The levels should be Excellent Good Average Poor or ‘Poor Average Good Excellent’.

> grade <- c("Poor" , "Excellent" , "Average" , "Average" , "Good" , "Excellent", "Average")
> grade <- factor(grade, ordered = TRUE, levels = c("Excellent", "Good", "Average", "Poor"))
> grade
[1] Poor      Excellent Average   Average   Good      Excellent Average  
Levels: Excellent < Good < Average < Poor
> grade <- factor(grade, ordered = TRUE, levels = c("Poor" , "Average" , "Good" , "Excellent"))
> grade
[1] Poor      Excellent Average   Average   Good      Excellent Average  
Levels: Poor < Average < Good < Excellent

The term ordered = TRUE indicates that the levels have an order. This can be removed if the categorical variable simply represents levels which do no correspond to a order.

Exercise:

  • Create a vector test which contains the following sequence - “low”,“medium”,“low”,“low”,“medium”,“medium”,“high”,“high”,“high”,“low”
  • Convert test to a vector of type factor. Make sure the levels have order.
  • Assign the 10th element in the vector test to ‘very low’
  • Print the test vector and see if the 10th element has been assigned correctly. If not adjust the levels argument of the test vector to include very low.

Data frame

A data frame is a collection of vectors of different types. A data frame can be thought of as a matrix where each column corresponds to a vector of different class. Let us assume you conducted a survey of 20 households in your city and collected the following data and stored them as R vector objects.

> # The number of people in the household
> hhsize <- c(3, 5, 5, 4, 1, 5, 1, 5, 4, 3, 2, 3, 3, 1, 3, 2, 1, 5, 3, 3)
> # Gender of the main earning member
> gender <- c("FEMALE", "MALE", "MALE" , "MALE" , "MALE" , "FEMALE" , "MALE" , "MALE" , "MALE" , "FEMALE" , "MALE" ,  "FEMALE",  "MALE",  "MALE", "FEMALE" , "FEMALE" , "FEMALE" , "FEMALE" , "FEMALE" , "MALE")
> gender <- factor(gender)
> # Number of vehicles in the household
> numveh <- c(1, 1, 2, 2, 2, 3, 2, 3, 2, 2, 1, 3, 3, 3, 2, 1, 1, 3, 2, 2)
> # Household Annual Income
> income <- c(71703, 41387, 42450, 42365, 42794, 56645, 41057, 62354, 46500, 81684, 68378, 53783, 48747, 55939, 93491, 109798,  107449, 67245, 54402, 41722)
> # Distance to work in miles
> dist   <- c(5, 10, 15, 18, 14, 15, 10, 20, 18, 14, 10, 24, 11, 19, 12, 20, 5, 7, 5, 18)
> # Travel Time in minutes
> ttime  <- c(9, 16, 20, 32, 19, 20, 17, 36, 36, 22, 14, 39, 15, 36, 17, 28, 8, 10, 9, 25)

A new dataframe object HSTravel which combines all the above vectors can be created as shown below.

> HSTravel <- data.frame(hhsize, gender, numveh, income, dist, ttime)
> HSTravel
   hhsize gender numveh income dist ttime
1       3 FEMALE      1  71703    5     9
2       5   MALE      1  41387   10    16
3       5   MALE      2  42450   15    20
4       4   MALE      2  42365   18    32
5       1   MALE      2  42794   14    19
6       5 FEMALE      3  56645   15    20
7       1   MALE      2  41057   10    17
8       5   MALE      3  62354   20    36
9       4   MALE      2  46500   18    36
10      3 FEMALE      2  81684   14    22
11      2   MALE      1  68378   10    14
12      3 FEMALE      3  53783   24    39
 [ reached getOption("max.print") -- omitted 8 rows ]

All the variable names in the dataset can be displayed using the command names(). The first few rows can be viewed using the head() command.

> names(HSTravel)
[1] "hhsize" "gender" "numveh" "income" "dist"   "ttime" 
> head(HSTravel)
  hhsize gender numveh income dist ttime
1      3 FEMALE      1  71703    5     9
2      5   MALE      1  41387   10    16
3      5   MALE      2  42450   15    20
4      4   MALE      2  42365   18    32
5      1   MALE      2  42794   14    19
6      5 FEMALE      3  56645   15    20

The command Data frame name$Variable name can extract specific variables in the data set. For example, income details can be extracted as

> HSTravel$income
 [1]  71703  41387  42450  42365  42794  56645  41057  62354  46500  81684
[11]  68378  53783  48747  55939  93491 109798 107449  67245  54402  41722

Let us say from the original data set we want to create a new dataset HSTravel1 which has only hhsize, income, and distance; that can be accomplished as shown below.

> HSTravel1 <- HSTravel[,c(1,4,5)]
> HSTravel1
   hhsize income dist
1       3  71703    5
2       5  41387   10
3       5  42450   15
4       4  42365   18
5       1  42794   14
6       5  56645   15
7       1  41057   10
8       5  62354   20
9       4  46500   18
10      3  81684   14
11      2  68378   10
12      3  53783   24
13      3  48747   11
14      1  55939   19
15      3  93491   12
16      2 109798   20
17      1 107449    5
18      5  67245    7
19      3  54402    5
20      3  41722   18

Let us say from the original data set we want to create a new dataset HSTravel2 which has the first 5 datapoints:

> HSTravel2 <- HSTravel[1:5,]
> HSTravel2
  hhsize gender numveh income dist ttime
1      3 FEMALE      1  71703    5     9
2      5   MALE      1  41387   10    16
3      5   MALE      2  42450   15    20
4      4   MALE      2  42365   18    32
5      1   MALE      2  42794   14    19

Let us say from the original data set we want to create a new dataset HSTravel3 which has only the males:

> HSTravel3 <- subset(HSTravel, gender == "MALE")
> HSTravel3
   hhsize gender numveh income dist ttime
2       5   MALE      1  41387   10    16
3       5   MALE      2  42450   15    20
4       4   MALE      2  42365   18    32
5       1   MALE      2  42794   14    19
7       1   MALE      2  41057   10    17
8       5   MALE      3  62354   20    36
9       4   MALE      2  46500   18    36
11      2   MALE      1  68378   10    14
13      3   MALE      3  48747   11    15
14      1   MALE      3  55939   19    36
20      3   MALE      2  41722   18    25

Let us take the income variable and create new income category variable catinc: * catinc takes value 1 if income is less than or equal to 30000. * catinc takes value 2 if income is between 30000 and 50000. * catinc takes value 3 if income is greater than 50000.

> HSTravel$catinc[HSTravel$income <= 30000] <- 1
> HSTravel$catinc[HSTravel$income > 30000 & HSTravel$income <= 50000 ] <- 2
> HSTravel$catinc[HSTravel$income > 50000 ] <- 3
> HSTravel$catinc
 [1] 3 2 2 2 2 3 2 3 2 3 3 3 2 3 3 3 3 3 3 2

To create a new variable speed which is distance divided by travel time converted to hours:

> HSTravel$speed <- HSTravel$dist*60/HSTravel$ttime
> HSTravel$speed
 [1] 33.33333 37.50000 45.00000 33.75000 44.21053 45.00000 35.29412
 [8] 33.33333 30.00000 38.18182 42.85714 36.92308 44.00000 31.66667
[15] 42.35294 42.85714 37.50000 42.00000 33.33333 43.20000
> head(HSTravel)
  hhsize gender numveh income dist ttime catinc    speed
1      3 FEMALE      1  71703    5     9      3 33.33333
2      5   MALE      1  41387   10    16      2 37.50000
3      5   MALE      2  42450   15    20      2 45.00000
4      4   MALE      2  42365   18    32      2 33.75000
5      1   MALE      2  42794   14    19      2 44.21053
6      5 FEMALE      3  56645   15    20      3 45.00000

Exercise:

In the HSTravel dataframe

  • Create a new dataset which contains only the hhsize and number of vehicles
  • Create a new variable FEM_TT which takes value 1 if the main earning member is a female and travel time to work is higher than 20 minutes and 0 otherwise.
  • Create a new variable for travel time in hours.

Summary Statistics

The str() command provides summary information about any R object. In the case of a data frame the str() function provides useful information about the data frame in a compact manner. The summary(Data frame name) command provides the summary statistics - mean, median, minimum, maximum, first and third quantile of every variable and frequency of each level for each facto variable in the data set. For summary statistic of a specific variable, the command summary(Data frame name$variable name) can be used.

> str(HSTravel)
'data.frame':   20 obs. of  8 variables:
 $ hhsize: num  3 5 5 4 1 5 1 5 4 3 ...
 $ gender: Factor w/ 2 levels "FEMALE","MALE": 1 2 2 2 2 1 2 2 2 1 ...
 $ numveh: num  1 1 2 2 2 3 2 3 2 2 ...
 $ income: num  71703 41387 42450 42365 42794 ...
 $ dist  : num  5 10 15 18 14 15 10 20 18 14 ...
 $ ttime : num  9 16 20 32 19 20 17 36 36 22 ...
 $ catinc: num  3 2 2 2 2 3 2 3 2 3 ...
 $ speed : num  33.3 37.5 45 33.8 44.2 ...
> summary(HSTravel)
     hhsize        gender       numveh         income            dist     
 Min.   :1.00   FEMALE: 9   Min.   :1.00   Min.   : 41057   Min.   : 5.0  
 1st Qu.:2.00   MALE  :11   1st Qu.:1.75   1st Qu.: 42708   1st Qu.:10.0  
 Median :3.00               Median :2.00   Median : 55171   Median :14.0  
 Mean   :3.10               Mean   :2.05   Mean   : 61495   Mean   :13.5  
 3rd Qu.:4.25               3rd Qu.:3.00   3rd Qu.: 69209   3rd Qu.:18.0  
 Max.   :5.00               Max.   :3.00   Max.   :109798   Max.   :24.0  
     ttime           catinc        speed      
 Min.   : 8.00   Min.   :2.0   Min.   :30.00  
 1st Qu.:14.75   1st Qu.:2.0   1st Qu.:33.65  
 Median :19.50   Median :3.0   Median :37.84  
 Mean   :21.40   Mean   :2.6   Mean   :38.61  
 3rd Qu.:29.00   3rd Qu.:3.0   3rd Qu.:42.94  
 Max.   :39.00   Max.   :3.0   Max.   :45.00  
> summary(HSTravel$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  41060   42710   55170   61490   69210  109800 

The mean, variance, and standard deviation of individual variables can be obtained as shown below

> mean(HSTravel$income)
[1] 61494.65
> var(HSTravel$income)
[1] 468664164
> sd(HSTravel$income)
[1] 21648.65

The summary() command can be nested with a subset() command to obtain summary statistics of subset of dataset. For example, we can obtain summary statistics of all households earning more than 60000 as follows:

> summary(subset(HSTravel, income >= 60000))
     hhsize       gender      numveh         income            dist      
 Min.   :1.0   FEMALE:6   Min.   :1.00   Min.   : 62354   Min.   : 5.00  
 1st Qu.:2.0   MALE  :2   1st Qu.:1.00   1st Qu.: 68095   1st Qu.: 6.50  
 Median :3.0              Median :1.50   Median : 76694   Median :11.00  
 Mean   :3.0              Mean   :1.75   Mean   : 82763   Mean   :11.62  
 3rd Qu.:3.5              3rd Qu.:2.25   3rd Qu.: 96981   3rd Qu.:15.50  
 Max.   :5.0              Max.   :3.00   Max.   :109798   Max.   :20.00  
     ttime           catinc      speed      
 Min.   : 8.00   Min.   :3   Min.   :33.33  
 1st Qu.: 9.75   1st Qu.:3   1st Qu.:36.46  
 Median :15.50   Median :3   Median :40.09  
 Mean   :18.00   Mean   :3   Mean   :39.05  
 3rd Qu.:23.50   3rd Qu.:3   3rd Qu.:42.48  
 Max.   :36.00   Max.   :3   Max.   :42.86  

The frequency of discrete variables can obtained as:

> table(HSTravel$hhsize)

1 2 3 4 5 
4 2 7 2 5 

If you want to display the frequencies in columen format:

> cbind(table(HSTravel$hhsize))
  [,1]
1    4
2    2
3    7
4    2
5    5

The relative frequencies can be obtained either using the prop.table() command or by dividing the frequencies by the total number of entries as shown below:

> cbind(prop.table(table(HSTravel$hhsize)))
  [,1]
1 0.20
2 0.10
3 0.35
4 0.10
5 0.25
> cbind(table(HSTravel$hhsize)/length(HSTravel$hhsize))
  [,1]
1 0.20
2 0.10
3 0.35
4 0.10
5 0.25

Determining the frequency table for continuous variables is more complicated as we first need to specify the bins. From the summary statistics, we know that the minimum value of income is 41057 and the maximum value is 109798. Let the bins be from 40000 to 110000 at intervals of 10000. We use the seq() command to specify the breaks and the cut() command to specify the categories. Note that adding the argument right = FALSE makes the intervals closed on the left and open on the right.

> breaks = seq(40000, 110000, 10000)
> breaks
[1]  40000  50000  60000  70000  80000  90000 100000 110000
> incomecat1 = cut(HSTravel$income, breaks)
> incomecat1
 [1] (7e+04,8e+04]   (4e+04,5e+04]   (4e+04,5e+04]   (4e+04,5e+04]  
 [5] (4e+04,5e+04]   (5e+04,6e+04]   (4e+04,5e+04]   (6e+04,7e+04]  
 [9] (4e+04,5e+04]   (8e+04,9e+04]   (6e+04,7e+04]   (5e+04,6e+04]  
[13] (4e+04,5e+04]   (5e+04,6e+04]   (9e+04,1e+05]   (1e+05,1.1e+05]
[17] (1e+05,1.1e+05] (6e+04,7e+04]   (5e+04,6e+04]   (4e+04,5e+04]  
7 Levels: (4e+04,5e+04] (5e+04,6e+04] (6e+04,7e+04] ... (1e+05,1.1e+05]
> incomecat2 = cut(HSTravel$income, breaks, right = FALSE)
> incomecat2
 [1] [7e+04,8e+04)   [4e+04,5e+04)   [4e+04,5e+04)   [4e+04,5e+04)  
 [5] [4e+04,5e+04)   [5e+04,6e+04)   [4e+04,5e+04)   [6e+04,7e+04)  
 [9] [4e+04,5e+04)   [8e+04,9e+04)   [6e+04,7e+04)   [5e+04,6e+04)  
[13] [4e+04,5e+04)   [5e+04,6e+04)   [9e+04,1e+05)   [1e+05,1.1e+05)
[17] [1e+05,1.1e+05) [6e+04,7e+04)   [5e+04,6e+04)   [4e+04,5e+04)  
7 Levels: [4e+04,5e+04) [5e+04,6e+04) [6e+04,7e+04) ... [1e+05,1.1e+05)

The frequencies can now be determined using the table() command.

> cbind(table(incomecat1))
                [,1]
(4e+04,5e+04]      8
(5e+04,6e+04]      4
(6e+04,7e+04]      3
(7e+04,8e+04]      1
(8e+04,9e+04]      1
(9e+04,1e+05]      1
(1e+05,1.1e+05]    2
> cbind(table(incomecat2))
                [,1]
[4e+04,5e+04)      8
[5e+04,6e+04)      4
[6e+04,7e+04)      3
[7e+04,8e+04)      1
[8e+04,9e+04)      1
[9e+04,1e+05)      1
[1e+05,1.1e+05)    2

Exercise:

In the HSTravel dataframe

  • Determine the average and standard deviation of income for households whose main earning member is a female
  • Determine the relative frequencies of distance traveled using bins of size 5.
  • Determine the summary statistics of income and travel time to work for households with two vehicles.

Apply() functions

apply() commands can be used to apply functions to each row and column of the dataframes. Consider a data frame containing hhsize, number of vehicles, income, and distance.

> HSTravel4 <- HSTravel[,c(1,3,4,5,6)]
> head(HSTravel4)
  hhsize numveh income dist ttime
1      3      1  71703    5     9
2      5      1  41387   10    16
3      5      2  42450   15    20
4      4      2  42365   18    32
5      1      2  42794   14    19
6      5      3  56645   15    20

The following command applies the mean function to each row of the dataframe. The 1 specifies that mean should be applied row wise.

> x <- apply(HSTravel4,1, mean)
> class(x)
[1] "numeric"
> x
 [1] 14344.2  8283.8  8498.4  8484.2  8566.0 11337.6  8217.4 12483.6
 [9]  9312.0 16345.0 13681.0 10770.4  9755.8 11199.6 18705.0 21969.8
[17] 21492.8 13454.0 10884.2  8354.0

The following command applies the mean function to each column of the dataframe. The 2 specifies that mean should be applied column wise.

> apply(HSTravel4,2, mean)
  hhsize   numveh   income     dist    ttime 
    3.10     2.05 61494.65    13.50    21.40 

R has several functions belonging to the apply() family. For example, the lapply() function returns a list.

> x <- lapply(HSTravel4,mean)
> class(x)
[1] "list"
> x
$hhsize
[1] 3.1

$numveh
[1] 2.05

$income
[1] 61494.65

$dist
[1] 13.5

$ttime
[1] 21.4

Similarly sapply() can be applied to a data frame and it returns a vector. dapply() returns a data frame.

tapply() is another powerful command if you want to break down a variable into groups and apply a function on each of those groups. For example, let us say you want to calculate the average distance for each number of vehicle.

> x <- tapply(HSTravel$dist,HSTravel$numveh, mean)
> class(x)
[1] "array"
> x
       1        2        3 
10.00000 13.77778 16.00000 

The average travel distance is calculated for numveh=1,2, and 3.

If you want to calculate the average travel distance for each combination of hhsize and numveh, then

> x <- tapply(HSTravel$dist,list(HSTravel$hhsize, HSTravel$numveh), mean)
> class(x)
[1] "matrix"
> x
   1     2    3
1  5 12.00 19.0
2 15    NA   NA
3  5 12.25 17.5
4 NA 18.00   NA
5 10 15.00 14.0

The NA indicates that there are no datapoints with hhsize=2 and numveh=2.

Exercise:

  • In the HSTravel dataframe, determine the median and standard deviation of income for households for each gender.
  • In the HSTravel4 dataframe, determine the median for each variable and store the result as a vector.

Reading Input from Files

Download the file HHData400.csv into your working directory. The data set is a CSV or comma separated text file. The following read.csv command reads in the file and stores it as a data frame named HSD.

> HSD <- read.csv("HHData400.csv" , header = TRUE)
> names(HSD)
[1] "numwork" "hhsize"  "numveh"  "gender"  "inc"     "dist"    "ttime"  

Note that the command read.csv() is used for reading values from a comma separated text file. The first item in the bracket HHData.csv corresponds to the file name. If the file is not located in your working directory then you will need to provide the full path of the file location. The second argument header=TRUE indicates that the first row in the data file contains the headers which indicate what each column corresponds to. If your file does not have a header row, then you dont need to provide this argument.

I normally use csv files and hence stick to read.csv(). If the data entries are separated by tab spaces instead of commas we can use read.table(). For importing data sets from excel add on packages can be installed. However for beginners I recommend saving the excel file as a csv file and using read.csv().

If you want to change the name of a specific column, the names() command can be used in the following manner.

> names(HSD)[names(HSD)=="inc"] <- "Income"
> names(HSD)
[1] "numwork" "hhsize"  "numveh"  "gender"  "Income"  "dist"    "ttime"  

If you want to rename the entire set of columns

> names(HSD) <- c ("Num of Workers", "Household Size", "Number of Vehicles" , "Gender", "Income", "Distance", "Travel Time")
> names(HSD)
[1] "Num of Workers"     "Household Size"     "Number of Vehicles"
[4] "Gender"             "Income"             "Distance"          
[7] "Travel Time"       
> str(HSD)
'data.frame':   400 obs. of  7 variables:
 $ Num of Workers    : int  1 3 2 2 1 1 3 2 2 2 ...
 $ Household Size    : int  2 3 3 2 1 3 4 3 2 4 ...
 $ Number of Vehicles: int  2 3 1 1 0 2 4 2 1 3 ...
 $ Gender            : Factor w/ 2 levels "FEMALE","MALE": 1 2 2 2 1 2 2 2 1 2 ...
 $ Income            : num  45290 168343 127862 167421 78439 ...
 $ Distance          : num  50.7 77.3 64.9 30.6 26.4 ...
 $ Travel Time       : num  67.6 106.2 82.6 48.5 39.6 ...

Exercise

Download the titanic dataset titanic.csv to your working directory. The dataset should have 14 variables and 1309 observations. This dataset is the titanic3 dataset created by Thomas Cason of UVA. For more details on the dataset click this link: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt

  • Create a new dataset for class 1, class 2 and class 3 passengers?
  • Determine the number of class 1, 2, and 3 passengers?
  • Determine the proportion of class 1, 2, and 3 passengers who survived?
  • Determine the proportion of males and females in class 1, 2, and 3 who did not survive?

Plots

We will use the HHData400.csv to generate plots.

> HSD <- read.csv("HHData400.csv" , header = TRUE)
> names(HSD)
[1] "numwork" "hhsize"  "numveh"  "gender"  "inc"     "dist"    "ttime"  

The plot() command is the simplest and versatile command for generating plots.

> HSD <- read.csv("HHData400.csv" , header = TRUE)
> names(HSD)
[1] "numwork" "hhsize"  "numveh"  "gender"  "inc"     "dist"    "ttime"  

plot(var1, var2) will create a scatter plot wherevar1andvar2` are vectors. For example to create a scatter plot between distance and travel time use the following command.

> plot(HSD$dist, HSD$ttime)

The command plot() has a number of arguments which can be used to generate different type of plots. For a list of all arguments associated with the plot() command type ?plot and press enter.

> plot(HSD$dist, HSD$ttime, xlab=" Distance (miles)", ylab= "Travel Time (minutes)", main = "Travel Time vs Distance" , col = "red" , frame.plot=FALSE, xlim = c(0,140))

The above command makes the following modifications:

  • Add labels to the x and y axis using xlab and ylab arguments
  • Add a main title to the plot using the main argument
  • Changes the color of the dots to red using col argument
  • Removes the frame using the frame.plot argument
  • Sets the limits on the x axis using the xlim argument

For factor variables plot() generates frequency plots. For discrete variables which are not factors, barplot() command can be combined with table() command to generate frequency plots.

> plot(HSD$gender , xlab = "Gender",  ylab = "Frequency", ylim = c(0,300), main = "Household Main Earning Member")

> freq <- table(HSD$numveh)
> barplot(freq , xlab = "Number of Vehicles",  ylab = "Frequency", ylim = c(0,140), main = "Household Vehicle Ownership")

Histograms are created for the continuous variables using the hist() command.

> hist(HSD$ttime)

The plot can be made more cleaner as shown below.

> hist(HSD$ttime, xlab = "Travel Time (in minutes)" , main = "Travel Time", col = "green")

You can specify equally sized bins of your choosing by adding a breaks argument.

> hist(HSD$ttime, xlab = "Travel Time (in minutes)" , main = "Travel Time", col = "green" , breaks = seq(0,250,50))

Box plots can be generated using the boxplot() command as shown below

> boxplot(HSD$dist, ylab = "Distance (miles)", main = "Distance")

Boxplots can also be generated for continuous variable separated by groups if the variable denoting the groups is represented as a factor variable.

> boxplot(dist ~ gender, data = HSD, xlab = "Gender " , ylab = "Distance (miles)", main = "Distance")

The generated plots can be saved as a PNG image file in the working directory as shown below. Dont forget to run the dev.off() command after the plotting command.

> png("DistBoxPlot.png")
> boxplot(dist ~ gender, data = HSD, xlab = "Gender " , ylab = "Distance (miles)", main = "Distance")
> dev.off()

The generated plots can be saved as a PDF file in the working directory as shown below.

> pdf("DistBoxPlot.pdf")
> boxplot(dist ~ gender, data = HSD, xlab = "Gender " , ylab = "Distance (miles)", main = "Distance")
> dev.off()

Exercise

  • Plot the histogram for income using the following bins of size 50000 starting at 0.
  • Plot the bar plot for household size.
  • Plot parallel box plots for income for each value of number of workers.

Probabilities

R can be used to calculate the following quantities for a distribution:

  • Point Probability
  • Cumulative Probability
  • Percentile also known as Quantile
  • Pseudo Random Numbers

The functions dnorm(), pnorm(), qnorm(), and rnorm() deliver the above quantities for the normal distribution. Let us consider a normal distribution with mean 10 and standard deviation 5. Then the point probability, cumulative probability at 10 and the 60th percentile can be generated as shown below.

> u <- dnorm(10, 10, 5)
> u
[1] 0.07978846
> v <- pnorm(10, 10, 5)
> v
[1] 0.5
> w <- qnorm(0.6, 10, 5)
> w
[1] 11.26674

Fifty random realizations from a normal distribution with mean 10 and standard deviation 5 can be generated as follows:

> x <- rnorm(50, 10, 5)

The bell curve can be plotted to verify the functions:

> x <- seq(-20, 40, 0.1)
> plot(x, dnorm(x, 10, 5), type = "l" , main = "Normal PDF", ylab = "f(x)" , xlab = "x")

The corresponding equivalent functions for a binomial distribution are dbinom(), pbinom(), qbinom(), and rbinom() respectively. Let us consider a binomial distribution with 10 trials and the probability of success in each trial being 0.35. The probability of the outcome being 5, the cumulative probability at 5, the 50th percentile can be generated as shown below.

> u <- dbinom(5, 10, 0.35)
> u
[1] 0.1535704
> v <- dbinom(5, 10, 0.35)
> v
[1] 0.1535704
> w <- qbinom(0.5, 10, 0.35)
> w
[1] 3

The probability density function for the above binomial distribution can be plotted as shown below:

> x <- 1:10
> plot(x, dbinom(x, 10, 0.35), type = "h" , main = "Binomial PDF", ylab = "p(x)" , xlab = "x")

Exercise

Consider a Poisson distribution with lambda parameter of 20.

  • Determine the probability of an outcome being equal to 15
  • Determine the probability of an outcome being less than equal to 15
  • Determine the 75th percentile
  • Generate 50 random realizations from this distribution

Packages

An R package is a collection of function and data sets which add functionality to R. R comes with its base set of packages. The command library() lists all packages which are currently available in your R installation.

> library()

You can add more features to your current R setup by installing more packages. This is done using the install.packages("Package Name") command. At this point R has around 12000 packages.

> install.packages("Package Name")

Some of the popular packages I use are “car”, “dplyr”, and “ggplot2”. The entire list of packages is available at the CRAN website: https://cran.r-project.org/web/packages/.

When you install a specific package, it adds additional functionalities to the base R installation. In order to use those commands, you will have to load the package first using the library() command. For example, if you want to use the ggplot2 plotting capabilities, first install the package and then load it.

> install.packages("ggplot2")
> library(ggplot2)

You only need to install the package once in your system. However, you will have to load the package each time you start an R session if you want to use its functionalities.