Overview
R is a powerful and open source statistical computing and graphical language and environment. The goal of this tutorial is to give a concise introduction to R and is aimed at Civil and Environmental Engineering undergraduate and graduate students. This tutorial is by no means comprehensive. There are several very good detailed sources for R. Some of the popular R books are:
- R in Action: Data Analysis and Graphics with R by Robert Kabacoff: https://www.manning.com/books/r-in-action-second-edition
- R for Everyone: Advanced Analytics and Graphics (Addison-Wesley Data and Analytics) by Jared P. Lander: http://www.jaredlander.com/r-for-everyone/
- Advanced R (Chapman & Hall/CRC The R Series) by Hadley Wickham: https://www.crcpress.com/Advanced-R/Wickham/p/book/9781466586963
- R Graphics Cookbook by Winston Chang: http://www.cookbook-r.com/Graphs/
- ggplot2: Elegant Graphics for Data Analysis (Use R!) by Hadley Wickham: http://www.springer.com/us/book/9780387981413
- The Art of R Programming: A Tour of Statistical Software Design 1st Edition by Norman Matloff: https://www.nostarch.com/artofr.htm
- An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani: http://www.springer.com/us/book/9781461471370#aboutBook
There are also several free resources available. I highly recommend reading the ``Introduction to R’’ pdf document which comes with the R installation. There are also several youtube videos and courses on MOOCs on using R for data analysis.
Note
If the figures in this tutorial appear too small or too big, try a different browser. If the formatting appears odd on multiple browsers, send me an email and I will do my best to fix the issue.
Installing R and R Studio
This section provides instructions for installing R as well as RStudio.
Installing R
- Go to the webpage: http://cran.cnr.berkeley.edu/
- There are three links on the webpage:
- Download R for Linux
- Download R for MacOSX
- Download R for Windows.
- The rest of the instructions assume that you have a windows operating system
- Click on Download R for Windows link.
- You will reach R for windows webpage. Click on ``install R for the first time’’ link
- In the next page click on Download R 3.1.3 for Windows link. Note that 3.1.3 corresponds to the R version number. This will get updated frequently as the software is being developed.
- Save the ``R-3.1.3-win.exe’’ file to any directory on your computer
- Double click the file to start the installation and use the default settings for all installation related settings.
- The Setup Language is English. Choose default options for the rest of the installation steps.
- At the end of the installation you should see an icon with a big R on your desktop. If you have a 64 bit operation system you will see two icons as shown below:
- If you double click on the icon you should get a RGui startup screen as shown below. Close the RGui and proceed with the installation of RStudio.
Installing RStudio
- RStudio is a very good Integrated Development Environment (IDE) for R.
- Go to: http://www.rstudio.com/products/rstudio/download/
- Download the appropriate installer for your operating system.
- Double click on the installation file and follow the default settings.
- Once the installation is complete, open RStudio and you should get the following window:
Basic R commands
From now on we will be using RStudio to run the R commands. Open RStudio.
Your first R commands
R can be used as a calculator. Run the following commands in the console window. Type the following expression and press enter.
> 10 + 20
[1] 30
> 4^2
[1] 16
> 3 + (18/5)
[1] 6.6
Type x <- 20
and press enter. Then type x
and press enter. You should get the following results.
> x <- 20
> x
[1] 20
This is an example of an R expression. <-
is the assignment operator which assigns the value 20 to the R object named x
. When you type x
and press enter, the value stored in x
is displayed.
Setting your working directory
The working directory corresponds to the folder on the computer which R looks for by default when you give the command to read a file or save plots and results. You can obtain the current working directory using the getwd()
command.
> getwd()
[1] "C:/Users/avinash/Dropbox/Teaching/Tutorial/R_Tutorial New"
If I ask R to save a plot, it will save to the directory displayed above by default. You can change the working directory using setwd("Path of Folder")
. I always recommend students to change the working directory to something which is convenient for them. For example if you want to store all R input and output files in the folder C:\Courses\CE410\RStuff
. You can do so by executing the following command
> setwd("C:/Courses/CE410/RStuff")
Note that R uses forward slashes.
Console versus R Scripts
You can run R commands either through the console as we have done until now or using R scripts. For the purpose of this class, I want all of you to get comfortable using R scripts. You will be submitting the final R scripts in your assignments.
In RStudio you can create a R script by clicking File, New File, and selecting R script. Then give the new script a descriptive name by clicking File and then Save as. I am naming this new R Script FirstRScript.R
. The .R
extension signifies that it is an R script. Save the file in your working directory.
Then type the earlier commands as shown in the figure below.
Note that R does not execute a command in an R script when you press enter. In the R script #
symbol helps you write comments. Anything to the right of the #
symbol is not executed. When you write longer scripts, this is really useful as sometimes you may forget why you ran a particular command.
In order to execute the commands, select a line or a group of lines and press the Run button. You will notice that the commands on the selected lines will be executed and the results are shown in the console.
Using a script file is convenient when you have to run the same set of commands repeatedly. For example, if you are working for the Department of Transportation and let us say you need to analyze traffic volume data every month. If you create a script file containing the various statistical analysis procedures you want to conduct every month, then all you will need to do is execute the script file every month with a new set of input data.
Expressions
All standard mathematical operators +, -, / , *, ^
and functions log, exp, sin, sqrt
work in R.
> x <- 20
> x
[1] 20
> y <- x + 5
> y
[1] 25
> z <- sqrt(x)
> z
[1] 4.472136
> u <- x*y
> u
[1] 500
> v <- y^z
> v
[1] 1785574
> w <- log(u + v)
> w
[1] 14.39553
NOTE:
- R is case sensitive. Therefore
u
is different fromU
. - R works on objects. Everything created and manipulated in R is an object. In the boave example,
x, y, z, u, v, w
are all objects. - The command
ls()
displayes all objects in the workspace. - The command
rm("object name 1")
removes the particular object from the workspace. - The command
rm(list=ls())
removes all the objects from the workspace.
Exercise:
- Assign the value 7 to an R object r.
- Calculate the area of a circle with radius equal to 7 and assign the value to a symbol A.
- Evaluate the command
A > 100
. - Evaluate the command
A == 100
.
Atomic Data Types in R
R has the following basic data types:
- character
- integer
- numeric
- logical
- complex
A character object is used to store strings. In the following example x
is a character object which stores the string ``Avi’’. A string object is specified by quotes (single and doubt quotes work)
> x <- "Avi"
> y <- 'nash'
> x
[1] "Avi"
> y
[1] "nash"
Integer objects are used to store integers. We will focus on numeric objects which can store both integers and real numbers. In the example shown below, x
is a numeric object.
> x <- 2.5
A logical object can take only true or false values. More on that later. The class()
and typeof()
are useful commands to get more information about the objects. Note that double is a type of numeric data type.
> x <- "Avi"
> class(x)
[1] "character"
> typeof(x)
[1] "character"
> y <- 2.5
> class(y)
[1] "numeric"
> typeof(y)
[1] "double"
Data Structures in R
R has several data structures:
- Vectors
- Lists
- Matrices
- Data Frames
Vector
The simplest data structure in R is a vector. A vector is a collection of objects of the same type or class. Vectors can be created by the c()
function.
> x <- c(2, 3, 8.2, 12)
> x
[1] 2.0 3.0 8.2 12.0
> x[2]
[1] 3
> x[3]
[1] 8.2
> class(x)
[1] "numeric"
> typeof(x)
[1] "double"
In the above expression the vector x
is assigned the values within the c()
function. Note that x[2]
prints the second element and x[3]
the third element and so on. The class()
and typeof()
are useful commands to get more information about vector objects also.
Sequence of numbers can be generated using the colon operator. In the example below the vector x
contains 1,2,3,...,20
.
> x <- 1:20
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Another way to generate a sequence of numbers is using seq()
function. In the example below, the vector x
is assigned a sequence of numbers from 1 to 5 at intervals of 0.25. The length()
returns the number of elements in the vector.
> x <- seq(1, 5, 0.25)
> x
[1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25
[15] 4.50 4.75 5.00
> length(x)
[1] 17
The rep()
command can be used to create a vector by repeating any R object. For example in the command below, the x
vector contains 5 repeated 10 times.
> x <- rep(5, 10)
> x
[1] 5 5 5 5 5 5 5 5 5 5
Here the sequence 2, 3, 4 is repeated 3 times.
> x <- rep(2:5,3)
> x
[1] 2 3 4 5 2 3 4 5 2 3 4 5
We can perform arithmetic operations on vectors. The commans x*2
multiplies every element in the vector x
by 2. The command log(x)
takes the logarithm of each element of x
.
> x <- 5:10
> x
[1] 5 6 7 8 9 10
> y <- x*2
> y
[1] 10 12 14 16 18 20
> z <- c(100, 300, 500, 700, 900, 1000)
> z*x
[1] 500 1800 3500 5600 8100 10000
> z/x
[1] 20.00000 50.00000 71.42857 87.50000 100.00000 100.00000
For numeric vectors, the average, sample standard deviation, median can be calculated as follows. The summary()
command provides useful summary statistics. The order()
lists the indices in ascending order. Note the default order is ascending.
> x <- c(23, 35, 47,21, 38)
> mean(x)
[1] 32.8
> var(x)
[1] 117.2
> sd(x)
[1] 10.82589
> median(x)
[1] 35
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
21.0 23.0 35.0 32.8 38.0 47.0
> order(x)
[1] 4 1 2 5 3
> order(x, decreasing=TRUE)
[1] 3 5 2 1 4
> x[order(x)]
[1] 21 23 35 38 47
> x[order(x, decreasing=TRUE)]
[1] 47 38 35 23 21
Until now we have looked at vectors which were a collection of numbers. R has five basic classes of objects: numeric, integer, character, logical, and complex. Vectors can contain only one type class of object. The following are examples of character and logical vectors.
> x <- c("John", "David", "Melissa", "Barbara", "Janice")
> y <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
> typeof(x)
[1] "character"
> typeof(y)
[1] "logical"
Exercise:
- Create a sequence from 70 to 150 at intervals of 5 and assign it to symbol
x
- Evaluate
x[2:5]
- Evaluate
x[c(1,9)]
- Evaluate
x[x > 120]
- Evaluate
x > 120
Matrices and Lists
R has matrices which are two-dimensional vectors. We will not be covering matrices in detail in this tutorials.
A matrix with specified dimensions can be created as shown below. Matrices are filled column wise. the dim
command provides the matrix dimension
> x <- matrix(c(2,4,6,8,100,120), nrow = 2, ncol = 3)
> x
[,1] [,2] [,3]
[1,] 2 6 100
[2,] 4 8 120
> dim(x)
[1] 2 3
> x[2,1] # Accesses the matrix element in the second row and first column
[1] 4
cbind
and rbind
can be used to combine matrices by columns and rows.
> a <- c(2, 4, 6)
> b <- c(10, 12, 16)
> x <- cbind(a,b)
> x
a b
[1,] 2 10
[2,] 4 12
[3,] 6 16
> class(x)
[1] "matrix"
> dim(x)
[1] 3 2
> y <- rbind(a,b)
> y
[,1] [,2] [,3]
a 2 4 6
b 10 12 16
> dim(y)
[1] 2 3
Lists are collection of objects. The difference between vectors and lists is that vectors are collection of objects of the same type. There are no such restrictions with lists.
> nam <- c("Avi", "Lincoln")
> sc <- c(25, 100, 22, 30)
> testlist <- list(names=nam, scores = sc)
> testlist
$names
[1] "Avi" "Lincoln"
$scores
[1] 25 100 22 30
Lists can be accessed using the [[]]
brackets. The different ways of accessing the lists are shown below.
> testlist[[2]]
[1] 25 100 22 30
> testlist[["scores"]]
[1] 25 100 22 30
> testlist$scores
[1] 25 100 22 30
> testlist$scores[2]
[1] 100
Note that there is a difference in the output class if the list elements are access using []
and [[]]
brackets. When the list elements are accessed using []
, the output is another list. Study the outputs carefully for the following piece of code.
> x <- testlist[[2]]
> y <- testlist[2]
> x
[1] 25 100 22 30
> y
$scores
[1] 25 100 22 30
> class(x)
[1] "numeric"
> class(y)
[1] "list"
> x[1]
[1] 25
> y[1]
$scores
[1] 25 100 22 30
Exercise:
- Create three vectors - character vector name which contains “Jim”, “Bob”,“Sam”; numeric vector age which contains 30, 22, and 75; numeric vector scores which contains 91, 86, and 92
- Create a list called newlist which from the three vectors
- Study the output of
str(newlist)
Factors
R uses factors to store categorical data. In the example below grade
is a character vector which stores the grades of six students
> grade <- c("B", "A" , "C" , "A" , "D" , "B")
> grade
[1] "B" "A" "C" "A" "D" "B"
The grade
vector does not capture the order implied, i.e, grade A is obviously better than B. In order to capture these levels, R uses factors
> grade <- factor(grade)
> grade
[1] B A C A D B
Levels: A B C D
In addition to printing the vector the levels are also displayed. The default values of levels are chosen in alphabetical order. In this case there is no issue as the grades are in alphabetical order.
> grade <- c("Poor" , "Excellent" , "Average" , "Average" , "Good" , "Excellent", "Average")
> grade <- factor(grade)
> grade
[1] Poor Excellent Average Average Good Excellent Average
Levels: Average Excellent Good Poor
Now the levels are displayed as Average Excellent Good Poor
which is not what we want. The levels should be Excellent Good Average Poor
or ‘Poor Average Good Excellent’.
> grade <- c("Poor" , "Excellent" , "Average" , "Average" , "Good" , "Excellent", "Average")
> grade <- factor(grade, ordered = TRUE, levels = c("Excellent", "Good", "Average", "Poor"))
> grade
[1] Poor Excellent Average Average Good Excellent Average
Levels: Excellent < Good < Average < Poor
> grade <- factor(grade, ordered = TRUE, levels = c("Poor" , "Average" , "Good" , "Excellent"))
> grade
[1] Poor Excellent Average Average Good Excellent Average
Levels: Poor < Average < Good < Excellent
The term ordered = TRUE
indicates that the levels have an order. This can be removed if the categorical variable simply represents levels which do no correspond to a order.
Exercise:
- Create a vector
test
which contains the following sequence - “low”,“medium”,“low”,“low”,“medium”,“medium”,“high”,“high”,“high”,“low” - Convert
test
to a vector of type factor. Make sure the levels have order. - Assign the 10th element in the vector
test
to ‘very low’ - Print the
test
vector and see if the 10th element has been assigned correctly. If not adjust the levels argument of thetest
vector to include very low.
Data frame
A data frame is a collection of vectors of different types. A data frame can be thought of as a matrix where each column corresponds to a vector of different class. Let us assume you conducted a survey of 20 households in your city and collected the following data and stored them as R vector objects.
> # The number of people in the household
> hhsize <- c(3, 5, 5, 4, 1, 5, 1, 5, 4, 3, 2, 3, 3, 1, 3, 2, 1, 5, 3, 3)
> # Gender of the main earning member
> gender <- c("FEMALE", "MALE", "MALE" , "MALE" , "MALE" , "FEMALE" , "MALE" , "MALE" , "MALE" , "FEMALE" , "MALE" , "FEMALE", "MALE", "MALE", "FEMALE" , "FEMALE" , "FEMALE" , "FEMALE" , "FEMALE" , "MALE")
> gender <- factor(gender)
> # Number of vehicles in the household
> numveh <- c(1, 1, 2, 2, 2, 3, 2, 3, 2, 2, 1, 3, 3, 3, 2, 1, 1, 3, 2, 2)
> # Household Annual Income
> income <- c(71703, 41387, 42450, 42365, 42794, 56645, 41057, 62354, 46500, 81684, 68378, 53783, 48747, 55939, 93491, 109798, 107449, 67245, 54402, 41722)
> # Distance to work in miles
> dist <- c(5, 10, 15, 18, 14, 15, 10, 20, 18, 14, 10, 24, 11, 19, 12, 20, 5, 7, 5, 18)
> # Travel Time in minutes
> ttime <- c(9, 16, 20, 32, 19, 20, 17, 36, 36, 22, 14, 39, 15, 36, 17, 28, 8, 10, 9, 25)
A new dataframe object HSTravel
which combines all the above vectors can be created as shown below.
> HSTravel <- data.frame(hhsize, gender, numveh, income, dist, ttime)
> HSTravel
hhsize gender numveh income dist ttime
1 3 FEMALE 1 71703 5 9
2 5 MALE 1 41387 10 16
3 5 MALE 2 42450 15 20
4 4 MALE 2 42365 18 32
5 1 MALE 2 42794 14 19
6 5 FEMALE 3 56645 15 20
7 1 MALE 2 41057 10 17
8 5 MALE 3 62354 20 36
9 4 MALE 2 46500 18 36
10 3 FEMALE 2 81684 14 22
11 2 MALE 1 68378 10 14
12 3 FEMALE 3 53783 24 39
[ reached getOption("max.print") -- omitted 8 rows ]
All the variable names in the dataset can be displayed using the command names()
. The first few rows can be viewed using the head()
command.
> names(HSTravel)
[1] "hhsize" "gender" "numveh" "income" "dist" "ttime"
> head(HSTravel)
hhsize gender numveh income dist ttime
1 3 FEMALE 1 71703 5 9
2 5 MALE 1 41387 10 16
3 5 MALE 2 42450 15 20
4 4 MALE 2 42365 18 32
5 1 MALE 2 42794 14 19
6 5 FEMALE 3 56645 15 20
The command Data frame name$Variable name
can extract specific variables in the data set. For example, income details can be extracted as
> HSTravel$income
[1] 71703 41387 42450 42365 42794 56645 41057 62354 46500 81684
[11] 68378 53783 48747 55939 93491 109798 107449 67245 54402 41722
Let us say from the original data set we want to create a new dataset HSTravel1 which has only hhsize, income, and distance; that can be accomplished as shown below.
> HSTravel1 <- HSTravel[,c(1,4,5)]
> HSTravel1
hhsize income dist
1 3 71703 5
2 5 41387 10
3 5 42450 15
4 4 42365 18
5 1 42794 14
6 5 56645 15
7 1 41057 10
8 5 62354 20
9 4 46500 18
10 3 81684 14
11 2 68378 10
12 3 53783 24
13 3 48747 11
14 1 55939 19
15 3 93491 12
16 2 109798 20
17 1 107449 5
18 5 67245 7
19 3 54402 5
20 3 41722 18
Let us say from the original data set we want to create a new dataset HSTravel2 which has the first 5 datapoints:
> HSTravel2 <- HSTravel[1:5,]
> HSTravel2
hhsize gender numveh income dist ttime
1 3 FEMALE 1 71703 5 9
2 5 MALE 1 41387 10 16
3 5 MALE 2 42450 15 20
4 4 MALE 2 42365 18 32
5 1 MALE 2 42794 14 19
Let us say from the original data set we want to create a new dataset HSTravel3 which has only the males:
> HSTravel3 <- subset(HSTravel, gender == "MALE")
> HSTravel3
hhsize gender numveh income dist ttime
2 5 MALE 1 41387 10 16
3 5 MALE 2 42450 15 20
4 4 MALE 2 42365 18 32
5 1 MALE 2 42794 14 19
7 1 MALE 2 41057 10 17
8 5 MALE 3 62354 20 36
9 4 MALE 2 46500 18 36
11 2 MALE 1 68378 10 14
13 3 MALE 3 48747 11 15
14 1 MALE 3 55939 19 36
20 3 MALE 2 41722 18 25
Let us take the income variable and create new income category variable catinc: * catinc takes value 1 if income is less than or equal to 30000. * catinc takes value 2 if income is between 30000 and 50000. * catinc takes value 3 if income is greater than 50000.
> HSTravel$catinc[HSTravel$income <= 30000] <- 1
> HSTravel$catinc[HSTravel$income > 30000 & HSTravel$income <= 50000 ] <- 2
> HSTravel$catinc[HSTravel$income > 50000 ] <- 3
> HSTravel$catinc
[1] 3 2 2 2 2 3 2 3 2 3 3 3 2 3 3 3 3 3 3 2
To create a new variable speed which is distance divided by travel time converted to hours:
> HSTravel$speed <- HSTravel$dist*60/HSTravel$ttime
> HSTravel$speed
[1] 33.33333 37.50000 45.00000 33.75000 44.21053 45.00000 35.29412
[8] 33.33333 30.00000 38.18182 42.85714 36.92308 44.00000 31.66667
[15] 42.35294 42.85714 37.50000 42.00000 33.33333 43.20000
> head(HSTravel)
hhsize gender numveh income dist ttime catinc speed
1 3 FEMALE 1 71703 5 9 3 33.33333
2 5 MALE 1 41387 10 16 2 37.50000
3 5 MALE 2 42450 15 20 2 45.00000
4 4 MALE 2 42365 18 32 2 33.75000
5 1 MALE 2 42794 14 19 2 44.21053
6 5 FEMALE 3 56645 15 20 3 45.00000
Exercise:
In the HSTravel
dataframe
- Create a new dataset which contains only the hhsize and number of vehicles
- Create a new variable FEM_TT which takes value 1 if the main earning member is a female and travel time to work is higher than 20 minutes and 0 otherwise.
- Create a new variable for travel time in hours.
Summary Statistics
The str()
command provides summary information about any R object. In the case of a data frame the str()
function provides useful information about the data frame in a compact manner. The summary(Data frame name)
command provides the summary statistics - mean, median, minimum, maximum, first and third quantile of every variable and frequency of each level for each facto variable in the data set. For summary statistic of a specific variable, the command summary(Data frame name$variable name)
can be used.
> str(HSTravel)
'data.frame': 20 obs. of 8 variables:
$ hhsize: num 3 5 5 4 1 5 1 5 4 3 ...
$ gender: Factor w/ 2 levels "FEMALE","MALE": 1 2 2 2 2 1 2 2 2 1 ...
$ numveh: num 1 1 2 2 2 3 2 3 2 2 ...
$ income: num 71703 41387 42450 42365 42794 ...
$ dist : num 5 10 15 18 14 15 10 20 18 14 ...
$ ttime : num 9 16 20 32 19 20 17 36 36 22 ...
$ catinc: num 3 2 2 2 2 3 2 3 2 3 ...
$ speed : num 33.3 37.5 45 33.8 44.2 ...
> summary(HSTravel)
hhsize gender numveh income dist
Min. :1.00 FEMALE: 9 Min. :1.00 Min. : 41057 Min. : 5.0
1st Qu.:2.00 MALE :11 1st Qu.:1.75 1st Qu.: 42708 1st Qu.:10.0
Median :3.00 Median :2.00 Median : 55171 Median :14.0
Mean :3.10 Mean :2.05 Mean : 61495 Mean :13.5
3rd Qu.:4.25 3rd Qu.:3.00 3rd Qu.: 69209 3rd Qu.:18.0
Max. :5.00 Max. :3.00 Max. :109798 Max. :24.0
ttime catinc speed
Min. : 8.00 Min. :2.0 Min. :30.00
1st Qu.:14.75 1st Qu.:2.0 1st Qu.:33.65
Median :19.50 Median :3.0 Median :37.84
Mean :21.40 Mean :2.6 Mean :38.61
3rd Qu.:29.00 3rd Qu.:3.0 3rd Qu.:42.94
Max. :39.00 Max. :3.0 Max. :45.00
> summary(HSTravel$income)
Min. 1st Qu. Median Mean 3rd Qu. Max.
41060 42710 55170 61490 69210 109800
The mean, variance, and standard deviation of individual variables can be obtained as shown below
> mean(HSTravel$income)
[1] 61494.65
> var(HSTravel$income)
[1] 468664164
> sd(HSTravel$income)
[1] 21648.65
The summary()
command can be nested with a subset()
command to obtain summary statistics of subset of dataset. For example, we can obtain summary statistics of all households earning more than 60000 as follows:
> summary(subset(HSTravel, income >= 60000))
hhsize gender numveh income dist
Min. :1.0 FEMALE:6 Min. :1.00 Min. : 62354 Min. : 5.00
1st Qu.:2.0 MALE :2 1st Qu.:1.00 1st Qu.: 68095 1st Qu.: 6.50
Median :3.0 Median :1.50 Median : 76694 Median :11.00
Mean :3.0 Mean :1.75 Mean : 82763 Mean :11.62
3rd Qu.:3.5 3rd Qu.:2.25 3rd Qu.: 96981 3rd Qu.:15.50
Max. :5.0 Max. :3.00 Max. :109798 Max. :20.00
ttime catinc speed
Min. : 8.00 Min. :3 Min. :33.33
1st Qu.: 9.75 1st Qu.:3 1st Qu.:36.46
Median :15.50 Median :3 Median :40.09
Mean :18.00 Mean :3 Mean :39.05
3rd Qu.:23.50 3rd Qu.:3 3rd Qu.:42.48
Max. :36.00 Max. :3 Max. :42.86
The frequency of discrete variables can obtained as:
> table(HSTravel$hhsize)
1 2 3 4 5
4 2 7 2 5
If you want to display the frequencies in columen format:
> cbind(table(HSTravel$hhsize))
[,1]
1 4
2 2
3 7
4 2
5 5
The relative frequencies can be obtained either using the prop.table()
command or by dividing the frequencies by the total number of entries as shown below:
> cbind(prop.table(table(HSTravel$hhsize)))
[,1]
1 0.20
2 0.10
3 0.35
4 0.10
5 0.25
> cbind(table(HSTravel$hhsize)/length(HSTravel$hhsize))
[,1]
1 0.20
2 0.10
3 0.35
4 0.10
5 0.25
Determining the frequency table for continuous variables is more complicated as we first need to specify the bins. From the summary statistics, we know that the minimum value of income is 41057 and the maximum value is 109798. Let the bins be from 40000 to 110000 at intervals of 10000. We use the seq()
command to specify the breaks and the cut()
command to specify the categories. Note that adding the argument right = FALSE
makes the intervals closed on the left and open on the right.
> breaks = seq(40000, 110000, 10000)
> breaks
[1] 40000 50000 60000 70000 80000 90000 100000 110000
> incomecat1 = cut(HSTravel$income, breaks)
> incomecat1
[1] (7e+04,8e+04] (4e+04,5e+04] (4e+04,5e+04] (4e+04,5e+04]
[5] (4e+04,5e+04] (5e+04,6e+04] (4e+04,5e+04] (6e+04,7e+04]
[9] (4e+04,5e+04] (8e+04,9e+04] (6e+04,7e+04] (5e+04,6e+04]
[13] (4e+04,5e+04] (5e+04,6e+04] (9e+04,1e+05] (1e+05,1.1e+05]
[17] (1e+05,1.1e+05] (6e+04,7e+04] (5e+04,6e+04] (4e+04,5e+04]
7 Levels: (4e+04,5e+04] (5e+04,6e+04] (6e+04,7e+04] ... (1e+05,1.1e+05]
> incomecat2 = cut(HSTravel$income, breaks, right = FALSE)
> incomecat2
[1] [7e+04,8e+04) [4e+04,5e+04) [4e+04,5e+04) [4e+04,5e+04)
[5] [4e+04,5e+04) [5e+04,6e+04) [4e+04,5e+04) [6e+04,7e+04)
[9] [4e+04,5e+04) [8e+04,9e+04) [6e+04,7e+04) [5e+04,6e+04)
[13] [4e+04,5e+04) [5e+04,6e+04) [9e+04,1e+05) [1e+05,1.1e+05)
[17] [1e+05,1.1e+05) [6e+04,7e+04) [5e+04,6e+04) [4e+04,5e+04)
7 Levels: [4e+04,5e+04) [5e+04,6e+04) [6e+04,7e+04) ... [1e+05,1.1e+05)
The frequencies can now be determined using the table()
command.
> cbind(table(incomecat1))
[,1]
(4e+04,5e+04] 8
(5e+04,6e+04] 4
(6e+04,7e+04] 3
(7e+04,8e+04] 1
(8e+04,9e+04] 1
(9e+04,1e+05] 1
(1e+05,1.1e+05] 2
> cbind(table(incomecat2))
[,1]
[4e+04,5e+04) 8
[5e+04,6e+04) 4
[6e+04,7e+04) 3
[7e+04,8e+04) 1
[8e+04,9e+04) 1
[9e+04,1e+05) 1
[1e+05,1.1e+05) 2
Exercise:
In the HSTravel
dataframe
- Determine the average and standard deviation of income for households whose main earning member is a female
- Determine the relative frequencies of distance traveled using bins of size 5.
- Determine the summary statistics of income and travel time to work for households with two vehicles.
Apply() functions
apply()
commands can be used to apply functions to each row and column of the dataframes. Consider a data frame containing hhsize, number of vehicles, income, and distance.
> HSTravel4 <- HSTravel[,c(1,3,4,5,6)]
> head(HSTravel4)
hhsize numveh income dist ttime
1 3 1 71703 5 9
2 5 1 41387 10 16
3 5 2 42450 15 20
4 4 2 42365 18 32
5 1 2 42794 14 19
6 5 3 56645 15 20
The following command applies the mean function to each row of the dataframe. The 1
specifies that mean should be applied row wise.
> x <- apply(HSTravel4,1, mean)
> class(x)
[1] "numeric"
> x
[1] 14344.2 8283.8 8498.4 8484.2 8566.0 11337.6 8217.4 12483.6
[9] 9312.0 16345.0 13681.0 10770.4 9755.8 11199.6 18705.0 21969.8
[17] 21492.8 13454.0 10884.2 8354.0
The following command applies the mean function to each column of the dataframe. The 2
specifies that mean should be applied column wise.
> apply(HSTravel4,2, mean)
hhsize numveh income dist ttime
3.10 2.05 61494.65 13.50 21.40
R has several functions belonging to the apply()
family. For example, the lapply()
function returns a list.
> x <- lapply(HSTravel4,mean)
> class(x)
[1] "list"
> x
$hhsize
[1] 3.1
$numveh
[1] 2.05
$income
[1] 61494.65
$dist
[1] 13.5
$ttime
[1] 21.4
Similarly sapply()
can be applied to a data frame and it returns a vector. dapply()
returns a data frame.
tapply()
is another powerful command if you want to break down a variable into groups and apply a function on each of those groups. For example, let us say you want to calculate the average distance for each number of vehicle.
> x <- tapply(HSTravel$dist,HSTravel$numveh, mean)
> class(x)
[1] "array"
> x
1 2 3
10.00000 13.77778 16.00000
The average travel distance is calculated for numveh=1,2,
and 3
.
If you want to calculate the average travel distance for each combination of hhsize and numveh, then
> x <- tapply(HSTravel$dist,list(HSTravel$hhsize, HSTravel$numveh), mean)
> class(x)
[1] "matrix"
> x
1 2 3
1 5 12.00 19.0
2 15 NA NA
3 5 12.25 17.5
4 NA 18.00 NA
5 10 15.00 14.0
The NA indicates that there are no datapoints with hhsize=2
and numveh=2
.
Exercise:
- In the
HSTravel
dataframe, determine the median and standard deviation of income for households for each gender. - In the
HSTravel4
dataframe, determine the median for each variable and store the result as a vector.
Reading Input from Files
Download the file HHData400.csv into your working directory. The data set is a CSV or comma separated text file. The following read.csv
command reads in the file and stores it as a data frame named HSD
.
> HSD <- read.csv("HHData400.csv" , header = TRUE)
> names(HSD)
[1] "numwork" "hhsize" "numveh" "gender" "inc" "dist" "ttime"
Note that the command read.csv()
is used for reading values from a comma separated text file. The first item in the bracket HHData.csv
corresponds to the file name. If the file is not located in your working directory then you will need to provide the full path of the file location. The second argument header=TRUE
indicates that the first row in the data file contains the headers which indicate what each column corresponds to. If your file does not have a header row, then you dont need to provide this argument.
I normally use csv files and hence stick to read.csv()
. If the data entries are separated by tab spaces instead of commas we can use read.table()
. For importing data sets from excel add on packages can be installed. However for beginners I recommend saving the excel file as a csv file and using read.csv()
.
If you want to change the name of a specific column, the names()
command can be used in the following manner.
> names(HSD)[names(HSD)=="inc"] <- "Income"
> names(HSD)
[1] "numwork" "hhsize" "numveh" "gender" "Income" "dist" "ttime"
If you want to rename the entire set of columns
> names(HSD) <- c ("Num of Workers", "Household Size", "Number of Vehicles" , "Gender", "Income", "Distance", "Travel Time")
> names(HSD)
[1] "Num of Workers" "Household Size" "Number of Vehicles"
[4] "Gender" "Income" "Distance"
[7] "Travel Time"
> str(HSD)
'data.frame': 400 obs. of 7 variables:
$ Num of Workers : int 1 3 2 2 1 1 3 2 2 2 ...
$ Household Size : int 2 3 3 2 1 3 4 3 2 4 ...
$ Number of Vehicles: int 2 3 1 1 0 2 4 2 1 3 ...
$ Gender : Factor w/ 2 levels "FEMALE","MALE": 1 2 2 2 1 2 2 2 1 2 ...
$ Income : num 45290 168343 127862 167421 78439 ...
$ Distance : num 50.7 77.3 64.9 30.6 26.4 ...
$ Travel Time : num 67.6 106.2 82.6 48.5 39.6 ...
Exercise
Download the titanic dataset titanic.csv to your working directory. The dataset should have 14 variables and 1309 observations. This dataset is the titanic3 dataset created by Thomas Cason of UVA. For more details on the dataset click this link: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt
- Create a new dataset for class 1, class 2 and class 3 passengers?
- Determine the number of class 1, 2, and 3 passengers?
- Determine the proportion of class 1, 2, and 3 passengers who survived?
- Determine the proportion of males and females in class 1, 2, and 3 who did not survive?
Plots
We will use the HHData400.csv to generate plots.
> HSD <- read.csv("HHData400.csv" , header = TRUE)
> names(HSD)
[1] "numwork" "hhsize" "numveh" "gender" "inc" "dist" "ttime"
The plot()
command is the simplest and versatile command for generating plots.
> HSD <- read.csv("HHData400.csv" , header = TRUE)
> names(HSD)
[1] "numwork" "hhsize" "numveh" "gender" "inc" "dist" "ttime"
plot(var1, var2) will create a scatter plot where
var1and
var2` are vectors. For example to create a scatter plot between distance and travel time use the following command.
> plot(HSD$dist, HSD$ttime)
The command plot()
has a number of arguments which can be used to generate different type of plots. For a list of all arguments associated with the plot()
command type ?plot
and press enter.
> plot(HSD$dist, HSD$ttime, xlab=" Distance (miles)", ylab= "Travel Time (minutes)", main = "Travel Time vs Distance" , col = "red" , frame.plot=FALSE, xlim = c(0,140))
The above command makes the following modifications:
- Add labels to the x and y axis using
xlab
andylab
arguments - Add a main title to the plot using the
main
argument - Changes the color of the dots to red using
col
argument - Removes the frame using the
frame.plot
argument - Sets the limits on the x axis using the
xlim
argument
For factor variables plot()
generates frequency plots. For discrete variables which are not factors, barplot()
command can be combined with table()
command to generate frequency plots.
> plot(HSD$gender , xlab = "Gender", ylab = "Frequency", ylim = c(0,300), main = "Household Main Earning Member")
> freq <- table(HSD$numveh)
> barplot(freq , xlab = "Number of Vehicles", ylab = "Frequency", ylim = c(0,140), main = "Household Vehicle Ownership")
Histograms are created for the continuous variables using the hist()
command.
> hist(HSD$ttime)
The plot can be made more cleaner as shown below.
> hist(HSD$ttime, xlab = "Travel Time (in minutes)" , main = "Travel Time", col = "green")
You can specify equally sized bins of your choosing by adding a breaks argument.
> hist(HSD$ttime, xlab = "Travel Time (in minutes)" , main = "Travel Time", col = "green" , breaks = seq(0,250,50))
Box plots can be generated using the boxplot()
command as shown below
> boxplot(HSD$dist, ylab = "Distance (miles)", main = "Distance")
Boxplots can also be generated for continuous variable separated by groups if the variable denoting the groups is represented as a factor variable.
> boxplot(dist ~ gender, data = HSD, xlab = "Gender " , ylab = "Distance (miles)", main = "Distance")
The generated plots can be saved as a PNG image file in the working directory as shown below. Dont forget to run the dev.off()
command after the plotting command.
> png("DistBoxPlot.png")
> boxplot(dist ~ gender, data = HSD, xlab = "Gender " , ylab = "Distance (miles)", main = "Distance")
> dev.off()
The generated plots can be saved as a PDF file in the working directory as shown below.
> pdf("DistBoxPlot.pdf")
> boxplot(dist ~ gender, data = HSD, xlab = "Gender " , ylab = "Distance (miles)", main = "Distance")
> dev.off()
Exercise
- Plot the histogram for income using the following bins of size 50000 starting at 0.
- Plot the bar plot for household size.
- Plot parallel box plots for income for each value of number of workers.
Probabilities
R can be used to calculate the following quantities for a distribution:
- Point Probability
- Cumulative Probability
- Percentile also known as Quantile
- Pseudo Random Numbers
The functions dnorm(), pnorm(), qnorm()
, and rnorm()
deliver the above quantities for the normal distribution. Let us consider a normal distribution with mean 10 and standard deviation 5. Then the point probability, cumulative probability at 10 and the 60th percentile can be generated as shown below.
> u <- dnorm(10, 10, 5)
> u
[1] 0.07978846
> v <- pnorm(10, 10, 5)
> v
[1] 0.5
> w <- qnorm(0.6, 10, 5)
> w
[1] 11.26674
Fifty random realizations from a normal distribution with mean 10 and standard deviation 5 can be generated as follows:
> x <- rnorm(50, 10, 5)
The bell curve can be plotted to verify the functions:
> x <- seq(-20, 40, 0.1)
> plot(x, dnorm(x, 10, 5), type = "l" , main = "Normal PDF", ylab = "f(x)" , xlab = "x")
The corresponding equivalent functions for a binomial distribution are dbinom(), pbinom(), qbinom()
, and rbinom()
respectively. Let us consider a binomial distribution with 10 trials and the probability of success in each trial being 0.35. The probability of the outcome being 5, the cumulative probability at 5, the 50th percentile can be generated as shown below.
> u <- dbinom(5, 10, 0.35)
> u
[1] 0.1535704
> v <- dbinom(5, 10, 0.35)
> v
[1] 0.1535704
> w <- qbinom(0.5, 10, 0.35)
> w
[1] 3
The probability density function for the above binomial distribution can be plotted as shown below:
> x <- 1:10
> plot(x, dbinom(x, 10, 0.35), type = "h" , main = "Binomial PDF", ylab = "p(x)" , xlab = "x")
Exercise
Consider a Poisson distribution with lambda parameter of 20.
- Determine the probability of an outcome being equal to 15
- Determine the probability of an outcome being less than equal to 15
- Determine the 75th percentile
- Generate 50 random realizations from this distribution
Packages
An R package is a collection of function and data sets which add functionality to R. R comes with its base set of packages. The command library()
lists all packages which are currently available in your R installation.
> library()
You can add more features to your current R setup by installing more packages. This is done using the install.packages("Package Name")
command. At this point R has around 12000 packages.
> install.packages("Package Name")
Some of the popular packages I use are “car”, “dplyr”, and “ggplot2”. The entire list of packages is available at the CRAN website: https://cran.r-project.org/web/packages/.
When you install a specific package, it adds additional functionalities to the base R installation. In order to use those commands, you will have to load the package first using the library()
command. For example, if you want to use the ggplot2 plotting capabilities, first install the package and then load it.
> install.packages("ggplot2")
> library(ggplot2)
You only need to install the package once in your system. However, you will have to load the package each time you start an R session if you want to use its functionalities.