Some languages like C requires to specify the type of the variable first and then assign numbers, e.g. int abc; abc = 2;
. But R does not require such thing. R automatically generates the variable and changes the variable type accordingly.
# the basic value assignment operator that always works
abc <- 2 # in this line, R creates the variable "abc" and assigns 2
abc
## [1] 2
# the easy-to-use operator that rarely, but not never, produces error
abc = 2
abc
## [1] 2
# you can also assign arrays
abc = c(1,2,3,4,5)
abc
## [1] 1 2 3 4 5
# assigning the integer sequence
abc = 1:5
abc
## [1] 1 2 3 4 5
abc = 3:(-2)
abc
## [1] 3 2 1 0 -1 -2
# you can use the function "rep" to create repeating vectors
abc = rep(2, 5)
abc
## [1] 2 2 2 2 2
abc = rep(c(1,2,3), 5)
abc
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
# you can use the function "seq" to assign an equispaced sequence
abc = seq(from=0, to=1, by=0.1)
abc
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
abc = seq(from=0, to=1, length=11)
abc
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
You can write the custom functions in R easily. The functions do not need to be written in separate files while some languages like Matlab requires to do so.
# to write a custom function, assign FUNCTION to a VARIABLE
add = function(x,y) {
return(x+y)
}
square = function(x) {
return(x^2)
}
# the return statement can be omitted. R returns what is written in the last line.
square_noReturnStatement = function(x) {
x^2
}
# to call a function, write the name and put arguments inside the bracket
abc = 3
add(x=abc, y=2)
## [1] 5
square(x=3)
## [1] 9
square_noReturnStatement(x=3)
## [1] 9
# you can omit the labels
add(abc,2)
## [1] 5
# of course you can assign the returned value to a variable
result = square(abc)
result
## [1] 9
By the way, now we can have better understanding of what we mean by R automatically changing the variable types:
# right now, a function is assigned in the variable "add":
add
## function(x,y) {
## return(x+y)
## }
# we know it works as a function
add(abc, 2)
## [1] 5
# let's assign a number to "add":
add = 3;
add
## [1] 3
# then the variable "add" is no longer a function:
add(abc, 2)
## Error in eval(expr, envir, enclos): could not find function "add"
Consider the following code.
myFunc = function(x) {
xSquare = x^2
return(xSquare/3)
}
myFunc(x = 2)
## [1] 1.333333
xSquare
## Error in eval(expr, envir, enclos): object 'xSquare' not found
We see that xSquare
is not defined, even though it should have been used in myFunc
. What happens in R, and in many other programming languages, is the following. When we call myFunc
, R creates a sub-environment (i.e. a storage space) and define xSquare
there. Then, when it returns xSquare/3
and escape myFunc
, it deletes the sub-environment. Therefore, xSquare
is not stored anymore.
The sub-environment is like a child of the original environment, and so we call the original one as the parent environment in the perspective of sub-environment. The very top environment is called the global environment.
We call the variables in the current environment as “local variables”.
Now consider the following code.
myFunc = function(x) {
xSquare = x^2
return(xSquare/3)
}
myFunc2 = function(x) {
xInverse = 1/x
resultMyFunc = myFunc(xInverse)
return(resultMyFunc)
}
myFunc3 = function(x) {
xCube = x^3
resultMyFunc2 = myFunc2(xCube)
return(resultMyFunc2)
}
myFunc3(x = 2)
## [1] 0.005208333
When we call myFunc3
in the last line, R creates a sub-environment for myFunc3
. Then, when myFunc3
calls myFunc2
within its code, R creates an environment for myFunc2
that is a sub-environment of myFunc3
. Similarly, when myFunc2
calls myFunc
, R creates another sub-environment.
When the function creates the sub-environment, it stores the arguments of the function in the environment. In the above example, every environment has x
variable, although they will all have different values.
How much the sub-environment can access variables in the parent environment is different across languages. In R, the functions can read and change values of the variables in the parent environments.
myFuncValue = 0
myFunc5 = function() {
print(myFuncValue)
myFuncValue = 2
myFunc4 = function() {
print(myFuncValue)
myFuncValue = 1
print(myFuncValue)
}
myFunc4()
print(myFuncValue)
}
myFunc5()
## [1] 0
## [1] 2
## [1] 1
## [1] 2
print(myFuncValue)
## [1] 0
In the above code, myFuncValue
is present in all of the environments. The function always first search for the local variable when it needs to access the variable.
Now let’s see what happens if we define myFunc4
not in the environment of myFunc5
but in the global environment.
myFuncValue = 0
myFunc4 = function() {
print(myFuncValue)
myFuncValue = 1
print(myFuncValue)
}
myFunc5 = function() {
print(myFuncValue)
myFuncValue = 2
myFunc4()
print(myFuncValue)
}
myFunc5()
## [1] 0
## [1] 0
## [1] 1
## [1] 2
print(myFuncValue)
## [1] 0
In R, we can also change the value of the variable in the parent environment using <<-
.
# let's add lines that orders the functions to print variables of parent environments:
xx = 2
myFunc6 = function(x) {
xSquare = x^2
myFunc7()
return(xSquare)
}
myFunc7 = function() {
xx <<- 0
}
myFunc6(x = xx)
## [1] 4
xx
## [1] 0
However, accessing variables of the parent environment is not recommended, as it is very likely to produce errors and wrong results. If you need to access variables of the parent environment, always pass it to sub-environment by putting them as arguments.
a = 2
b = 3
myFunc8 = function(x) { # not recommended!
return(x + a + b)
}
myFunc9 = function(x,a,b) {
return(x + a + b) # as mentioned before, the function will use the "local" a and b.
}
There may be other ways, but one way to create a matrix is the following.
mat = matrix(0, nrow=3, ncol=2)
mat
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
sampleArray = c(1,2,3,4,5,6)
mat = matrix(sampleArray, nrow=3, ncol=2)
mat
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
mat = matrix(sampleArray, nrow=3, ncol=2, byrow = TRUE)
mat
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
# you can access the entries in a usual way.
# In some languages such as C, the first entry is labeled 0.
# In R, the first entry is 1.
mat[1,1]
## [1] 1
mat[3,2]
## [1] 6
mat[0,1]
## numeric(0)
# you can also access rows or columns
mat[3,]
## [1] 5 6
mat[,2]
## [1] 2 4 6
# you can also call the submatrices
mat[c(1,2),]
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
# a useful trick
mat[c(1,3,3,3,2,1,3),]
## [,1] [,2]
## [1,] 1 2
## [2,] 5 6
## [3,] 5 6
## [4,] 5 6
## [5,] 3 4
## [6,] 1 2
## [7,] 5 6
The array
function with 2 dimensions also returns a matrix. It is also useful in producing 3 dimensional arrays.
arr = array(0, dim = c(3,2))
arr
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
arr = array(0, dim = c(3,2,4))
arr
## , , 1
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
##
## , , 2
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
##
## , , 3
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
##
## , , 4
##
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
The data.frame
is a variable type that handles the datasets.
ctrs = c("U.S.", "Canada", "Mexico")
pops = c(318.9, 35.16, 122.3)
# you create a data frame as follows:
df = data.frame(Country=ctrs, Population=pops)
df
## Country Population
## 1 U.S. 318.90
## 2 Canada 35.16
## 3 Mexico 122.30
The data.frame
is a matrix with the data manipulation capabilities.
# you can access the entries like the matrix
df[3,2]
## [1] 122.3
# you can also access a column by its name. Use money symbol.
df$Population
## [1] 318.90 35.16 122.30
# you can also subset the dataset
subDF = subset(df, Population >= 100)
subDF
## Country Population
## 1 U.S. 318.9
## 3 Mexico 122.3
subDF = subset(df, Population >= 100, select = c("Country"))
subDF
## Country
## 1 U.S.
## 3 Mexico
We may use data.table
intead of data.frame
in the class. It is an enhancement of data.frame
which is created by contributors of R community.
data.table
is not automatically installed with base R. We need to install a “package” for data.table
, which is basically a collection of codes. We install the package in our computer by the following code. R uses internet and downloads the package from the R package database.
# the name of the package that allows us to use data.table is "data.table".
install.packages("data.table")
Now we have installed the package. As we do when we install and use computer programs like MS Office, We need to “execute” or “load” the package to use it. We load the package by the following code:
library(data.table)
Now we discuss how data.frame
works. It works like data.frame
.
# recall:
ctrs
## [1] "U.S." "Canada" "Mexico"
pops
## [1] 318.90 35.16 122.30
# create datatable
dt = data.table(Country=ctrs, Population=pops)
dt
## Country Population
## 1: U.S. 318.90
## 2: Canada 35.16
## 3: Mexico 122.30
We can also transform data.frame into data.table.
# recall:
df
## Country Population
## 1 U.S. 318.90
## 2 Canada 35.16
## 3 Mexico 122.30
class(df)
## [1] "data.frame"
# transform.
dt = as.data.table(df)
dt
## Country Population
## 1: U.S. 318.90
## 2: Canada 35.16
## 3: Mexico 122.30
class(dt)
## [1] "data.table" "data.frame"
data.table
has additional features compared to data.frame
. For example, to subset a data.table, do the following.
# subset rows with pop >= 100
dt[Population >= 100]
## Country Population
## 1: U.S. 318.9
## 2: Mexico 122.3
# subset rows but choose the country column only.
dt[Population >= 100, Country]
## [1] U.S. Mexico
## Levels: Canada Mexico U.S.
# if you want to be still in the data.table format:
dt[Population >= 100, list(Country)]
## Country
## 1: U.S.
## 2: Mexico
class(dt[Population >= 100, list(Country)])
## [1] "data.table" "data.frame"
We will see other features of data.table
soon. For more information, look at https://github.com/Rdatatable/data.table/wiki/Getting-started, which is a webpage created by the author of data.table
package.
list
is a variable type that can store various sub-variables. It is like struct
in C.
abc = list(a=3, b="John Doe", c=123.456)
# to access the sub-variables, call them using the money symbol:
abc$a
## [1] 3
abc$b
## [1] "John Doe"
abc$c
## [1] 123.456
Inside the list
, you can save everything as a sub-variable. You can save functions, you can save matrices, you can save a list inside a list, and so on. The following example stores various model primitives of a matching model in Labor economics.
# model primitives
model = list(
r = 0.05, # interest rate
s = 0.02, # separation rate
matchFunc = function(u,v) { 0.4 * u^0.5 * v^0.5 }, # matching function
b = 0, # unemployment benefit
c = 0, # cost of vacancy
alpha = 0.5, # bargaining power
prodFunc = function(x,y) { (x^(-2) + y^(-2))^(1/(-2)) + 0.2 } # production function
)
We can call a function inside the list in a usual way:
model$prodFunc(1,2)
## [1] 1.094427
Instead of assigning names to the sub-variables, we can leave the names blank, in which case the serial numbers are assigned.
abc = list(3, "John Doe", 123.456)
abc
## [[1]]
## [1] 3
##
## [[2]]
## [1] "John Doe"
##
## [[3]]
## [1] 123.456
# to access the elements of the list, use the double brackets:
abc[[1]]
## [1] 3
abc[[2]]
## [1] "John Doe"
abc[[3]]
## [1] 123.456
If you are familiar with syntax of other languages such as matlab, stata, python, etc., the syntax is very much similar in R. For example, the for
loop looks like this:
# the canonical for loop with natural numbers
sum = 0
for(i in 1:4) {
sum = sum + i
}
sum
## [1] 10
# the loop works for ANY array
greekLetterArray = c("Alpha", "Beta", "Gamma", "Delta")
for(greekLetter in greekLetterArray) {
print(greekLetter)
}
## [1] "Alpha"
## [1] "Beta"
## [1] "Gamma"
## [1] "Delta"
I will not spend time on the syntax, but you can use google to learn how to write a certain syntax, or you can come to the office hour.
The plyr
package is a package for data manipulation. Let’s look at how the functions in the plyr
package are used by example. First, let’s take a dataset from the MASS
package. The MASS
package is pre-installed in R, and so we can simply load it without installing it.
library(MASS)
# the MASS package contains a dataset named "Insurance" which is a dataset about the insurance claims.
Insurance
## District Group Age Holders Claims
## 1 1 <1l <25 197 38
## 2 1 <1l 25-29 264 35
## 3 1 <1l 30-35 246 20
## 4 1 <1l >35 1680 156
## 5 1 1-1.5l <25 284 63
## 6 1 1-1.5l 25-29 536 84
## 7 1 1-1.5l 30-35 696 89
## 8 1 1-1.5l >35 3582 400
## 9 1 1.5-2l <25 133 19
## 10 1 1.5-2l 25-29 286 52
## 11 1 1.5-2l 30-35 355 74
## 12 1 1.5-2l >35 1640 233
## 13 1 >2l <25 24 4
## 14 1 >2l 25-29 71 18
## 15 1 >2l 30-35 99 19
## 16 1 >2l >35 452 77
## 17 2 <1l <25 85 22
## 18 2 <1l 25-29 139 19
## 19 2 <1l 30-35 151 22
## 20 2 <1l >35 931 87
## 21 2 1-1.5l <25 149 25
## 22 2 1-1.5l 25-29 313 51
## 23 2 1-1.5l 30-35 419 49
## 24 2 1-1.5l >35 2443 290
## 25 2 1.5-2l <25 66 14
## 26 2 1.5-2l 25-29 175 46
## 27 2 1.5-2l 30-35 221 39
## 28 2 1.5-2l >35 1110 143
## 29 2 >2l <25 9 4
## 30 2 >2l 25-29 48 15
## 31 2 >2l 30-35 72 12
## 32 2 >2l >35 322 53
## 33 3 <1l <25 35 5
## 34 3 <1l 25-29 73 11
## 35 3 <1l 30-35 89 10
## 36 3 <1l >35 648 67
## 37 3 1-1.5l <25 53 10
## 38 3 1-1.5l 25-29 155 24
## 39 3 1-1.5l 30-35 240 37
## 40 3 1-1.5l >35 1635 187
## 41 3 1.5-2l <25 24 8
## 42 3 1.5-2l 25-29 78 19
## 43 3 1.5-2l 30-35 121 24
## 44 3 1.5-2l >35 692 101
## 45 3 >2l <25 7 3
## 46 3 >2l 25-29 29 2
## 47 3 >2l 30-35 43 8
## 48 3 >2l >35 245 37
## 49 4 <1l <25 20 2
## 50 4 <1l 25-29 33 5
## 51 4 <1l 30-35 40 4
## 52 4 <1l >35 316 36
## 53 4 1-1.5l <25 31 7
## 54 4 1-1.5l 25-29 81 10
## 55 4 1-1.5l 30-35 122 22
## 56 4 1-1.5l >35 724 102
## 57 4 1.5-2l <25 18 5
## 58 4 1.5-2l 25-29 39 7
## 59 4 1.5-2l 30-35 68 16
## 60 4 1.5-2l >35 344 63
## 61 4 >2l <25 3 0
## 62 4 >2l 25-29 16 6
## 63 4 >2l 30-35 25 8
## 64 4 >2l >35 114 33
# of course we can also use data.table
dtInsurance = as.data.table(Insurance)
dtInsurance
## District Group Age Holders Claims
## 1: 1 <1l <25 197 38
## 2: 1 <1l 25-29 264 35
## 3: 1 <1l 30-35 246 20
## 4: 1 <1l >35 1680 156
## 5: 1 1-1.5l <25 284 63
## 6: 1 1-1.5l 25-29 536 84
## 7: 1 1-1.5l 30-35 696 89
## 8: 1 1-1.5l >35 3582 400
## 9: 1 1.5-2l <25 133 19
## 10: 1 1.5-2l 25-29 286 52
## 11: 1 1.5-2l 30-35 355 74
## 12: 1 1.5-2l >35 1640 233
## 13: 1 >2l <25 24 4
## 14: 1 >2l 25-29 71 18
## 15: 1 >2l 30-35 99 19
## 16: 1 >2l >35 452 77
## 17: 2 <1l <25 85 22
## 18: 2 <1l 25-29 139 19
## 19: 2 <1l 30-35 151 22
## 20: 2 <1l >35 931 87
## 21: 2 1-1.5l <25 149 25
## 22: 2 1-1.5l 25-29 313 51
## 23: 2 1-1.5l 30-35 419 49
## 24: 2 1-1.5l >35 2443 290
## 25: 2 1.5-2l <25 66 14
## 26: 2 1.5-2l 25-29 175 46
## 27: 2 1.5-2l 30-35 221 39
## 28: 2 1.5-2l >35 1110 143
## 29: 2 >2l <25 9 4
## 30: 2 >2l 25-29 48 15
## 31: 2 >2l 30-35 72 12
## 32: 2 >2l >35 322 53
## 33: 3 <1l <25 35 5
## 34: 3 <1l 25-29 73 11
## 35: 3 <1l 30-35 89 10
## 36: 3 <1l >35 648 67
## 37: 3 1-1.5l <25 53 10
## 38: 3 1-1.5l 25-29 155 24
## 39: 3 1-1.5l 30-35 240 37
## 40: 3 1-1.5l >35 1635 187
## 41: 3 1.5-2l <25 24 8
## 42: 3 1.5-2l 25-29 78 19
## 43: 3 1.5-2l 30-35 121 24
## 44: 3 1.5-2l >35 692 101
## 45: 3 >2l <25 7 3
## 46: 3 >2l 25-29 29 2
## 47: 3 >2l 30-35 43 8
## 48: 3 >2l >35 245 37
## 49: 4 <1l <25 20 2
## 50: 4 <1l 25-29 33 5
## 51: 4 <1l 30-35 40 4
## 52: 4 <1l >35 316 36
## 53: 4 1-1.5l <25 31 7
## 54: 4 1-1.5l 25-29 81 10
## 55: 4 1-1.5l 30-35 122 22
## 56: 4 1-1.5l >35 724 102
## 57: 4 1.5-2l <25 18 5
## 58: 4 1.5-2l 25-29 39 7
## 59: 4 1.5-2l 30-35 68 16
## 60: 4 1.5-2l >35 344 63
## 61: 4 >2l <25 3 0
## 62: 4 >2l 25-29 16 6
## 63: 4 >2l 30-35 25 8
## 64: 4 >2l >35 114 33
## District Group Age Holders Claims
# some useful functions for looking at the dataset
head(Insurance, n=5) # you can omit n, in which case it is set to 6.
## District Group Age Holders Claims
## 1 1 <1l <25 197 38
## 2 1 <1l 25-29 264 35
## 3 1 <1l 30-35 246 20
## 4 1 <1l >35 1680 156
## 5 1 1-1.5l <25 284 63
tail(Insurance) # ditto
## District Group Age Holders Claims
## 59 4 1.5-2l 30-35 68 16
## 60 4 1.5-2l >35 344 63
## 61 4 >2l <25 3 0
## 62 4 >2l 25-29 16 6
## 63 4 >2l 30-35 25 8
## 64 4 >2l >35 114 33
str(Insurance) # displays structure of the data.frame
## 'data.frame': 64 obs. of 5 variables:
## $ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ Group : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Age : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Holders : int 197 264 246 1680 284 536 696 3582 133 286 ...
## $ Claims : int 38 35 20 156 63 84 89 400 19 52 ...
To begin discussion, let’s install plyr
.
install.packages("plyr")
The plyr
package has useful functions that summarize information in data.frame
. One useful function is ddply
which can generate various summary statistics of a data.frame
.
# let's load the package
library(plyr)
# this command computes average number of holders and claims for each district
ddply(.data = Insurance, .variables = .(District), .fun = summarize, meanHolders=mean(Holders), meanClaims=mean(Claims))
## District meanHolders meanClaims
## 1 1 659.0625 86.3125
## 2 2 415.8125 55.6875
## 3 3 260.4375 34.5625
## 4 4 124.6250 20.3750
# this command computes total number of holders and claims for each age group
ddply(.data = Insurance, .variables = .(Age), .fun = summarize, totHolders=sum(Holders), totClaims=sum(Claims))
## Age totHolders totClaims
## 1 <25 1138 229
## 2 25-29 2336 404
## 3 30-35 3007 453
## 4 >35 16878 2065
# we can also summarize numbers for each district+age group:
ddply(.data = Insurance, .variables = .(District, Age), .fun = summarize, totHolders=sum(Holders), totClaims=sum(Claims))
## District Age totHolders totClaims
## 1 1 <25 638 124
## 2 1 25-29 1157 189
## 3 1 30-35 1396 202
## 4 1 >35 7354 866
## 5 2 <25 309 65
## 6 2 25-29 675 131
## 7 2 30-35 863 122
## 8 2 >35 4806 573
## 9 3 <25 119 26
## 10 3 25-29 335 56
## 11 3 30-35 493 79
## 12 3 >35 3220 392
## 13 4 <25 72 14
## 14 4 25-29 169 28
## 15 4 30-35 255 50
## 16 4 >35 1498 234
# also see what happens when we put "mutate" instead of "summarize":
ddply(.data = Insurance, .variables = .(District, Age), .fun = mutate, totHolders=sum(Holders), totClaims=sum(Claims))
## District Group Age Holders Claims totHolders totClaims
## 1 1 <1l <25 197 38 638 124
## 2 1 1-1.5l <25 284 63 638 124
## 3 1 1.5-2l <25 133 19 638 124
## 4 1 >2l <25 24 4 638 124
## 5 1 <1l 25-29 264 35 1157 189
## 6 1 1-1.5l 25-29 536 84 1157 189
## 7 1 1.5-2l 25-29 286 52 1157 189
## 8 1 >2l 25-29 71 18 1157 189
## 9 1 <1l 30-35 246 20 1396 202
## 10 1 1-1.5l 30-35 696 89 1396 202
## 11 1 1.5-2l 30-35 355 74 1396 202
## 12 1 >2l 30-35 99 19 1396 202
## 13 1 <1l >35 1680 156 7354 866
## 14 1 1-1.5l >35 3582 400 7354 866
## 15 1 1.5-2l >35 1640 233 7354 866
## 16 1 >2l >35 452 77 7354 866
## 17 2 <1l <25 85 22 309 65
## 18 2 1-1.5l <25 149 25 309 65
## 19 2 1.5-2l <25 66 14 309 65
## 20 2 >2l <25 9 4 309 65
## 21 2 <1l 25-29 139 19 675 131
## 22 2 1-1.5l 25-29 313 51 675 131
## 23 2 1.5-2l 25-29 175 46 675 131
## 24 2 >2l 25-29 48 15 675 131
## 25 2 <1l 30-35 151 22 863 122
## 26 2 1-1.5l 30-35 419 49 863 122
## 27 2 1.5-2l 30-35 221 39 863 122
## 28 2 >2l 30-35 72 12 863 122
## 29 2 <1l >35 931 87 4806 573
## 30 2 1-1.5l >35 2443 290 4806 573
## 31 2 1.5-2l >35 1110 143 4806 573
## 32 2 >2l >35 322 53 4806 573
## 33 3 <1l <25 35 5 119 26
## 34 3 1-1.5l <25 53 10 119 26
## 35 3 1.5-2l <25 24 8 119 26
## 36 3 >2l <25 7 3 119 26
## 37 3 <1l 25-29 73 11 335 56
## 38 3 1-1.5l 25-29 155 24 335 56
## 39 3 1.5-2l 25-29 78 19 335 56
## 40 3 >2l 25-29 29 2 335 56
## 41 3 <1l 30-35 89 10 493 79
## 42 3 1-1.5l 30-35 240 37 493 79
## 43 3 1.5-2l 30-35 121 24 493 79
## 44 3 >2l 30-35 43 8 493 79
## 45 3 <1l >35 648 67 3220 392
## 46 3 1-1.5l >35 1635 187 3220 392
## 47 3 1.5-2l >35 692 101 3220 392
## 48 3 >2l >35 245 37 3220 392
## 49 4 <1l <25 20 2 72 14
## 50 4 1-1.5l <25 31 7 72 14
## 51 4 1.5-2l <25 18 5 72 14
## 52 4 >2l <25 3 0 72 14
## 53 4 <1l 25-29 33 5 169 28
## 54 4 1-1.5l 25-29 81 10 169 28
## 55 4 1.5-2l 25-29 39 7 169 28
## 56 4 >2l 25-29 16 6 169 28
## 57 4 <1l 30-35 40 4 255 50
## 58 4 1-1.5l 30-35 122 22 255 50
## 59 4 1.5-2l 30-35 68 16 255 50
## 60 4 >2l 30-35 25 8 255 50
## 61 4 <1l >35 316 36 1498 234
## 62 4 1-1.5l >35 724 102 1498 234
## 63 4 1.5-2l >35 344 63 1498 234
## 64 4 >2l >35 114 33 1498 234
The function summarize
is a function of the plyr
package that computes its followed arguments, like totHolders=sum(Holders), totClaims=sum(Claims)
in the above example, for each subgroup specified in .variables
.
Of course, you can use functions other than summarize
. The following example computes the mean absolute deviation for each district.
# a brute-force way is the following.
# note that the "ddply" first subsets the data and give the subset to a function as its argument.
# For example, a subset will look like this:
subset(Insurance, District == 1)
## District Group Age Holders Claims
## 1 1 <1l <25 197 38
## 2 1 <1l 25-29 264 35
## 3 1 <1l 30-35 246 20
## 4 1 <1l >35 1680 156
## 5 1 1-1.5l <25 284 63
## 6 1 1-1.5l 25-29 536 84
## 7 1 1-1.5l 30-35 696 89
## 8 1 1-1.5l >35 3582 400
## 9 1 1.5-2l <25 133 19
## 10 1 1.5-2l 25-29 286 52
## 11 1 1.5-2l 30-35 355 74
## 12 1 1.5-2l >35 1640 233
## 13 1 >2l <25 24 4
## 14 1 >2l 25-29 71 18
## 15 1 >2l 30-35 99 19
## 16 1 >2l >35 452 77
# and then, this subset becomes the "x" in the following function that we will use:
computeMAD = function(x) {
holderMAD = mean(abs(x$Holders - mean(x$Holders)))
return(holderMAD)
}
# look at the result of applying the "ddply" function:
ddply(.data = Insurance, .variables = ~District, .fun = computeMAD)
## District V1
## 1 1 620.2188
## 2 2 404.9688
## 3 3 274.2109
## 4 4 126.2656
# if you don't like the "V1" label, we can modify the above code as follows:
computeMAD = function(x) {
holderMAD = mean(abs(x$Holders - mean(x$Holders)))
return(data.frame(MADofHolders=holderMAD))
}
ddply(.data = Insurance, .variables = ~District, .fun = computeMAD)
## District MADofHolders
## 1 1 620.2188
## 2 2 404.9688
## 3 3 274.2109
## 4 4 126.2656
The following code seems to return the same result, but we will see the difference later.
# The same result can be obtained by using "summarize" and a more general function:
computeMAD = function(x) {
mad = mean(abs(x-mean(x)))
return(mad)
}
ddply(.data = Insurance, .variables = ~District, .fun = summarize, MADofHolders=computeMAD(Holders))
## District MADofHolders
## 1 1 620.2188
## 2 2 404.9688
## 3 3 274.2109
## 4 4 126.2656
data.table
allows to do the same thing by a simple syntax. The rule is the following:
nameOftheDataTable[,functionToUse,by=groupingVariable]
For example, the following two codes provide the same result.
ddply(.data = Insurance, .variables = .(District), .fun = summarize, meanHolders=mean(Holders), meanClaims=mean(Claims))
## District meanHolders meanClaims
## 1 1 659.0625 86.3125
## 2 2 415.8125 55.6875
## 3 3 260.4375 34.5625
## 4 4 124.6250 20.3750
dtInsurance[,.(meanHolders=mean(Holders), meanClaims=mean(Claims)), by=District]
## District meanHolders meanClaims
## 1: 1 659.0625 86.3125
## 2: 2 415.8125 55.6875
## 3: 3 260.4375 34.5625
## 4: 4 124.6250 20.3750
Sometimes we want to transform the data.frame
into the matrix
or the array
form. This is done by daply
, where the prefix da
means transforming data.frame
to array
. In fact, the name ddply
means transformation from data.frame
to another data.frame
.
Below is an example of transforming a data.frame
into a matrix
.
computeTotHolders = function(x) {
return(sum(x$Holders))
}
result = daply(.data = Insurance, .variables = .(District, Age), .fun = computeTotHolders)
result
## Age
## District <25 25-29 30-35 >35
## 1 638 1157 1396 7354
## 2 309 675 863 4806
## 3 119 335 493 3220
## 4 72 169 255 1498
result[3,2]
## [1] 335
Note that the “summarize” function would work badly here. To see why, let’s use summarize
:
result = daply(.data = Insurance, .variables = .(District, Age), .fun = summarize, sum(Holders))
result
## Age
## District <25 25-29 30-35 >35
## 1 638 1157 1396 7354
## 2 309 675 863 4806
## 3 119 335 493 3220
## 4 72 169 255 1498
result[3,2]
## [[1]]
## [1] 335
Note that result[3,2]
returned a list with one sub-variable. It is not a number, and so the usual operations will not work:
335 + 1
## [1] 336
result[3,2] + 1
## Error in result[3, 2] + 1: non-numeric argument to binary operator
Other functions like adply
, aaply
, etc. are also useful sometimes, but I skip them here. To see the explanation and the examples of the functions, type ?adply
, ?aaply
, etc., in R. Or, alternatively, type something like plyr in r
at google, and you will see many webpages that explain how to use plyr
.
Often, you want to transform a data.frame into the corresponding latex code. There should be many ways to do it, and one way to do it is to use the xtable
package. First, let’s install the package:
install.packages("xtable")
To convert a data.frame
into a latex code, you do the following.
# let's use this data frame as an example.
sampleDF = ddply(.data = Insurance, .variables = ~District, .fun = summarize, MADofHolders=computeMAD(Holders))
sampleDF
## District MADofHolders
## 1 1 620.2188
## 2 2 404.9688
## 3 3 274.2109
## 4 4 126.2656
# load library
library(xtable)
# use "xtable" function
xtable(sampleDF)
## % latex table generated in R 3.3.3 by xtable 1.8-2 package
## % Thu Jan 11 09:17:30 2018
## \begin{table}[ht]
## \centering
## \begin{tabular}{rlr}
## \hline
## & District & MADofHolders \\
## \hline
## 1 & 1 & 620.22 \\
## 2 & 2 & 404.97 \\
## 3 & 3 & 274.21 \\
## 4 & 4 & 126.27 \\
## \hline
## \end{tabular}
## \end{table}
# the following returns the same result
print(xtable(sampleDF), type = "latex")
## % latex table generated in R 3.3.3 by xtable 1.8-2 package
## % Thu Jan 11 09:17:30 2018
## \begin{table}[ht]
## \centering
## \begin{tabular}{rlr}
## \hline
## & District & MADofHolders \\
## \hline
## 1 & 1 & 620.22 \\
## 2 & 2 & 404.97 \\
## 3 & 3 & 274.21 \\
## 4 & 4 & 126.27 \\
## \hline
## \end{tabular}
## \end{table}
The xtable
function can also generate an html table.
print(xtable(sampleDF), type = "html")
## <!-- html table generated in R 3.3.3 by xtable 1.8-2 package -->
## <!-- Thu Jan 11 09:17:30 2018 -->
## <table border=1>
## <tr> <th> </th> <th> District </th> <th> MADofHolders </th> </tr>
## <tr> <td align="right"> 1 </td> <td> 1 </td> <td align="right"> 620.22 </td> </tr>
## <tr> <td align="right"> 2 </td> <td> 2 </td> <td align="right"> 404.97 </td> </tr>
## <tr> <td align="right"> 3 </td> <td> 3 </td> <td align="right"> 274.21 </td> </tr>
## <tr> <td align="right"> 4 </td> <td> 4 </td> <td align="right"> 126.27 </td> </tr>
## </table>
You can type ?xtable
for more options in this function.
Now we will move to the next topic, an introduction to the ggplot2
package.
The ggplot2
package is a package for plotting, as its name suggests. What it does is best explained by examples, and so let’s look at the examples.
We use the mtcars
dataset in the datasets
package which is preinstalled with base R.
library(datasets)
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## wt2
## Mazda RX4 6.864400
## Mazda RX4 Wag 8.265625
## Datsun 710 5.382400
## Hornet 4 Drive 10.336225
## Hornet Sportabout 11.833600
## Valiant 11.971600
## Duster 360 12.744900
## Merc 240D 10.176100
## Merc 230 9.922500
## Merc 280 11.833600
## Merc 280C 11.833600
## Merc 450SE 16.564900
## Merc 450SL 13.912900
## Merc 450SLC 14.288400
## Cadillac Fleetwood 27.562500
## Lincoln Continental 29.419776
## Chrysler Imperial 28.569025
## Fiat 128 4.840000
## Honda Civic 2.608225
## Toyota Corolla 3.367225
## Toyota Corona 6.076225
## Dodge Challenger 12.390400
## AMC Javelin 11.799225
## Camaro Z28 14.745600
## Pontiac Firebird 14.784025
## Fiat X1-9 3.744225
## Porsche 914-2 4.579600
## Lotus Europa 2.289169
## Ford Pantera L 10.048900
## Ferrari Dino 7.672900
## Maserati Bora 12.744900
## Volvo 142E 7.728400
Let’s start from the scatterplot.
install.packages("ggplot2")
library(ggplot2)
ggplot(data=mtcars) + geom_point(aes(x=wt,y=mpg))
The ggplot2
package works by first initializing the plot by ggplot(data=mtcars)
and then specifying various plotting options by adding functions with a +
. The geom_point
function specifies that we want to draw a scatterplot. The aes
argument specifies the variables that represent the x axis and the y axis. aes
is an abbreviation for “aesthetics”.
We can put more options by adding functions by a +
. The following example specifies the labels of the x and the y axes.
ggplot(data=mtcars) + geom_point(aes(x=wt,y=mpg)) + xlab("Weight(lb)") + ylab("Miles per gallon")
In order to avoid typing full command every time, the ggplot2
package allows you to save a plot in a variable and combine the variable with additional options with a +
.
gg = ggplot(data=mtcars) + geom_point(aes(x=wt,y=mpg))
gg
gg + xlab("Weight(lb)") + ylab("Miles per gallon")
Also, you can specify the aesthetics when you initialize ggplot.
gg = ggplot(data=mtcars, aes(x=wt,y=mpg))
gg + geom_point()
The aesthetics do not only specify the variables but also specify other things.
ggplot(data=mtcars) + geom_point(aes(x=wt,y=mpg,color=cyl))
# the function "factor" changes real numbers into categorical variables.
ggplot(data=mtcars) + geom_point(aes(x=wt,y=mpg,color=factor(cyl)))
The ggplot2
package has a very detailed online documentation in http://docs.ggplot2.org/ with tons of examples like the above. For example, look at http://docs.ggplot2.org/current/geom_point.html for the documentation of the geom_point
.
You may feel that the fonts and the legends are too small. In order to change them, you need to modify the theme of the ggplot. You can do it for each plot by + theme(...)
(look at the online documentation http://docs.ggplot2.org/current/theme.html). To permanently change the theme, use the theme_update
function:
# enlarges ggplot2 font and legend
theme_update(axis.text=element_text(colour="grey50",size=14),
axis.title=element_text(size=14),
strip.text=element_text(size=14), # these three lines are for larger axes labels and annotations
legend.position="right", # this line specifies the position of the legend
legend.key.size=grid::unit(2.5,"lines"),
legend.title=element_text(size=14,face="bold"),
legend.text=element_text(size=14)) # these three lines are for larger legend texts
ggplot(data=mtcars) + geom_point(aes(x=wt,y=mpg,color=factor(cyl)))
To save your last displayed plot, you can use the ggsave
function. You can specify the file extension, and DO NOT save in the bitmap image like .jpg
or .bmp
since the bitmaps suck when you scale your figures to insert it in your writeup. Save your plot in the vector image like .pdf
.
# the basic command
ggsave("myplot.pdf")
# you can also specify the size.
ggsave("myplot.pdf", width=8, height=5)
The file will be saved in your “working directory”. Before saving the file, specify your working directory by the setwd
function.
setwd("C:/GoogleDrive/MyWorkingDirectory/")
We finish our session by looking at more geom_xxx
functions. Once you know the name of the function, you can google geom_xxx ggplot2
and find the ggplot2 online documentation which explains the function way better than me.
ggplot(data=economics) + geom_line(aes(x=date, y=unemploy))
gg = ggplot(data=economics) + geom_line(aes(x=date, y=unemploy), size=1.2)
gg + ggtitle("Number of unemployed in the U.S. (in thousands)")
gg + geom_smooth(aes(x=date,y=unemploy))
## `geom_smooth()` using method = 'loess'
gg + geom_smooth(aes(x=date,y=unemploy), method="lm", se=FALSE)
gg = ggplot(data = Insurance) + geom_point(aes(Holders, Claims))
gg
gg + facet_grid(. ~ District)
gg + scale_x_log10() + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
gg + scale_x_continuous(trans = "log") + scale_y_continuous(trans = "log")
## Warning: Transformation introduced infinite values in continuous y-axis
gg = ggplot(data=mtcars)
gg + geom_density(aes(x=mpg))
gg + geom_density(aes(x=mpg), size=1.2)
gg + geom_density(aes(x=mpg, color=factor(gear)), size=1.2)
gg + geom_density(aes(x=mpg, color=factor(gear), linetype=factor(gear)), size=1.2)
gg + geom_density(aes(x=mpg, color=factor(gear), linetype=factor(gear), fill=factor(gear)), size=1.2)
gg + geom_density(aes(x=mpg, color=factor(gear), linetype=factor(gear), fill=factor(gear)), size=1.2, alpha=0.2)
gg = ggplot(data=economics) + geom_path(aes(x=unemploy/pop, y=psavert), size=1.2)
gg
# "geom_hline" does the same thing as "geom_abline" with slope=0.
gg + geom_abline(intercept = mean(economics$psavert), slope = 0) + geom_vline(xintercept = mean(economics$unemploy/(economics$pop)))