NB. Much of the following text is simply a rewrite of parts of "An Introduction to R", which I wrote as part of my own learning process. This text is not suitable as a substitute for "An Introduction to R", but can hopefully complement it.
In SPSS the data is always in the same form, with cases as rows and variables as columns. In R, data can take many forms, and you have to make sure that the data are in the form expected by the functions that you apply to the data object(s).
R has one form of object that very closely resembles the organisation of data in SPSS: the data frame. In a data frame rows are cases and columns are variables, just as in SPSS. However, in a data frame the cases have names.
Here is an extract (the first five rows and six columns) from a dataframe where the names are equal to the row number. (<NA> means missing data).
> flyktingenkät[1:5,1:6]
  KODNR   F2     F1     F3                    F4            F6
1     1 1967 kvinna   <NA>                  <NA>          <NA>
2     2 1974 kvinna   <NA> ensamstående med barn     inga barn
3     3 1974    man Afrika        gift utan barn     inga barn
4     4 1967 kvinna Afrika           gift m barn mer än 4 barn
5     5 1972    man   <NA>           gift m barn     inga barn
Here is an example of a dataframe where the rows have human readable names:
> yrken[1:5,1:6]
          YRKESOMR SAMHRANG TYPVÄRDE  SAMHM HÖMELÅ INDRANG
Läkare           2        2        9 8.1515      1       1
Domare           2        3        9 8.1383      1       3
Professor        2        4        9 8.1305      1       4
Advokat          2        5        9 7.9553      1       8
Pilot            3        6        8 7.8125      1       5
The row name is not a variable, and there is no column header for this column. Some functions, e.g. the correspondence analysis function ca() in the package of the same name, need a data frame with named rows. The function for multiple and joint correspondence analysis in the ca package does not use the row names, however.
There are a couple of concepts that one needs to grasp: data.frame, matrix, vector and array.
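As a small sketch of how row names behave, the following builds a data frame with named rows and selects a case by its name. The object jobs and its columns are made up for the illustration and are not part of the data sets used in this text.

```r
## a hypothetical data frame whose rows have human readable names
jobs <- data.frame(rank  = c(2, 3, 4),
                   score = c(8.15, 8.14, 8.13),
                   row.names = c("Doctor", "Judge", "Professor"))

## the row names are not a column, but they can be used to select cases
judge <- jobs["Judge", ]

## to treat the names as an ordinary variable, copy them into a column
jobs$title <- rownames(jobs)
```

Copying the names into a column is handy for functions that ignore row names.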
To construct a vector, use the function c(element1, element2, ..., elementn).
"An Introduction to R" says the following about vectors:
R operates on named data structures. The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers.
Logical vectors have the logical values TRUE, FALSE and NA as elements. When used in arithmetic, FALSE counts as 0 and TRUE as 1.
Note: NA means Not Available (missing), NaN means Not a Number.
In a character vector, the elements are strings.
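The kinds of vector described above can be sketched as follows. All object names are invented for the illustration.

```r
## numeric, logical and character vectors (names are illustrative)
ages   <- c(23, 35, NA, 41)        # numeric vector with one missing value
is.old <- ages > 30                # logical vector; the NA stays NA
cities <- c("Lund", "Uppsala")     # character vector

## in arithmetic TRUE counts as 1 and FALSE as 0
n.old <- sum(is.old, na.rm = TRUE) # counts the TRUE values

## NA is a missing value, NaN is the result of an undefined computation
not.a.number <- 0 / 0
```

Note that is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN.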
By adding [<expression>] after the name of a vector (or an object that evaluates to a vector) you can select a subset of that vector. <expression> must itself be a vector and can be of four types: positive numeric, negative numeric, string or logical. That a vector is of a type means that all elements of that vector are of that type.
If <expression> is positive numeric, the elements of the vector at the positions given by the elements of <expression> are selected.
Example: Select elements 1 and 3 in a vector of strings (note that we use the c() function to construct both the vector "fruit" and the anonymous vector used to select the subset of fruit).
> fruit <- c("orange", "banana", "apple", "peach")
> print(fruit[c(1,3)])
[1] "orange" "apple"
Example: Select a range of elements starting at position 1 and ending at position 3:
> print(fruit[c(1:3)])
[1] "orange" "banana" "apple"
Negative index positions are used for excluding elements.
Example: Select all elements in fruit except those at positions 1 and 3.
> print(fruit[-c(1,3)])
[1] "banana" "peach"
Example: Select all elements in fruit except the range of elements starting at position 1 and ending at position 3.
> print(fruit[-c(1:3)])
[1] "peach"
Elements can be named using the names() function. Named elements are included in a subset (selected) if <expression> is a string vector holding their names.
Example: Name the elements of the fruit vector (here we use the terms of fruits in another language as the name).
> names(fruit) <- c("apelsin", "banan", "äpple", "persika")
> print(fruit)
> print(fruit["banan"])
   banan
"banana"
In this case, print() prints both the name and the value.
If <expression> is a logical vector, it must hold the same number of elements as the vector of values. The selected vector will contain only those elements of the original vector that correspond to TRUE in <expression>.
Example: only include non-missing values.
> fruit <- c("orange", "banana", "apple", "peach", NA, "grapes")
> print(fruit[!is.na(fruit)])
[1] "orange" "banana" "apple"  "peach"  "grapes"
In this example <expression> is something that evaluates to a vector: !is.na(fruit), which returns a vector of TRUE and FALSE values. (Actually, !is.na(fruit) consists of two functions, ! and is.na(), where ! is logical negation and thus reverses each element of the logical vector it is given as an argument.)
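Logical index vectors are often built from comparisons. The sketch below, with made-up values and names, combines a missing-value test with a comparison so that only non-missing values above a threshold are kept.

```r
## hypothetical scores with one missing value
scores <- c(12, 7, NA, 15, 3)

## logical vector of the same length as scores;
## & combines the two conditions element by element
keep <- !is.na(scores) & scores > 5

## only the elements where keep is TRUE are selected
high <- scores[keep]
```

Testing for NA first matters here: a bare scores > 5 would yield NA for the missing element, and an NA in a logical index selects an NA element.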
A data.frame is like a matrix where the columns can be of different kinds. Unlike in SPSS, the types of variables are not limited to numbers, strings and factors. In R, a data frame is a more general structure that can contain variables that are themselves data frames, lists or vectors. For most uses, the data.frame can be seen as the two-dimensional data matrix you are used to from SPSS.
The same principles for selecting object(s) apply to data frames as to vectors, but here you specify two vectors, one for the rows and one for the columns. If you give only one vector (and no comma), it is used as a specification of the columns.
Example: To select all columns but only the first 15 rows of a data.frame called wg93, select by inclusion with a (positive numeric) range.
> wg93[1:15,]
The comma makes R use the given vector as a specification of which rows should be included. Without it, R would have tried to select the first fifteen columns, but since this particular data set only holds seven columns, we would have got an error, like this:
> wg93[1:15]
Error in `[.data.frame`(wg93, 1:15) : undefined columns selected
Example: To select the first 5 columns and the first 15 rows of a data.frame called wg93, select by inclusion with two (positive numeric) ranges.
> wg93[1:15,1:5]
Since columns in a dataframe are named, we can select columns by name, like this:
> wg93[1:5,c("sex","age")]
  sex age
1   2   2
2   1   3
3   2   3
4   1   2
5   1   5
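The selection rules for data frames can be tried out on a small data frame. The object survey below is a made-up stand-in, so that the sketch runs without the wg93 data.

```r
## a hypothetical data frame with four cases and three variables
survey <- data.frame(sex = c(2, 1, 2, 1),
                     age = c(2, 3, 3, 2),
                     edu = c(1, 4, 2, 3))

first.rows  <- survey[1:2, ]                 # rows 1-2, all columns
first.cols  <- survey[1:2]                   # no comma: columns 1-2, all rows
by.name     <- survey[1:2, c("sex", "age")]  # rows by position, columns by name
without.edu <- survey[, -3]                  # negative index drops column 3
```

The only difference between the first two selections is the comma, which decides whether the range refers to rows or to columns.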
If you have a data.frame with a column in it that is really a factor, but is not coded as such by R, e.g. when R imported data from an external source or format, you can coerce it into a factor by using factor().
Example: read.spss imported data from an SPSS file into an R data.frame. A variable that is a factor was not encoded as a factor by read.spss. Here is how to fix that.
## convert to a factor, but keep the value labels attribute from SPSS
temp <- factor(per.individ$UTBILD)
attr(temp, "value.labels") <- attr(per.individ$UTBILD, "value.labels")
## put the factor back into the data.frame
per.individ$UTBILD <- temp
rm(temp)
If the factor has no labels but is ordered, then use ordered():
per.individ$F6_1 <- ordered(per.individ$F6_1)
To change many variables at the same time, refer to them by number and apply ordered() to each column with lapply():
per.individ[10:15] <- lapply(per.individ[10:15], ordered)
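As a sketch of converting several columns in one step, lapply() can apply ordered() to each selected column and the result can be assigned back to the same columns. The data frame answers and its column names are hypothetical.

```r
## hypothetical questionnaire data: id plus two ordinal items
answers <- data.frame(id = 1:4,
                      q1 = c(1, 3, 2, 3),
                      q2 = c(2, 2, 1, 3))

## apply ordered() to each of the selected columns separately
answers[c("q1", "q2")] <- lapply(answers[c("q1", "q2")], ordered)

## str(answers) would now show q1 and q2 as ordered factors
```

Calling ordered() directly on a selection of several columns would hand it a whole data frame rather than one variable at a time, which is why the column-wise lapply() is used in this sketch.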
Unlike in many programming languages, many functions in R accept whole vectors as arguments. This means that rather than using a for loop, you just give the vector of cases to which you want to apply the function. In R, the for loop is rarely needed, for two reasons: it is comparatively slow, and code with for loops only adds unnecessary bloat. There are special functions for applying other functions to lists and vectors (the apply family of functions).
foo <- 1:50000
system.time(for(i in foo){ foo+1 })
user system elapsed
22.529 0.072 22.987
system.time(sapply(foo, function(x) {x+1} ))
user system elapsed
0.172 0.000 0.173
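A note on the timing above: the body of the for loop, foo+1, adds 1 to all 50000 elements in every pass, so the loop does far more work than the sapply() call. The sketch below, using the illustrative names res.loop, res.sapply and res.vec, computes the same result in three ways. It is only meant to show that the vectorised form needs no explicit iteration, not to give benchmark numbers.

```r
foo <- 1:50000

## explicit loop, one element per pass
res.loop <- numeric(length(foo))
for (i in seq_along(foo)) {
  res.loop[i] <- foo[i] + 1
}

## apply-style: the function is called once per element
res.sapply <- sapply(foo, function(x) x + 1)

## vectorised: the arithmetic operator works on the whole vector at once
res.vec <- foo + 1
```

Wrapping each variant in system.time() on your own machine gives a fairer comparison than the one above.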
Setting a value to missing can be done manually with is.na(), but can also be done at data-creation time by giving factor() a vector of values that should be treated as missing (the exclude argument). Here is how to declare the value 9 as missing in an existing variable and at the same time declare that its values are ordered (to declare missing values in unordered factors, use factor()).
yrken$UTBILD <- ordered(yrken$UTBILD, exclude = 9)
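Both routes can be sketched on a small made-up vector in which 9 is the code for "no answer". The names edu, edu.f and edu2 are only for illustration.

```r
## hypothetical education codes, 9 means "no answer"
edu <- c(1, 2, 9, 3, 2, 9)

## route 1: exclude the value when the ordered factor is created;
## 9 becomes NA and is not kept as a level
edu.f <- ordered(edu, exclude = 9)

## route 2: set the values to missing by hand with is.na()
edu2 <- edu
is.na(edu2) <- edu2 == 9
```

Route 2 leaves the variable numeric, which is useful when it is not meant to become a factor.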
Q: How do I "compute" a new variable based on already existing variables (in SPSS, "compute")?
A:
1. Add the new variable to the data.frame, and set it to some default value such as 0 or NA. Below, the variable invbak2 is added to the data.frame sd.1995, and all units get the value 0 on this new variable.
sd.1995[["invbak2"]] <- 0
2. Change the values on the rows that match the conditions you want. (The variables in the condition are referred to by bare name, which assumes that they are accessible that way, e.g. after attach(sd.1995).)
sd.1995[["invbak2"]][which(fodland1 == "Sverige" & pfodland == "Annat land" & mfodland == "Annat land")] <- 1
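The two steps can be tried on a small made-up data frame. The object people and its column names are hypothetical, and the columns are referred to through the data frame so that the sketch does not depend on attach().

```r
## hypothetical data: country of birth of the respondent and the parents
people <- data.frame(born   = c("Sverige", "Sverige", "Annat land"),
                     father = c("Annat land", "Sverige", "Annat land"),
                     mother = c("Annat land", "Annat land", "Annat land"),
                     stringsAsFactors = FALSE)

## step 1: add the new variable with a default value
people$second.gen <- 0

## step 2: change the value on the rows that match the condition
hit <- which(people$born == "Sverige" &
             people$father == "Annat land" &
             people$mother == "Annat land")
people$second.gen[hit] <- 1

## the same in one step; note that rows where the condition is NA
## would get NA here instead of the default 0
people$second.gen2 <- as.numeric(people$born == "Sverige" &
                                 people$father == "Annat land" &
                                 people$mother == "Annat land")
```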
Q: How do I merge levels from a factor variable into a new numeric variable?
A: If the factor levels of the original variable are integers, see the R FAQ; if they are strings, read on. The variable nyfinalsei2 is a factor with 9 levels. As it happens, the numerical representation of the factor levels can be used directly when computing the new variable.
> table(sd.1995[["nyfinalsei2"]])

 Arbetslös utan uppgift om senaste yrke       6
Studerande utan uppgift om senaste yrke       2
                   Ej facklärd arbetare    1138
                      Facklärd arbetare     754
                       Lägre tjänsteman     859
                  Tjänsteman mellannivå    1369
        Högre tjänsteman/akademikeryrke    1108
                        Egen företagare     189
                            Lantbrukare      99
> head(sd.1995[["nyfinalsei2"]])
[1] Facklärd arbetare    Ej facklärd arbetare Lägre tjänsteman
[4] Facklärd arbetare    Lägre tjänsteman     Facklärd arbetare
9 Levels: Arbetslös utan uppgift om senaste yrke ...
> head(as.integer(sd.1995[["nyfinalsei2"]]))
[1] 4 3 5 4 5 4
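Here we compute the new variable and join together a few levels. Note that the condition for level 2 must use the element-wise operator |, not the scalar operator ||.
sd.1995[["ses"]] <- 0
## integer codes of the factor levels, in the order shown by table() above
tmp <- as.integer(sd.1995[["nyfinalsei2"]])
sd.1995[["ses"]][which(tmp >= 7)] <- 1
sd.1995[["ses"]][which(tmp == 5 | tmp == 6)] <- 2
sd.1995[["ses"]][which(tmp == 4)] <- 3
sd.1995[["ses"]][which(tmp <= 3)] <- 4
To inspect the result, tabulate the new variable against the original variable. The output below has no row 2: the levels Lägre tjänsteman and Tjänsteman mellannivå are still at the default value 0. This is what the condition for level 2 produces when it is written with || instead of |. With | as in the code above, these 859 and 1369 cases end up in row 2.
> table(sd.1995[["ses"]], sd.1995[["nyfinalsei2"]])

    Arbetslös utan uppgift om senaste yrke
  0                                      0
  1                                      0
  3                                      0
  4                                      6

    Studerande utan uppgift om senaste yrke Ej facklärd arbetare
  0                                       0                    0
  1                                       0                    0
  3                                       0                    0
  4                                       2                 1138

    Facklärd arbetare Lägre tjänsteman Tjänsteman mellannivå
  0                 0              859                  1369
  1                 0                0                     0
  3               754                0                     0
  4                 0                0                     0

    Högre tjänsteman/akademikeryrke Egen företagare Lantbrukare
  0                               0               0           0
  1                            1108             189          99
  3                               0               0           0
  4                               0               0           0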
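The recoding technique can be sketched on a small made-up factor. The names occupation, codes, ses and check, as well as the levels, are invented for the illustration.

```r
## hypothetical factor with four levels in a fixed order
occupation <- factor(c("worker", "clerk", "manager", "clerk", "farmer"),
                     levels = c("worker", "clerk", "manager", "farmer"))

## integer codes follow the order of the levels: worker = 1, ..., farmer = 4
codes <- as.integer(occupation)

## start from NA so that unrecoded cases are easy to spot
ses <- rep(NA_real_, length(codes))
ses[codes == 1] <- 3
ses[codes == 2 | codes == 3] <- 2   # | compares element by element
ses[codes == 4] <- 1

## every original level should fall in exactly one row of the new variable
check <- table(ses, occupation)
```

Using NA rather than 0 as the default is a design choice for this sketch: cases that no condition catches then stand out as missing instead of forming a category of their own.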
Q: How do I sort a data.frame on a variable (column)?
A: Use order() with the $-construct.
Example (foo is the name of the data frame, bar is the name of the variable):
foo[order(foo$bar),]
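A few variations on the same idea, shown on a small made-up data frame (the column name is an addition for the illustration): ascending order, descending order, and sorting on two variables where the second breaks ties in the first.

```r
## hypothetical data frame to sort
foo <- data.frame(bar  = c(3, 1, 2, 1),
                  name = c("c", "b", "d", "a"),
                  stringsAsFactors = FALSE)

asc  <- foo[order(foo$bar), ]                     # ascending on bar
desc <- foo[order(foo$bar, decreasing = TRUE), ]  # descending on bar
two  <- foo[order(foo$bar, foo$name), ]           # bar first, then name
```

order() returns the row positions in sorted order, and the comma after it makes those positions select rows rather than columns.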
Q: How do I sort a data.frame on a variable which is given as an argument to a function?
A: Enclose the expression in [[]] instead of using the $-construct.
Example (foo is the name of the data frame; baz is the second argument to the function and contains the name of the variable to sort on, as a string):
myfunc <- function(data.frame = foo, baz = "rating1") {
  ## this will not work, since $ looks for a column literally named "baz":
  ## data.frame <- data.frame[order(data.frame$baz),]
  ## this works, and the sorted data.frame is returned:
  data.frame[order(data.frame[[baz]]),]
}
The [[]]-construct is generally more useful within functions, when the expression that is used as reference is dynamic (not known at programming time). This construct operates on strings and numbers alike.
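The difference between the two constructs can be sketched as follows. The function sorted.by and the data frame ratings are hypothetical names for the illustration.

```r
## sort a data frame on a column given by name or by position
sorted.by <- function(data, column) {
  data[order(data[[column]]), ]
}

ratings <- data.frame(rating1 = c(3, 1, 2),
                      rating2 = c(10, 30, 20))

by.first  <- sorted.by(ratings, "rating1")  # column given as a string
by.second <- sorted.by(ratings, 2)          # column given as a number

## $ does not evaluate its argument: this looks for a column
## literally named "column" and finds none
missing.column <- is.null(ratings$column)
```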
Q: How do I recode foo[[1]], say all "99" to NA (not available)?
A: foo[which(foo[1] == 99),1] <- NA
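The recoding can be tried on a small made-up data frame. The column names are hypothetical, and the second variant, which addresses the column by name and uses is.na(), is an alternative added for the illustration.

```r
## hypothetical data where 99 is the code for "no answer"
foo <- data.frame(income = c(10, 99, 25, 99),
                  age    = c(99, 30, 41, 52))

## by position, as in the answer above
foo[which(foo[1] == 99), 1] <- NA

## by name, marking the matching values as missing with is.na()
is.na(foo$age) <- which(foo$age == 99)
```

The call to which() makes sure that values that are already NA are simply skipped.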
Q: How do I create a list of all rows (cases) that have missing data in at least one column (in a matrix)?
A: Row-based (slow) solution:
## b is a matrix
which(sapply(1:length(b[,1]), function(x) { length(which(is.na(b[x,]))) } ) > 0)
Column-based solution, much faster if you have many more rows than columns:
my.missing <- function(b) {
  my.union <- vector()
  for(i in 1:length(b[1,])) {
    my.union <- union(my.union, which(is.na(b[,i])))
  }
  my.union
}
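As a possible alternative to the two solutions above, the same question can be answered without an explicit loop, since is.na() works on the whole matrix at once. The matrix b below is made up for the sketch.

```r
## hypothetical 4 x 3 matrix, filled column by column
b <- matrix(c(1, NA,  3,  4,
              5,  6, NA,  8,
              9, 10, 11, 12), ncol = 3)

## rows with at least one missing value: count the NAs per row
rows.with.na <- which(rowSums(is.na(b)) > 0)

## the same using complete.cases(), which flags rows without any NA
rows.with.na2 <- which(!complete.cases(b))

## number of missing values per column
na.per.column <- colSums(is.na(b))
```

Unlike the column-based function above, which lists rows in the order in which they are found, these variants return the row numbers in ascending order.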
Real world example: this one prints information about how many new cases with missing data each column adds, handy when looking for variables that ruin your factor analysis :-)
## adad is a data.frame
## check all columns; to check only a subset, use e.g.
## variable.list <- seq(from = 321, to = 375, by = 3)
variable.list <- 1:length(adad[1,])
my.union <- vector()
for(i in 1:length(variable.list)) {
  old.length <- length(my.union)
  ## index through variable.list so that a subset of columns also works
  missing.in.this.variable <- which(is.na(adad[,variable.list[i]]))
  my.union <- union(my.union, missing.in.this.variable)
  print(paste("variable", variable.list[i], "has",
              length(missing.in.this.variable), "missing cases, of which",
              length(my.union)-old.length, "are new missing cases"))
}
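## import an SPSS file as a data.frame
library(foreign)
read.spss("ylva-hela.sav", to.data.frame=TRUE)
## cross-tabulation and mosaic plot
xtabs(~KON+F8_2, data = yrken)
mosaicplot(xtabs(~KON+F8_2, data = yrken))
## row percentages
prop.table(table(yrken$KON, yrken$F8_2),1)*100
## chi-square test, from a table or from an xtabs object
chisq.test(table(yrken$KON, yrken$F8_2))
chisq.test(xtabs(~KON+F8_2, data = yrken))
## SPSS-style cross table with adjusted standardised residuals
library(gmodels)
CrossTable(yrken$F8_2, yrken$KON, digits=2, prop.r=FALSE, prop.c=FALSE,
           prop.t=FALSE, prop.chisq=FALSE, asresid=TRUE, format="SPSS")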
## enter data by hand in a spreadsheet-like editor, then name the columns
utvardering.pm1606.ht09 <- edit(data.frame())
names(utvardering.pm1606.ht09) <- c("Lärandemål", "Integrering", "Antal timmar",
                                    "Kravnivå", "Examinationsform", "Tydliga krav",
                                    "Hur mycket genusp.", "Lagom genusp.",
                                    "Sammanfattande intryck")
The object insatser holds data which R considers to be strings (character vectors), but which are really numbers with "," used as the decimal point.
tail(insatser[,13])
[1] "1272462,9"  "1279508,26" "1276436"    "1275418"    "1272278"
[6] "1270646"
To convert to numeric, use type.convert: