using-r

Some notes on using R (from the perspective of an SPSS-user)

NB. Much of the following text is simply a rewrite of parts of "An Introduction to R", which I wrote as part of my own learning process. This text is not suitable as a substitute for "An Introduction to R", but can hopefully complement it.

Forms of objects

In SPSS the data are always in the same form, with cases as rows and variables as columns. In R, data can take many forms, and you have to make sure that the data are in the form expected by the functions that you apply to the data object(s).

R has one form of object that very closely resembles the organisation of data in SPSS: the data frame. In a data frame rows are cases and columns are variables, just as in SPSS. However, in a data frame the cases have names.

Here is an extract (the first five rows and six columns) from a dataframe where the names are equal to the row number. (<NA> means missing data).

> flyktingenkät[1:5,1:6]
   KODNR   F2     F1             F3                    F4            F6
1      1 1967 kvinna           <NA>                  <NA>          <NA>
2      2 1974 kvinna           <NA> ensamstående med barn     inga barn
3      3 1974    man         Afrika        gift utan barn     inga barn
4      4 1967 kvinna         Afrika           gift m barn mer än 4 barn
5      5 1972    man           <NA>           gift m barn     inga barn

Here is an example of a dataframe where the rows have human readable names:

> yrken[1:5,1:6]
                          YRKESOMR SAMHRANG TYPVÄRDE  SAMHM HÖMELÅ INDRANG
Läkare                           2        2        9 8.1515      1       1
Domare                           2        3        9 8.1383      1       3
Professor                        2        4        9 8.1305      1       4
Advokat                          2        5        9 7.9553      1       8
Pilot                            3        6        8 7.8125      1       5

The row name is not a variable, and there is no column header for this column. Some functions, e.g. the correspondence analysis function ca() in the package with the same name, need a dataframe with named rows. The function for multiple and joint correspondence analysis in the ca package, mjca(), does not use the row names, however.

There are a few concepts that one needs to grasp: data.frame, matrix, vector and array.
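As a minimal sketch of the difference between the two extracts above, the following builds a small data frame with the default (numbered) row names and then gives the cases human-readable names. The object jobs and its values are invented for illustration; they are not the yrken data.

```r
# Hypothetical toy data, not the yrken data set shown above
jobs <- data.frame(rank  = c(2, 3, 4),
                   score = c(8.15, 8.14, 8.13))
rownames(jobs)             # default names: "1" "2" "3"

# Give the cases human-readable names
rownames(jobs) <- c("Doctor", "Judge", "Professor")
jobs["Judge", ]            # select a case by its row name
jobs["Judge", "score"]     # 8.14
```

Note that the row names live outside the columns: ncol(jobs) is still 2 after the names have been set.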

vectors

To construct a vector, use the function c(element1, element2, ..., elementn).

numerical vectors

An introduction to R says on vectors:

R operates on named data structures. The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers.
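A small sketch of what this means in practice: a numeric vector is built with c(), and arithmetic is then applied to all of its elements at once. The numbers are arbitrary example values.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)   # a numeric vector of five numbers
length(x)                            # 5

y <- 2 * x + 1                       # arithmetic works element by element
y[1]                                 # 21.8
sum(x) / length(x)                   # the mean, same as mean(x)
```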

logical vectors

Vectors with the logical values 'TRUE', 'FALSE', 'NA' as elements. When used in arithmetic, FALSE counts as 0 and TRUE as 1.

Note: NA means Not available (missing), NaN means Not A Number.
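A short sketch of how logical values behave in arithmetic, and of the difference between NA and NaN. The vector answers is an invented example.

```r
answers <- c(TRUE, FALSE, NA, TRUE)
sum(answers, na.rm = TRUE)    # 2: TRUE counts as 1, FALSE as 0
is.na(answers)                # FALSE FALSE TRUE FALSE

0 / 0                         # NaN: the result is not a number
is.na(0 / 0)                  # TRUE: NaN is also treated as missing
is.nan(NA)                    # FALSE: NA is missing, but it is not NaN
```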

character vectors

Elements are strings.

selecting and using subsets of vectors (or more generally, of a data set)

By adding [<expression>] after the name of a vector (or an object that evaluates to a vector) you can select a subset of that vector. <expression> must be a vector itself and can be of four types: positive numerical, negative numerical, string or logical. That a vector is of a type means that all elements of that vector are of that type.

including by position: positive numerical

If <expression> is positive numerical, the elements of the vector at the positions given by the elements of <expression> are selected.

Example: Select elements 1 and 3 in a vector of strings (note that we use the c() function to construct both the vector "fruit" and the anonymous vector used to select the subset of fruit).

> fruit <- c("orange", "banana", "apple", "peach")
> print(fruit[c(1,3)])
[1] "orange" "apple"

Example: Select a range of elements starting at position 1 and ending at position 3:

> print(fruit[c(1:3)])
[1] "orange" "banana" "apple"

excluding by position: negative numerical

Negative index positions are used for excluding elements.

Example: Select all elements in fruit but those elements at position 1 and 3.

> print(fruit[-c(1,3)])
[1] "banana" "peach"

Example: Select all elements in fruit except the range of elements starting at position 1 and ending at position 3.

> print(fruit[-c(1:3)])
[1] "peach"

including by name: string

Elements can be named using the names() function. Named elements are included in a subset (selected) if <expression> is a string vector.

Example: Name the elements of the fruit vector (here we use the terms of fruits in another language as the name).

> names(fruit) <- c("apelsin", "banan", "äpple", "persika")
> print(fruit)
> print(fruit["banan"])
   banan
"banana"

In this case, print() prints both the name and the value.

include elements according to a logical vector: logical

The <expression> should hold the same number of elements as the vector of values (a shorter logical vector is recycled). The selected vector will only contain those elements of the original vector that correspond to 'TRUE' in the <expression>.

Example: only include non-missing values.

> fruit <- c("orange", "banana", "apple", "peach", NA, "grapes")
> print(fruit[!is.na(fruit)])
[1] "orange" "banana" "apple" "peach" "grapes"

In this example <expression> is something that evaluates to a vector: !is.na(fruit), which returns a vector of TRUE and FALSE values. (Actually, !is.na(fruit) consists of two functions: is.na(fruit) returns TRUE for each missing element, and ! is logical negation, which reverses every TRUE and FALSE in the vector it is given.)

data.frame

A data.frame is like a matrix where the columns can be of different kinds. Unlike in SPSS, the types of variables are not limited to numbers, strings and factors. In R the dataframe is a more general structure which can contain variables that themselves are dataframes, lists or vectors. For most uses, the data.frame can be seen as the two-dimensional data matrix you are used to from SPSS.
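To illustrate columns of different kinds, here is a minimal sketch with one numeric, one character and one factor column. The data frame people and its contents are made up for the example.

```r
# Hypothetical example data
people <- data.frame(
  age  = c(34, 51, 28),                        # numeric
  name = c("Anna", "Bo", "Cia"),               # character
  sex  = factor(c("woman", "man", "woman")),   # factor
  stringsAsFactors = FALSE                     # keep name as character
)
sapply(people, class)   # one class per column
dim(people)             # 3 rows (cases), 3 columns (variables)
```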

Selecting a subset of a dataframe

The same principles for selecting object(s) apply to dataframes as to vectors, but here you specify two vectors, one for the rows and one for the columns. If you only give one vector, it is used as a specification of the columns.

Example: To select all columns but only the first 15 rows of a data.frame called wg93, use select by including (positive numeric) a range.

> wg93[1:15,]

The comma makes R use the given vector as a specification of which rows should be included. Without it, R would have tried to select the first fifteen columns, but since this particular data set only holds seven columns, we would have got an error, like this:

> wg93[1:15]
Error in `[.data.frame`(wg93, 1:15) : undefined columns selected

Example: To select the first 5 columns and the first 15 rows of a data.frame called wg93, use select by including (positive numeric) a range.

> wg93[1:15,1:5]

Since columns in a dataframe are named, we can select columns by name, like this:

> wg93[1:5,c("sex","age")]
  sex age
1   2   2
2   1   3
3   2   3
4   1   2
5   1   5
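Logical and negative selection work on dataframes too. The sketch below uses a small invented data frame called survey (it only mimics the shape of the extract above, it is not wg93) to pick rows by a condition, much like "select if" in SPSS.

```r
# Hypothetical stand-in data
survey <- data.frame(sex = c(2, 1, 2, 1, 1),
                     age = c(2, 3, 3, 2, 5))

# Rows where sex equals 1, all columns
men <- survey[survey$sex == 1, ]

# Rows where age is above 2, only the age column;
# drop = FALSE keeps the result a data frame
older <- survey[survey$age > 2, "age", drop = FALSE]

# Negative index: every column except the first
no.sex <- survey[, -1, drop = FALSE]
```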

Factors

If you have a data.frame with a column in it that is really a factor, but is not coded as such by R, e.g. when R imported data from an external source or format, you can coerce it into a factor by using factor().

Example: read.spss() imported data from an SPSS file into an R data.frame. A variable that is a factor was not encoded as a factor by read.spss(). Here is how to fix that.

temp <- factor(per.individ$UTBILD)
attr(temp, "value.labels") <- attr(per.individ$UTBILD, "value.labels")
per.individ$UTBILD <- temp
rm(temp)
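A self-contained sketch of the same idea, assuming the imported variable arrived as plain numeric codes. The vector edu and the labels are invented for the example.

```r
# Numeric codes as they might arrive from an import (hypothetical data)
edu <- c(1, 2, 2, 3, 1)

# Turn the codes into a factor and attach labels to the levels
edu.f <- factor(edu, levels = 1:3,
                labels = c("primary", "secondary", "tertiary"))
levels(edu.f)        # "primary" "secondary" "tertiary"
table(edu.f)         # counts per level
as.integer(edu.f)    # the underlying codes: 1 2 2 3 1
```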

If the factor has no labels but is ordered, then use ordered():

per.individ$F6_1 <- ordered(per.individ$F6_1)

To change many variables at the same time, refer to them by number and apply ordered() to each column:

per.individ[10:15] <- lapply(per.individ[10:15], ordered)
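A self-contained sketch of converting several columns in one step, here using lapply() so that ordered() is applied to each column separately. The data frame d is invented for the example.

```r
# Hypothetical data: an id and two rating variables
d <- data.frame(id = 1:4,
                q1 = c(1, 3, 2, 3),
                q2 = c(2, 2, 1, 3))

# Convert columns 2 and 3 to ordered factors, one column at a time
d[2:3] <- lapply(d[2:3], ordered)

sapply(d, is.ordered)   # id FALSE, q1 TRUE, q2 TRUE
levels(d$q1)            # "1" "2" "3"
```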

Unlike in many programming languages, in R you can give whole vectors as arguments to many functions. This means that rather than using a for loop, you just give the vector of cases to which you want to apply the function. In R the for loop is used comparatively rarely, for two reasons: it is often slower than a vectorised call, and code with for loops tends to be longer than it needs to be. There are special functions for applying other functions to lists and vectors (the apply family of functions).

> foo <- 1:50000
> system.time(for(i in foo){ foo+1 })
   user  system elapsed
 22.529   0.072  22.987
> system.time(sapply(foo, function(x) {x+1} ))
   user  system elapsed
  0.172   0.000   0.173

Note that the loop body here computes foo+1 for the whole vector on every pass, so much of the difference comes from that repeated work rather than from the loop construct itself.
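As a small sketch of the three styles side by side (no timings claimed), the following computes the same result with a vectorised expression, with sapply() and with an explicit loop.

```r
foo <- 1:10

# Vectorised: the function is applied to the whole vector at once
a <- foo + 1

# The same with sapply(), one element at a time
b <- sapply(foo, function(x) x + 1)

# The same with an explicit loop, filling a pre-allocated vector
cc <- numeric(length(foo))
for (i in seq_along(foo)) {
  cc[i] <- foo[i] + 1
}

identical(a, b)    # TRUE
identical(a, cc)   # TRUE
```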

Missing data

There are two facilities related to missing values: NA and missing(). NA is what R returns when there is no value to return. missing() is not about data at all: missing(x) returns TRUE if x was not given as an argument to the function within which missing() is called.

Setting a value to missing can be done manually with is.na(), but can also be done at data-creation time by giving factor() a vector of values that should be treated as missing (the exclude argument). Here is how to define a value of 9 as missing in an existing factor whose values are ordered (to declare missing values in unordered factors, use factor()).

yrken$UTBILD <- ordered(yrken$UTBILD, exclude = 9)
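The pieces above can be tried out on invented data. In this sketch the vectors edu and x and the function greet() are made up; only ordered(), is.na() and missing() come from R itself.

```r
# Invented data: 9 is the code for "no answer"
edu <- c(1, 2, 9, 3, 9, 2)
edu.o <- ordered(edu, exclude = 9)   # 9 becomes NA and is not kept as a level
levels(edu.o)                        # "1" "2" "3"
sum(is.na(edu.o))                    # 2

# Setting chosen elements to missing by hand
x <- c(5, 7, 99, 3)
is.na(x) <- which(x == 99)
x                                    # 5 7 NA 3
mean(x, na.rm = TRUE)                # 5

# missing() tests whether an argument was supplied, not whether data are NA
greet <- function(name) {
  if (missing(name)) "no name given" else paste("hello", name)
}
greet()                              # "no name given"
greet("R")                           # "hello R"
```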

FAQ

Q: How do I "compute" a new variable based on already existing variables (in SPSS, "compute")?
A:
1. Add the new variable to the data.frame, and set it to some default value such as 0 or NA. Below, the variable invbak2 is added to the data.frame sd.1995, and all units get the value 0 on this new variable.

sd.1995[["invbak2"]] <- 0

2. Change the values on the rows that match the conditions you want.

sd.1995[["invbak2"]][which(fodland1 == "Sverige" & pfodland == "Annat land" & mfodland == "Annat land")] <- 1
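The same two steps on a tiny invented data frame, so that the result can be checked by eye. The names d, born, father, mother and bg2 are hypothetical; the columns are referred to through d$ so the sketch does not depend on the data frame being attached.

```r
# Hypothetical data: country of birth for the person and the parents
d <- data.frame(born   = c("Sverige", "Sverige", "Annat land"),
                father = c("Annat land", "Sverige", "Annat land"),
                mother = c("Annat land", "Annat land", "Annat land"))

# Step 1: add the new variable with a default value
d[["bg2"]] <- 0

# Step 2: change the value on the rows that match the condition
hit <- which(d$born == "Sverige" &
             d$father == "Annat land" &
             d$mother == "Annat land")
d[["bg2"]][hit] <- 1

d$bg2    # 1 0 0
```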

Q: How do I merge levels from a factor variable into a new numeric variable?
A: If the factor levels of the original variable are integers, see the R FAQ; if they are strings, read on. The variable nyfinalsei2 is a factor with 9 levels. As it happens, the numerical representation of the factor levels can be used directly when computing the new variable.

> table(sd.1995[["nyfinalsei2"]])

   Arbetslös utan uppgift om senaste yrke
                                        6
  Studerande utan uppgift om senaste yrke
                                        2
                     Ej facklärd arbetare
                                     1138
                        Facklärd arbetare
                                      754
                         Lägre tjänsteman
                                      859
                    Tjänsteman mellannivå
                                     1369
          Högre tjänsteman/akademikeryrke
                                     1108
                          Egen företagare
                                      189
                              Lantbrukare
                                       99
> head(sd.1995[["nyfinalsei2"]])
[1] Facklärd arbetare       Ej facklärd arbetare    Lägre tjänsteman
[4] Facklärd arbetare       Lägre tjänsteman        Facklärd arbetare
9 Levels: Arbetslös utan uppgift om senaste yrke ...
> head(as.integer(sd.1995[["nyfinalsei2"]]))
[1] 4 3 5 4 5 4

Here we compute the new variable and join together a few levels.

sd.1995[["ses"]] <- 0
tmp <- as.integer(sd.1995[["nyfinalsei2"]])
sd.1995[["ses"]][which(tmp >= 7)] <- 1
sd.1995[["ses"]][which(tmp == 5 | tmp == 6)] <- 2
sd.1995[["ses"]][which(tmp == 4)] <- 3
sd.1995[["ses"]][which(tmp <= 3)] <- 4

To inspect the result, tabulate the new variable against the original variable.

table(sd.1995[["ses"]], sd.1995[["nyfinalsei2"]])

    Arbetslös utan uppgift om senaste yrke
  1                                      0
  2                                      0
  3                                      0
  4                                      6

    Studerande utan uppgift om senaste yrke Ej facklärd arbetare
  1                                       0                    0
  2                                       0                    0
  3                                       0                    0
  4                                       2                 1138

    Facklärd arbetare Lägre tjänsteman Tjänsteman mellannivå
  1                 0                0                     0
  2                 0              859                  1369
  3               754                0                     0
  4                 0                0                     0

    Högre tjänsteman/akademikeryrke Egen företagare Lantbrukare
  1                            1108             189          99
  2                               0               0           0
  3                               0               0           0
  4                               0               0           0

Q: How do I sort a data.frame on a variable (column)?
A: Use order() with the $-construct.

Example (foo is the name of the dataframe, bar is the name of the variable):

foo[order(foo$bar),]
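A runnable sketch with an invented three-row data frame; foo and bar follow the names used above, while the column id is added only to make the new row order visible.

```r
foo <- data.frame(bar = c(3, 1, 2),
                  id  = c("a", "b", "c"))

asc  <- foo[order(foo$bar), ]                      # ascending on bar
desc <- foo[order(foo$bar, decreasing = TRUE), ]   # descending on bar

as.character(asc$id)    # "b" "c" "a"
as.character(desc$id)   # "a" "c" "b"
```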

Q: How do I sort a data.frame on a variable which is given as an argument to a function?
A: Enclose the expression in [[ ]] instead of using the $-construct.

Example (foo is the name of the dataframe; baz is the second argument to the function and contains the name of the variable to sort on, as a string):

myfunc <- function(data.frame=foo, baz="rating1") {
  ## this will not work, since $ looks for a column literally named "baz":
  ## data.frame <- data.frame[order(data.frame$baz),]
  ## this works:
  data.frame <- data.frame[order(data.frame[[baz]]),]
  ## return the sorted data frame
  data.frame
}

The [[]]-construct is generally more useful within functions, where the expression used as a reference is dynamic (not known when the code is written). This construct accepts strings and numbers alike.
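A minimal sketch of the difference, using an invented data frame with two rating columns. The variable baz holds the column name as a string, as in the function above.

```r
foo <- data.frame(rating1 = c(2, 9, 5),
                  rating2 = c(1, 1, 7))
baz <- "rating1"

foo$baz        # NULL: $ looks for a column literally called "baz"
foo[[baz]]     # 2 9 5: [[ ]] uses the value stored in baz
foo[[2]]       # 1 1 7: a column number works as well
```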

Q: How do I recode foo[[1]], say all values of 99 to NA (not available)?
A: foo[which(foo[1] == 99),1] <- NA

Q: How do I create a list of all rows (cases) that have missing data in at least one column (in a matrix)?
A: Row-based solution (slow):
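The recode can be checked on a small invented data frame. The second half of the sketch is a variation of my own, not part of the answer above: it recodes 99 in every column of a copy at once.

```r
foo <- data.frame(v1 = c(1, 99, 3, 99),
                  v2 = c(99, 2, 2, 2))
bar <- foo                           # a copy for the all-columns variant

# Recode 99 to NA in the first column only
foo[which(foo[1] == 99), 1] <- NA
foo$v1                               # 1 NA 3 NA
foo$v2                               # unchanged: 99 2 2 2

# Variation: recode 99 to NA in all columns of the copy
bar[bar == 99] <- NA
colSums(is.na(bar))                  # v1: 2, v2: 1
```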

## b is a matrix
which(sapply(1:length(b[,1]), function(x) { length(which(is.na(b[x,]))) } ) > 0)

Column-based solution, much faster if you have many more rows than columns:

my.missing <- function(b) {
  my.union <- vector()
  for(i in 1:length(b[1,])) {
    my.union <- union(my.union, which(is.na(b[,i])))
  }
  my.union
}
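Base R also has two shorter routes to the same list of rows, rowSums() on is.na() and complete.cases(). This is a different technique from the two solutions above, sketched here on an invented 4 x 3 matrix.

```r
# Invented matrix, filled column by column: row 2 and row 3 each have one NA
b <- matrix(c(1, NA,  3,  4,
              5,  6, NA,  8,
              9, 10, 11, 12), nrow = 4)

rows.a <- which(rowSums(is.na(b)) > 0)   # rows with at least one NA
rows.b <- which(!complete.cases(b))      # the same, via complete.cases()

rows.a    # 2 3
rows.b    # 2 3
```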

Real world example: this one prints information about how many new cases with missing data each column adds, handy when looking for variables that ruin your factor analysis :-)

## adad is a data.frame
## to look at a subset of the columns only, use e.g.
## variable.list <- seq(from = 321, to = 375, by = 3)
variable.list <- 1:length(adad[1,])
my.union <- vector()
for(i in 1:length(variable.list)) {
 old.length <- length(my.union)
 missing.in.this.variable <- which(is.na(adad[,variable.list[i]]))
 my.union <- union(my.union, missing.in.this.variable)
 print(paste("variable", variable.list[i], "has", length(missing.in.this.variable), "missing cases, of which", length(my.union)-old.length, "are new missing cases"))
}

A sample session

library(foreign)
yrken <- read.spss("ylva-hela.sav", to.data.frame=TRUE)
xtabs(~KON+F8_2, data = yrken)
mosaicplot(xtabs(~KON+F8_2, data = yrken))
prop.table(table(yrken$KON, yrken$F8_2),1)*100
chisq.test(table(yrken$KON, yrken$F8_2))
chisq.test(xtabs(~KON+F8_2, data = yrken))
library(gmodels)
CrossTable(yrken$F8_2, yrken$KON , digits=2, prop.r=FALSE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE, asresid=TRUE, format="SPSS")
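Since the session above needs the original SPSS file, here is a sketch of the base R steps on invented data. The data frame d and its two variables are hypothetical; xtabs(), prop.table() and chisq.test() are used as in the session.

```r
# Invented data: 20 men and 20 women answering yes or no
d <- data.frame(
  sex    = rep(c("man", "woman"), times = c(20, 20)),
  answer = c(rep("yes", 14), rep("no", 6),     # the men
             rep("yes", 8),  rep("no", 12))    # the women
)

tab <- xtabs(~ sex + answer, data = d)   # cross-tabulation
tab
row.pct <- prop.table(tab, 1) * 100      # row percentages
row.pct
test <- chisq.test(tab)                  # chi-square test of independence
test$p.value
```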

Creating a new data-set

utvardering.pm1606.ht09 <- edit(data.frame())
names(utvardering.pm1606.ht09) <- c("Lärandemål", "Integrering", "Antal timmar", "Kravnivå", "Examinationsform", "Tydliga krav", "Hur mycket genusp.", "Lagom genusp.", "Sammanfattande intryck")

Converting text strings with decimal commas to numerical data

The object insatser holds data which R considers to be strings (character vectors), but they are really numbers with "," used as decimal point.

> tail(insatser[,13])
[1] "1272462,9" "1279508,26" "1276436" "1275418" "1272278"
[6] "1270646"

To convert to numerical, use type.convert:
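A minimal sketch of such a conversion, using three of the values shown above as a stand-alone character vector x (a made-up name). The dec argument tells type.convert() which character is the decimal point.

```r
x <- c("1272462,9", "1279508,26", "1276436")

# as.is = TRUE means: never turn strings into factors
num <- type.convert(x, dec = ",", as.is = TRUE)
is.numeric(num)     # TRUE
num[1] + 0.1        # arithmetic now works

# An alternative without type.convert(): swap the comma for a point first
num2 <- as.numeric(sub(",", ".", x, fixed = TRUE))
```

For a column in a data frame, the result would be assigned back to the column in the same way as in the factor examples earlier.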


Last modified: 2007-10-30