Stat programming workshop – week 6 tasks

April 2013 update: this post is part 6 of 6 that were designed to help beginning R programmers get up and running with some simple data analyses. They were originally private for a specific course in Summer 2012, but they’re now public in case the tips might be useful for a broader audience. -Brian

In week 6 we look at loops in R and how to avoid them at all costs. If you’ve had some programming experience in other languages like C, C++, or Python, you know that looping is an important part of many programs, but explicit loops in R are often much slower than the vectorized alternatives. When dealing with large data sets it becomes really important to avoid loops so your programs finish in a reasonable amount of time. Fortunately, R has several built-in functions that are very good at replacing loops, although it is tricky to learn to think in R’s “vectorized” fashion.
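As a minimal sketch of what “vectorized” means here, the snippet below squares every element of a vector twice: once with an explicit loop and once with a single vectorized expression. The names x, sq_loop, and sq_vec are made up for illustration, and the vector length is arbitrary.

```r
# Hypothetical example: square each element of a numeric vector.
x = 1:100000

# Loop version: fill a preallocated result one element at a time.
sq_loop = numeric(length(x))
for (i in seq_along(x)) {
    sq_loop[i] = x[i]^2
}

# Vectorized version: the arithmetic operator works on the whole vector at once.
sq_vec = x^2

# Both approaches give the same answer.
all.equal(sq_loop, sq_vec)
```

Wrapping each version in system.time() is an easy way to compare the two on your own machine.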

  1. The first task is to get the average of each row in a matrix. Create a 10 x 5 matrix:
    m = matrix(1:50, nrow=10, byrow=TRUE)
  2. Use a for loop to find the row averages. Here’s how I would do it:
    avg = rep(NA, 10)
    for(i in 1:10) {
        avg[i] = mean(m[i,])
    }
  3. The apply function allows us to avoid the for loop in this case. Check the help file for apply, then use it to find the row means. Make sure the answer matches the result from the for loop.
    avg = apply(m, 1, mean)

    Note the second argument is a 1 (instead of a 2) because we want to apply the function mean to each row of our matrix. Making this a 2 would find the column means.

  4. For common operations, R often has built-in functions that accomplish the task most efficiently. In this case we can use the function rowMeans to do this job.
    avg = rowMeans(m)
  5. The second task is to compute statistics about a data set arranged as a “ragged array”. The cars data set that we’ve used in previous weeks is a good example of this. Suppose we want to find the mean gas mileage for each year. This is a ragged array because there are a different number of rows for each of the years.
  6. Use the table command to get a list of the years and the number of observations for each year.
  7. Think about how you would use a for-loop to find the mean gas mileage for each year. It’s kind of a pain.
  8. The tapply command is more elegant and much faster than the for-loop solution, so let’s use that instead. Look up the command’s help file, and use it to find the mean mileage for each year.
    avg = tapply(cars$mpg, INDEX=cars$year, FUN=mean)
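
For comparison, here is a rough sketch of the for-loop approach from step 7 next to tapply. The course’s cars data isn’t reproduced in this post, so the example uses a small made-up data frame, cars_demo, with the same mpg and year column names. The values are invented purely for illustration.

```r
# Made-up stand-in for the course data set (values are invented).
cars_demo = data.frame(
    year = c(70, 70, 70, 71, 71, 72),
    mpg  = c(18, 15, 21, 25, 27, 30)
)

# Different number of rows per year: this is what makes the array "ragged".
table(cars_demo$year)

# For-loop version: subset the mileage for each year in turn.
years = sort(unique(cars_demo$year))
avg_loop = rep(NA, length(years))
for (i in seq_along(years)) {
    avg_loop[i] = mean(cars_demo$mpg[cars_demo$year == years[i]])
}
names(avg_loop) = years

# tapply version: one call does the grouping and the averaging.
avg_tapply = tapply(cars_demo$mpg, INDEX=cars_demo$year, FUN=mean)

# The two results should agree once names and dimensions are dropped.
all.equal(as.vector(avg_loop), as.vector(avg_tapply))
```

The loop needs its own bookkeeping (the list of years, a preallocated result, and the names), while tapply handles the grouping and labels the result by year for you.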