In weeks 1-3 we installed and opened R, imported data from a text file, and started editing and saving code in source files. This week’s focus is R’s basic plot functions. One of the most important parts of any statistical analysis is communicating the results to other people, and plots are often a very effective way to do this.
- Open R and load the car data from weeks 2 and 3. Name the variables in the car data appropriately (see the week 3 tasks for the variable names).
- Last week we used the table command to see how many cars there are in our data set for each number of cylinders. Now we’ll check out the same information visually, using the command hist to produce a histogram. In the command window, type hist(cardata$cylinders).
- Save this plot by using a sequence of three commands: 1. png("[filename].png"); 2. hist([some variable]); 3. dev.off(). Note that the file name must be in quotes. If you want a pdf file instead of a png image, substitute the command pdf for png and give the file a .pdf extension. Similarly, we will work with various plot commands in addition to hist; just substitute your command in place of hist.
- Figure out how to make the plot pretty: change the default title and axis labels. Hint: use the help(hist) command to see descriptions of the arguments to hist.
- Now make a histogram of the weight variable. Notice that hist chooses a default number of break points, because the weights don’t naturally fall into a nice number of bins like the number of cylinders. Re-do the plot with 20 bins.
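- Suppose we want to investigate how mpg changes over the years. Make a scatterplot of mpg vs year with the plot(cardata$year, cardata$mpg) command. The first argument goes on the horizontal axis and the second on the vertical axis.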
- Bonus: Add a horizontal line to the plot that shows where the overall mean of mpg is. Use the command abline(h=mean(cardata$mpg)). Try to make this line thicker than the default, and try to make it red (default is black).
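The plotting steps above can be sketched roughly as follows. This is only an illustration: the data frame is simulated so the code runs without cars.txt, and the output file names and plot titles are hypothetical choices, not part of the tasks.

```r
# Simulated stand-in for the car data so the sketch runs without cars.txt.
# The values are made up; only the column names match the real data set.
set.seed(1)
cardata = data.frame(
  mpg       = round(runif(50, min = 10, max = 45), 1),
  cylinders = sample(c(4, 6, 8), 50, replace = TRUE),
  weight    = round(runif(50, min = 1600, max = 5100)),
  year      = sample(70:82, 50, replace = TRUE)
)

# Hypothetical output files, placed in R's temporary directory.
hist_file = file.path(tempdir(), "weight_hist.png")
scatter_file = file.path(tempdir(), "mpg_by_year.pdf")

# Histogram of weight with a custom title and axis label.
# breaks = 20 is a suggestion; hist may pick a nearby "pretty" number of bins.
png(hist_file)
hist(cardata$weight, breaks = 20,
     main = "Distribution of car weights", xlab = "Weight")
dev.off()

# Scatterplot of mpg against year, with a thick red line at the mean mpg.
pdf(scatter_file)
plot(cardata$year, cardata$mpg,
     main = "Fuel efficiency by model year",
     xlab = "Year", ylab = "Miles per gallon")
abline(h = mean(cardata$mpg), col = "red", lwd = 3)
dev.off()
```

With the real data, skip the simulated data frame and use the cardata object loaded from cars.txt instead. The col and lwd arguments control the line's colour and width.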
In week 1 we looked at installing and running R, while week 2 focused on importing data. This week we start exploring how to write re-usable R code.
- There are many advantages to doing statistical analysis with a programming language like R. Obviously, we use it to make the computer do what we want, but more importantly R enables us to write reusable code. This means three things: first, we want to be able to come back and run our programs long after we wrote them; second, we want to save time by not re-writing everything from scratch for each analysis; and third, we want our analyses to be reproducible by other researchers.
- Open R, and open the text editor of your choice. If you’re using R on Windows or Mac you can use R’s built-in text editor. On Macs, go to File >> New Document. It should be similar in Windows. You can also use a separate text editor like Notepad (Windows), gedit (Linux), or TextEdit (Mac).
- Enter the code for opening the car MPG data (from week 2; you should have it saved as 'cars.txt') in the new document: cardata = read.table('cars.txt')
- On the next line edit the names of the variables in the cardata data frame: names(cardata) = c('mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name')
- On the next line compute and display a table that shows the unique values in the cylinders column, and how many observations there are for each number of cylinders: print(table(cardata$cylinders)). Note the $ syntax to refer to a variable name in the cardata data frame.
- Save your program with a .R extension. For example, mine is called “mpg_analysis.R”.
- Run your program. Switch back to the R command line and type source([your program name]). For me, this is source('mpg_analysis.R'). You should see the table of cylinder values and counts come up.
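Put together, the workflow above might look like the sketch below. It assumes nothing is available on disk, so it writes a two-row stand-in for cars.txt (made-up values in the same nine-column layout) and the program file itself into R's temporary directory before running source. With your own files you would only need the source line.

```r
# Work in a temporary directory so nothing on disk is overwritten.
old_dir = setwd(tempdir())

# Two made-up rows in the same nine-column layout as cars.txt,
# used only so this sketch runs without the real file.
writeLines(c('20.0 4 100.0 90.0 2200. 15.0 75 1 "example car a"',
             '15.0 8 300.0 150.0 3500. 12.0 70 1 "example car b"'),
           "cars.txt")

# The lines of the program, written to a source file.
writeLines(c(
  "cardata = read.table('cars.txt')",
  "names(cardata) = c('mpg', 'cylinders', 'displacement', 'horsepower',",
  "                   'weight', 'acceleration', 'year', 'origin', 'name')",
  "print(table(cardata$cylinders))"),
  "mpg_analysis.R")

# Run the program from the R command line.
source("mpg_analysis.R")

# Return to the original working directory.
setwd(old_dir)
```

Because source runs the file in the main workspace by default, the cardata object is still available at the command line after the program finishes.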
Hopefully after working through last week’s tasks you have installed R, can open R, and can look up help files for new commands. This week’s tasks are intended to lead you through importing a data set and computing some basic summary statistics.
- Copy the car MPG data at http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data and paste it into a text file. Save the file as “cars.txt”.
- Open R and change the working directory to the location where “cars.txt” is saved. Hint: the setwd() command is what you want for this, and the getwd() command can be used to check that you’re in the right directory. List the files in the directory with the dir() command to make sure “cars.txt” is there.
- Import the data in “cars.txt” into R with the read.table() command: cars = read.table("cars.txt")
- How many rows of data are there? How many variables? Hint: check out the dim() command.
- View the first 10 rows of the cars data.
- The second variable (called “V2” unless you’ve already added column names) has integer values. Figure out what the unique values are for this variable. Hint 1: check out the unique() command. Hint 2: to access an individual column of our data set use the $ operator. For example, to get the unique values of variable 8, you could do unique(cars$V8).
- Bonus: How many rows are there for each value of variable 2?
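As a rough illustration of the commands hinted at above, the sketch below reads three made-up rows laid out like cars.txt, so that it runs without the file. With the real file you would call read.table("cars.txt") after setting the working directory.

```r
# Check where R is currently looking for files, and list what is there.
getwd()
dir()

# Three made-up rows in the layout of cars.txt (illustration only).
example_lines = c('20.0 4 100.0 90.0 2200. 15.0 75 1 "example car a"',
                  '15.0 8 300.0 150.0 3500. 12.0 70 1 "example car b"',
                  '25.0 4 110.0 85.0 2300. 16.0 78 3 "example car c"')
cars = read.table(text = example_lines)

dim(cars)        # number of rows, then number of variables
head(cars, 10)   # first 10 rows (fewer here, since the example is tiny)
unique(cars$V2)  # distinct values of the second variable
table(cars$V2)   # number of rows for each value of the second variable
```

Since no column names are supplied, read.table labels the columns V1 through V9, which is why the second variable is reached as cars$V2.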
This week’s tasks are intended to get you up and running with the statistical programming language R. We are not meeting on Thursday this week, but you are welcome to post blog comments below and/or come to office hours on Friday.
- Install R on your favorite computer. Detailed instructions can be found at the top of the R home page at http://cran.r-project.org/.
- Start R.
- Find two R tutorials on the web. Hint: a good one is on the official R site.
- Find online help for the read.table command. Don’t worry about what the command does, just practice looking up help for R commands. Hint: use Google.
- Figure out how to open help files in the R program itself. Hint: type help(read.table) in R to see the documentation for the read.table command.
- Bonus: load the built-in iris data set.
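For the last two tasks, a minimal sketch might look like this. The interactive check is an added precaution so the help viewer only opens when you are typing at the R prompt; it is not required by the tasks.

```r
# Open the documentation for read.table (two equivalent forms).
if (interactive()) {
  help(read.table)
  ?read.table
}

# Load the built-in iris data set and take a first look at it.
data(iris)
head(iris)   # first six rows
dim(iris)    # number of rows and columns
```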