10 R packages I wish I knew about earlier
by Yhat
I started using R about 3 years ago. It was slow going at first. R had trickier, less intuitive syntax than the languages I was used to, and it took a while to get accustomed to its nuances. It also wasn't immediately clear to me that the power of the language is bound up with its community and the diverse packages available.
R can be more prickly and obscure than other languages like Python or Java. The good news is that there are tons of packages that provide simple and familiar interfaces on top of Base R. This post is about ten packages I love, use every day, and wish I had known about earlier.
sqldf
install.packages("sqldf")
One of the steepest parts of the R learning curve is the syntax. It took me a while to get over using <- instead of =. I hear people ask all the time, "How do I just do a VLOOKUP?!?" R is great for general data munging tasks, but it takes a while to master. I think it's safe to say that sqldf was my R "training wheels". sqldf lets you perform SQL queries on your R data frames. People coming over from SAS will find it very familiar, and anyone with basic SQL skills will have no trouble using it--sqldf uses SQLite syntax.

library(sqldf)

sqldf("SELECT
         day, avg(temp) as avg_temp
       FROM beaver2
       GROUP BY day;")
#  day avg_temp
#1 307 37.57931
#2 308 37.71308

# beaver1 and beaver2 come with base R
beavers <- sqldf("select * from beaver1
                  union all
                  select * from beaver2;")
head(beavers)
#  day time  temp activ
#1 346  840 36.33     0
#2 346  850 36.34     0
#3 346  900 36.35     0
#4 346  910 36.42     0
#5 346  920 36.55     0
#6 346  930 36.69     0

movies <- data.frame(
  title=c("The Great Outdoors", "Caddyshack", "Fletch", "Days of Thunder", "Crazy Heart"),
  year=c(1988, 1980, 1985, 1990, 2009))
boxoffice <- data.frame(
  title=c("The Great Outdoors", "Caddyshack", "Fletch", "Days of Thunder", "Top Gun"),
  revenue=c(43455230, 39846344, 59600000, 157920733, 353816701))

sqldf("SELECT
         m.*, b.revenue
       FROM movies m
       INNER JOIN boxoffice b
         ON m.title = b.title;")
#               title year   revenue
#1 The Great Outdoors 1988  43455230
#2         Caddyshack 1980  39846344
#3             Fletch 1985  59600000
#4    Days of Thunder 1990 157920733
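Notice the INNER JOIN above drops Crazy Heart (no box office match) and Top Gun (no movie match). If you want to keep every movie and fill in NA where the match is missing, a LEFT JOIN does the trick; here's a quick sketch reusing the data frames defined above:

sqldf("SELECT
         m.*, b.revenue
       FROM movies m
       LEFT JOIN boxoffice b
         ON m.title = b.title;")
# same result as before, plus a row for Crazy Heart with revenue NA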
forecast
install.packages("forecast")
I don't do time series analysis very often, but when I do, forecast is my library of choice. forecast makes it incredibly easy to fit time series models like ARIMA, ARMA, AR, exponential smoothing, etc.

library(forecast)

# mdeaths: Monthly Deaths from Lung Diseases in the UK
fit <- auto.arima(mdeaths)

# customize your confidence intervals
forecast(fit, level=c(80, 95, 99), h=3)
#         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95    Lo 99    Hi 99
#Jan 1980       1822.863 1564.192 2081.534 1427.259 2218.467 1302.952 2342.774
#Feb 1980       1923.190 1635.530 2210.851 1483.251 2363.130 1345.012 2501.368
#Mar 1980       1789.153 1495.048 2083.258 1339.359 2238.947 1198.023 2380.283

plot(forecast(fit), shadecols="oldstyle")

My favorite feature is the resulting forecast plot.
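The same fit-then-forecast workflow applies to the other model families; here's a minimal sketch using exponential smoothing via ets on the same series:

library(forecast)

# fit an exponential smoothing state space model instead of an ARIMA
fit.ets <- ets(mdeaths)

# the forecasting and plotting interface is identical
plot(forecast(fit.ets, h=12))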
plyr
install.packages("plyr")
When I first started using R, I was using basic control structures (for, if, while, etc.) for manipulating data. I quickly learned that this was an amateur move, and that there was a better way to do it.
In R, the apply family of functions is the preferred way to call a function on each element of a list or vector. While Base R has this out of the box, its usage can be tricky to master. I've found the plyr package to be an easy-to-use substitute for the split-apply-combine functionality in Base R.

plyr gives you several functions (ddply, daply, dlply, adply, ldply) following a common blueprint: split a data structure into groups, apply a function to each group, and return the results in a data structure. ddply splits a data frame and returns a data frame (hence the dd). daply splits a data frame and returns an array (hence the da). Hopefully you're getting the idea.

library(plyr)

# split a data frame by Species, summarize it, then convert the results
# into a data frame
ddply(iris, .(Species), summarise, mean_petal_length=mean(Petal.Length))
#     Species mean_petal_length
#1     setosa             1.462
#2 versicolor             4.260
#3  virginica             5.552

# split a data frame by Species, summarize it, then convert the results
# into an array
unlist(daply(iris[,4:5], .(Species), colwise(mean)))
#    setosa.Petal.Width versicolor.Petal.Width  virginica.Petal.Width
#                 0.246                  1.326                  2.026
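To see the naming blueprint at work with another letter pair, here's a minimal sketch of ldply, which takes a list and returns a data frame:

library(plyr)

# l = input is a list, d = output is a data frame
ldply(split(iris$Petal.Length, iris$Species), mean)
#         .id    V1
#1     setosa 1.462
#2 versicolor 4.260
#3  virginica 5.552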
stringr
install.packages("stringr")
I find base R's string functionality extremely difficult and cumbersome to use. Another package written by Hadley Wickham, stringr, provides some much needed string operators in R. Many of base R's string functions return data structures that aren't commonly used in basic analysis; stringr is remarkably easy to use. Nearly all of the functions (and all of the important ones) are prefixed with "str", so they're very easy to remember.

library(stringr)

names(iris)
#[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
names(iris) <- str_replace_all(names(iris), "[.]", "_")
names(iris)
#[1] "Sepal_Length" "Sepal_Width"  "Petal_Length" "Petal_Width"  "Species"

s <- c("Go to Heaven for the climate, Hell for the company.")
str_extract_all(s, "[H][a-z]+ ")
#[[1]]
#[1] "Heaven " "Hell "
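A few more of the str-prefixed functions, as a quick sketch of how consistent the naming is:

library(stringr)

str_trim("   too much whitespace   ")
#[1] "too much whitespace"
str_split("a,b,c", ",")
#[[1]]
#[1] "a" "b" "c"
str_pad("42", width=5, pad="0")
#[1] "00042"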
The database driver package of your choice
install.packages("RPostgreSQL") install.packages("RMySQL") install.packages("RMongo") install.packages("RODBC") install.packages("RSQLite")
Everyone does it when they first start (myself included). You've just written an awesome query in your preferred SQL editor. Everything is perfect--the column names are all snake case, the dates have the right datatype, you finally debugged the "must appear in the GROUP BY clause or be used in an aggregate function" issue. You're ready to do some analysis in R, so you run the query in your SQL editor, copy the results to a csv (or...God forbid... .xlsx), and read them into R. You don't have to do this! R has great drivers for nearly every conceivable database. On the off chance you're using a database which doesn't have a standalone driver (SQL Server), you can always use RODBC.

library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
db <- dbConnect(drv, dbname="ncaa",
                user="YOUR USER NAME", password="YOUR PASSWORD")

q <- "SELECT
        *
      FROM
        game_scores;"
data <- dbGetQuery(db, q)
head(data)
#     id         school  game_date spread school_score   opponent opp_score was_home
#1 45111 Boston College 1985-11-16    6.0           21   Syracuse        41    False
#2 45112 Boston College 1985-11-02   13.5           12 Penn State        16    False
#3 45113 Boston College 1985-10-26  -11.0           17 Cincinnati        24    False
#4 45114 Boston College 1985-10-12   -2.0           14       Army        45    False
#5 45115 Boston College 1985-09-28    5.0           10      Miami        45     True
#6 45116 Boston College 1985-09-21    6.5           29 Pittsburgh        22    False

nrow(data)
#[1] 30932
ncol(data)
#[1] 8

Next time you've got that perfect query written, just paste it into R and execute it using RPostgreSQL, RMySQL, RMongo, RSQLite, or RODBC. In addition to sparing you from hundreds of CSV files sitting around, running the query in R saves you time on both I/O and converting datatypes--dates, times, and datetimes are automatically set to their R equivalents. It also makes your R script reproducible, so you or someone else on your team can easily produce the same results.
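If you want to try this workflow without a database server, RSQLite can run entirely in memory. A minimal sketch (the mtcars table here is just R's built-in dataset, used for illustration):

library(RSQLite)

# a throwaway in-memory SQLite database
con <- dbConnect(SQLite(), ":memory:")

# write an R data frame to a table, then query it back with SQL
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, avg(mpg) as avg_mpg FROM mtcars GROUP BY cyl;")
#  cyl  avg_mpg
#1   4 26.66364
#2   6 19.74286
#3   8 15.10000

dbDisconnect(con)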
lubridate
install.packages("lubridate")
I've never had great luck with dates in R. I've never fully grasped the idiosyncrasies of working with POSIXct/POSIXlt vs. R's Date class. Enter lubridate. lubridate is one of those magical libraries that just seems to do exactly what you expect it to. The functions all have obvious names like year, month, ymd, and ymd_hms. It's similar to Moment.js, for those familiar with JavaScript.

library(lubridate)

year("2012-12-12")
#[1] 2012
day("2012-12-12")
#[1] 12
ymd("2012-12-12")
#1 parsed with %Y-%m-%d
#[1] "2012-12-12 UTC"

Here's a really handy reference card that I found in a paper. It covers just about everything you might conceivably want to do to a date. I've also found this Date Cheat Sheet to be a handy reference.
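Datetimes and date arithmetic are just as friendly. A quick sketch:

library(lubridate)

d <- ymd_hms("2012-12-12 14:30:00")
month(d)
#[1] 12
wday(d, label=TRUE)
#[1] Wed

# date arithmetic reads the way you'd say it out loud
d + days(30)
#[1] "2013-01-11 14:30:00 UTC"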
ggplot2
install.packages("ggplot2")
Another Hadley Wickham package, and probably his most widely known one. ggplot2 ranks high on everyone's list of favorite R packages. It's easy to use and it produces some great looking plots. It's a great way to present your work, and there are many resources available to help you get started (a minimal example follows the list below):
- Elegant Graphics for Data Analysis by Hadley Wickham (Amazon)
- A Rosetta Stone for Excel to ggplot (Yaksis Blog)
- Hadley Wickham ggplot2 Presentation at Google (YouTube)
- R Graphics Cookbook by Winston Chang (Amazon)
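Here's the minimal example promised above, using the built-in iris data:

library(ggplot2)

# scatterplot of petal dimensions, colored by species
ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=Species)) +
  geom_point() +
  ggtitle("Iris petal dimensions by species")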
qcc
install.packages("qcc")
qcc is a library for statistical quality control. Back in the 1950s, the now-defunct Western Electric Company was looking for a better way to detect problems with telephone and electrical lines. They came up with a set of rules to help identify problematic lines. The rules look at the historical mean of a series of data points and, based on the standard deviation, help judge whether a new set of points is experiencing a mean shift.

The classic example is monitoring a machine that produces lug nuts. Let's say the machine is supposed to produce 2.5 inch long lug nuts. We measure a series of lug nuts: 2.48, 2.47, 2.51, 2.52, 2.54, 2.42, 2.52, 2.58, 2.51. Is the machine broken? It's hard to tell, but the Western Electric Rules can help.
library(qcc)

# series of values w/ mean of 10 with a little random noise added in
x <- rep(10, 100) + rnorm(100)

# a test series w/ a mean of 11
new.x <- rep(11, 15) + rnorm(15)

# qcc will flag the new points
qcc(x, newdata=new.x, type="xbar.one")

While you might not be monitoring telephone lines, qcc can help you monitor transaction volumes, visitors or logins on your website, database operations, and lots of other processes.
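And the lug nut example from above, as a sketch: the 2.5 inch center comes from the spec in the story, and qcc estimates the process standard deviation from the data itself.

library(qcc)

# the lug nut measurements from the example above
lug.lengths <- c(2.48, 2.47, 2.51, 2.52, 2.54, 2.42, 2.52, 2.58, 2.51)

# chart individual measurements against the 2.5 inch target
qcc(lug.lengths, type="xbar.one", center=2.5)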
reshape2
install.packages("reshape2")
I always find that the hardest part of any analysis is getting the data into the right format. reshape2 is yet another package by Hadley Wickham, and it specializes in converting data from wide to long format and vice versa. I use it all the time in conjunction with ggplot2 and plyr.

library(reshape2)

# generate a unique id for each row; this lets us go back to wide format later
iris$id <- 1:nrow(iris)

iris.lng <- melt(iris, id=c("id", "Species"))
head(iris.lng)
#  id Species     variable value
#1  1  setosa Sepal.Length   5.1
#2  2  setosa Sepal.Length   4.9
#3  3  setosa Sepal.Length   4.7
#4  4  setosa Sepal.Length   4.6
#5  5  setosa Sepal.Length   5.0
#6  6  setosa Sepal.Length   5.4

iris.wide <- dcast(iris.lng, id + Species ~ variable)
head(iris.wide)
#  id Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1  1  setosa          5.1         3.5          1.4         0.2
#2  2  setosa          4.9         3.0          1.4         0.2
#3  3  setosa          4.7         3.2          1.3         0.2
#4  4  setosa          4.6         3.1          1.5         0.2
#5  5  setosa          5.0         3.6          1.4         0.2
#6  6  setosa          5.4         3.9          1.7         0.4

library(ggplot2)
# plot a histogram for each numeric column in the dataset
p <- ggplot(aes(x=value, fill=Species), data=iris.lng)
p + geom_histogram() +
  facet_wrap(~variable, scales="free")

It's a great way to quickly take a look at a dataset and get your bearings. You can use the melt function to convert wide data to long, and dcast to go from long to wide.
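dcast can also aggregate while it reshapes: when the cast formula collapses multiple rows into one cell, you supply an aggregation function. A quick sketch using iris.lng from above:

library(reshape2)

# dropping id from the formula collapses rows, so aggregate with mean
dcast(iris.lng, Species ~ variable, fun.aggregate=mean)
#     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     setosa        5.006       3.428        1.462       0.246
#2 versicolor        5.936       2.770        4.260       1.326
#3  virginica        6.588       2.974        5.552       2.026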
randomForest
install.packages("randomForest")
This list wouldn't be complete without at least one machine learning package you can impress your friends with. Random Forest is a great algorithm to start with: it's easy to use, it can do supervised or unsupervised learning, it can be used with many different types of datasets, and, most importantly, it's effective! Here's how it works in R.

library(randomForest)

# download Titanic Survivors data
data <- read.table("http://math.ucdenver.edu/RTutorial/titanic.txt", h=T, sep="\t")
# make Survived into a yes/no factor
data$Survived <- as.factor(ifelse(data$Survived==1, "yes", "no"))

# split into a training and test set
idx <- runif(nrow(data)) <= .75
data.train <- data[idx,]
data.test <- data[!idx,]   # idx is logical, so negate it rather than use -idx

# train a random forest
rf <- randomForest(Survived ~ PClass + Age + Sex,
                   data=data.train, importance=TRUE, na.action=na.omit)

# how important is each variable in the model?
imp <- importance(rf)
o <- order(imp[,3], decreasing=T)
imp[o,]
#             no      yes MeanDecreaseAccuracy MeanDecreaseGini
#Sex    51.49855 53.30255             55.13458         63.46861
#PClass 25.48715 24.12522             28.43298         22.31789
#Age    20.08571 14.07954             24.64607         19.57423

# confusion matrix [[True Neg, False Pos], [False Neg, True Pos]]
table(data.test$Survived, predict(rf, data.test), dnn=list("actual", "predicted"))
#      predicted
#actual  no yes
#   no  427  16
#   yes 117 195
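For the unsupervised case mentioned above, omit the response entirely; randomForest then runs in unsupervised mode and builds a proximity matrix you can feed into clustering. A minimal sketch:

library(randomForest)

# no response given, so the forest runs unsupervised
rf.unsup <- randomForest(iris[,1:4], proximity=TRUE)

# proximity[i,j] measures how often rows i and j land in the same terminal node
dim(rf.unsup$proximity)
#[1] 150 150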
Other Resources:
- Learn R Blog
- Learn R via Python
- Tidy Data by Hadley Wickham
- Intro to Plyr on Slideshare
- R Overview
- Example R graphics
- Hadley Wickham's R programming style guide
- Google's R Programming Style Guide
- R Bloggers
- Must have R Packages for Social Scientists by Drew Conway