Thursday, August 25, 2011

String functions in R

Here's a quick cheat-sheet on string manipulation functions in R, mostly cribbed from Quick-R's list of String Functions with a few additional links.

  • substr(x, start=n1, stop=n2)
  • grep(pattern,x, value=FALSE, ignore.case=FALSE, fixed=FALSE)
  • gsub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
  • gregexpr(pattern, text, ignore.case=FALSE, perl=FALSE, fixed=FALSE)
  • strsplit(x, split)
  • paste(..., sep="", collapse=NULL)
  • sprintf(fmt, ...)
  • toupper/tolower(x)
  • nchar(x)

Also see Regular Expressions as used in R and R String processing.

Note: Just to be clear, R is far from an ideal platform for processing text. For anything where that's the major concern, you're better off going to Python or Ruby.

Monday, August 15, 2011

MySQL and R

Using MySQL with R is pretty easy, with RMySQL. Here are a few notes to keep me straight on a few things I always get snagged on.

Typically, most folks are going to want to analyze data that's already in a MySQL database. Being a little bass-ackwards, I often want to go the other way. One reason to do this is to do some analysis in R and make the results available dynamically in a web app, which necessitates writing data from R into a database. As of this writing, INSERT isn't even mentioned in the RMySQL docs, sadly for me, but it works just fine.

The docs are a bit clearer for RS-DBI, which is the standard R interface to relational databases and of which RMySQL is one implementation.

Opening and closing connections

The best way to close DB connections, like you would do in a finally clause in Java, is to use on.exit, like this:

con <- dbConnect(MySQL(),
         user="me", password="nuts2u",
         dbname="my_db", host="localhost")
on.exit(dbDisconnect(con))

Building queries

Using sprintf to build the queries feels a little primitive. As far as I can tell, there's no prepared statements in RMySQL. I don't suppose SQL-injection is a concern here, but prepared statements might be a little tidier, anyway.

Processing query results

You can process query results row by row, in blocks or all at once. The highly useful function dbGetQuery(con, sql) returns all query results as a data frame. With dbSendQuery, you can get all or partial results with fetch.

con <- dbConnect(MySQL(), user="network_portal", password="monkey2us", dbname=db.name, host="localhost")
rs <- dbSendQuery(con, "select name from genes limit 10;")
data <- fetch(rs, n=10)
huh <- dbHasCompleted(rs)
dbClearResult(rs)
dbDisconnect(con)

If there's no more results, fetch returns a data frame with 0 columns and 0 rows. dbHasCompleted is supposed to indicate whether there are more records to be fetched, but seems broken. The value of huh in the code above is false, which seems wrong to me.

Retrieving AUTO_INCREMENT IDs

A standard newbie question with MySQL is how to retrieve freshly generated primary keys from AUTO_INCREMENT fields. That's what MySQL's LAST_INSERT_ID() is for.

You can retrieve the most recent AUTO_INCREMENT value with the LAST_INSERT_ID() SQL function or the mysql_insert_id() C API function. These functions are connection-specific, so their return values are not affected by another connection which is also performing inserts.

The same works with RMySQL, but there are some traps to watch out for. Let's say you're inserting a row into a table of networks. Don't worry about the specifics. You want to insert related data in another table, so you need the ID of the newly inserted row.

create.network <- function(species.id, network.name, data.source, description) {
  
  con <- dbConnect(MySQL(),
           user="super_schmuck", password="nuts2u",
           dbname="my_db", host="localhost")
  on.exit(dbDisconnect(con))

  sql <- sprintf("insert into networks
                  (species_id, name, data_source, description, created_at)
                  values (%d, '%s', '%s', '%s', NOW());",
                 species.id, network.name, data.source, description)
  rs <- dbSendQuery(con, sql)
  dbClearResult(rs)

  id <- dbGetQuery(con, "select last_insert_id();")[1,1]

  return(id)
}

Don't forget to clear the result of the insert. If you do, you'll get 0 from the last_insert_id(). Also, using dbGetQuery for the insert produces an strange error when you go to call last_insert_id:

Error in mysqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not run statement: Commands out of sync; you can't run this command now)

Alternatively, you can also combine both SQL statements into one call to dbSendQuery, but, you have to remember to set a flag when you make the connection: client.flag=CLIENT_MULTI_STATEMENTS. Trying to use multiple queries seems not to work with dbGetQuery.

create.network <- function(species.id, network.name, data.source, description) {

  con <- dbConnect(MySQL(),
           user="super_schmuck", password="nuts2u",
           dbname="my_db", host="localhost",
           client.flag=CLIENT_MULTI_STATEMENTS)
  on.exit(dbDisconnect(con))

  sql <- sprintf("insert into networks
                  (species_id, name, data_source, description, created_at)
                  values (%d, '%s', '%s', '%s', NOW());
                  select last_insert_id();",
                 species.id, network.name, data.source, description)

  rs <- dbSendQuery(con, sql)

  if (dbMoreResults(con)) {
    rs <- dbNextResult(con)
    id <- fetch(rs)[1,1]
  } else {
    stop('Error getting last inserted id.')
  }

  dbClearResult(rs)

  return(id)
}

Any effort saved by combining the SQL queries is lost in the extra house-keeping so I prefer the first method.

In spite of these few quirks, RMySQL generally works fine and is pretty straightforward.

Wednesday, August 03, 2011

Microarrays

Microarrays are one of the workhorses of modern biology. Measuring transcript levels enables studies of differential expression - asking what the difference is, at the gene expression level, for example, between cancer tumor cells and normal cells.

Bruz Marzolf, who up 'til recently ran my local microarray facility, spoke recently, tracing the journey of microarrays through the full technology life-cycle, starting in 1995 with the publication of Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray in Science. Bruz put microarrays in the category of a utility technology, but not quite to the point of commoditization as there remain major differences between manufacturers.

  • Affymetrix, first to commercialize microarray technologies, is the 800 pound gorilla. Their photolithography process borrows from computer chip manufacturing and their standardized probe sets are well supported by tools such as Bioconductor. The technology is robust but producing the masks is quite expensive, thus custom arrays are not economical.
  • Agilent, which spun out of HP, uses ink-jet technology. Custom arrays can be designed using Agilent's eArray software. Agilent arrays come in a variety of resolutions including 8x60k, 1x244k and 1x1m with 60mer probes.
  • Illumina builds arrays out of beads coated in oligo probes. Beads are laid out randomly on the slides, necessitating a layout discovery step. These chips have extra redundancy to account for randomness in bead-probe count.
  • Nimblegen's maskless photolithography process is more flexible for custom arrays. Nimblegen provides arrays in 385K, 4x72K, and 12x135K resolutions using 60mer probes. They emphasize high array-to-array data reproducibility.

As an aside, our group uses custom spotted arrays and Agilent arrays. We tried Nimblegen and found that inter-array consistency was excellent, but inter-probe consistency was not. Below we see the ribosomal RNAs and adjacent genes with total RNA measured by a custom Agilent array (in blue) plotted next to a custom Nimblegen array (in green). To be fair there might be other explanations for what we saw, but it certainly looks like there is significant variability between probes that we would expect to have identical readings.

In RNA-Seq: a revolutionary tool for transcriptomics (Nature Reviews Genetics, 2009), Zhong Wang, Mark Gerstein & Michael Snyder show this comparison between microarrays and RNA-Seq.

While RNA-seq, no doubt, has a higher dynamic range, does it really have less noise? Some folks say so. With tens or hundreds of thousands of probes, fairly dense coverage of whole microbial genomes is possible. If you know what you're looking for, microarrays are still cheaper. Discovery oriented work is going increasingly toward sequencing.

Links