Monday, April 26, 2010

Using R for Introductory Statistics, Chapters 1 and 2

I'm working my way through Using R for Introductory Statistics, by John Verzani, a free version of which is available as SimpleR.

Chapter 1

...covers basics of R such as arithmetic, loading libraries and reading data. We also get an introduction to vectors and indexing.
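To give a flavor of the vector and indexing basics, here is a minimal sketch. The vector `x` and its values are made up for illustration and don't come from the book.

```r
# A made-up numeric vector (not from the book)
x <- c(10, 20, 30, 40, 50)

x * 2 + 1          # arithmetic is vectorized, element by element
x[2]               # R indexes from 1, so this is the second element
x[c(1, 5)]         # select several positions at once
x[-1]              # a negative index drops that position
x[x > 25]          # logical indexing keeps elements where the test is TRUE

second <- x[2]
big <- x[x > 25]
```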

Chapter 2: Univariate Data

The book divides data into three types: categorical, discrete numerical and continuous numerical. Other books talk about levels or scales of measurement: nominal (same as categorical), ordinal (rank), interval (arbitrary zero), and ratio (true zero).
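One way these types might map onto R's own data structures is sketched below. The variable names and values are invented for the example, and the mapping is my reading rather than anything the book prescribes.

```r
# Nominal (categorical): labels with no inherent order
sky <- factor(c("clear", "cloudy", "clear", "partly.cloudy"))

# Ordinal: an ordered factor supports comparisons such as <
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"), ordered = TRUE)
smaller <- size < "large"

# Discrete numerical (counts) and continuous numerical (measurements)
n.kids <- c(0L, 2L, 1L, 3L)
height <- c(1.62, 1.80, 1.75, 1.68)

levels(sky)
is.ordered(size)
```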

The table command tabulates categorical observations.

> table(central.park.cloud)
central.park.cloud
        clear partly.cloudy        cloudy 
           11            11             9

We can use cut to bin numeric data.

> attach(faithful)
> bins = seq(42,109,by=10)
> freqs <- table(cut(waiting,bins))
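One detail worth knowing about cut is that its intervals are right-closed by default. The sketch below repeats the binning without attach and then uses a few made-up values to show what happens at the lowest break.

```r
waiting <- faithful$waiting        # built-in dataset, no attach() needed
bins <- seq(42, 109, by = 10)      # 42, 52, ..., 102
freqs <- table(cut(waiting, bins))
freqs                              # counts per interval such as (42,52]

# Intervals are right-closed by default, so a value equal to the lowest
# break falls outside every bin and becomes NA ...
edge <- cut(c(42, 52, 53), breaks = c(42, 52, 62))
# ... unless include.lowest = TRUE is given.
edge.fixed <- cut(c(42, 52, 53), breaks = c(42, 52, 62), include.lowest = TRUE)
```

Since the shortest waiting time is 43, nothing is lost in the original example, but the lower break has to sit strictly below the minimum for that to hold.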

For summarizing a data series, use the summary command, or its cousin fivenum. Fivenum gives the Tukey five-number summary (minimum, lower hinge, median, upper hinge, maximum). Hinges are the medians of the lower and upper halves of the data, and they differ only slightly from the quartiles.

> summary(waiting)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   43.0    58.0    76.0    70.9    82.0    96.0 
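For the waiting times the hinges and quartiles happen to agree. A small made-up vector, chosen so that they don't, shows the difference, assuming quantile's default interpolation method.

```r
x <- 1:8   # made-up data chosen so hinges and quartiles differ

fivenum(x)                 # hinges are 2.5 and 6.5
quantile(x)                # default quartiles are 2.75 and 6.25

hinges <- fivenum(x)[c(2, 4)]
quartiles <- unname(quantile(x, c(0.25, 0.75)))
```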

The two most common measures of central tendency are mean and median. Variance and standard deviation measure how much variation there is from the mean. They are measures of dispersion or spread.

The standard deviation, the square root of the variance, has the same units as the original data.

I've personally always wondered why we square the differences rather than take their absolute values, as the mean absolute deviation does. Apparently, it's a matter of some debate.
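To make the two approaches concrete, here is a sketch that computes the variance by hand and compares it with the mean absolute deviation. The data are made up so the arithmetic is easy to follow.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data with mean 5

dev <- x - mean(x)

# Sample variance: squared deviations summed and divided by n - 1
v <- sum(dev^2) / (length(x) - 1)
all.equal(v, var(x))             # matches the built-in
all.equal(sqrt(v), sd(x))        # sd is the square root, in the data's units

# Mean absolute deviation: average the absolute deviations instead.
# Note: base R's mad() is the *median* absolute deviation, a different statistic.
mean.abs.dev <- mean(abs(dev))
```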

Other measures of variability or dispersion are quantiles (quantile) and inter-quartile range (IQR).
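A short sketch of both commands on the Old Faithful waiting times. The particular probabilities passed to quantile are arbitrary picks for the example.

```r
waiting <- faithful$waiting

quantile(waiting, c(0.1, 0.5, 0.9))      # any quantiles can be requested
q <- quantile(waiting, c(0.25, 0.75))    # first and third quartiles
iqr <- IQR(waiting)                      # the distance between them
```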

Histograms are a graphical way to look at how data points are distributed over a range. To construct a histogram, we first divide the data into bins. Then, for each bin, we draw a rectangle whose area is proportional to the frequency of data that falls into that bin. Drawing histograms in R is done with the hist command.

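# Draw the axes and bar outlines in gray
par(fg=rgb(0.6,0.6,0.6))
# prob=TRUE plots densities rather than counts, matching the density curve below
hist(waiting, breaks='scott', prob=TRUE,
     col=rgb(0.9,0.9,0.9),
     main='Time between eruptions of Old Faithful',
     ylab=NULL, xlab='minutes')
par(fg='black')
# Overlay a kernel density estimate
lines(density(waiting))
# Mark the mean (solid), the median (dotted) and the mean +/- one sd (dashed)
abline(v=mean(waiting), col=rgb(0.5,0.5,0.5))
abline(v=median(waiting), lty=3, col=rgb(0.5,0.5,0.5))
abline(v=mean(waiting)+sd(waiting), lty=2, col=rgb(0.7,0.7,0.7))
abline(v=mean(waiting)-sd(waiting), lty=2, col=rgb(0.7,0.7,0.7))
# Add a tick along the x-axis for each observation
rug(waiting)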
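The claim that area is proportional to frequency can be checked without drawing anything, since hist returns the numbers behind the plot. A minimal sketch:

```r
h <- hist(faithful$waiting, breaks = "scott", plot = FALSE)

h$breaks                     # bin edges chosen by Scott's rule
h$counts                     # number of observations per bin
widths <- diff(h$breaks)

# On the density scale the bar height is the density, so height * width is
# the proportion of observations in the bin, and the areas add up to 1.
areas <- h$density * widths
```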

Boxplots give another way of viewing the shape of data, one that works well for comparing several distributions, although this example shows only one.

library(UsingR)
attach(alltime.movies)
f = fivenum(Gross)
boxplot(Gross, ylab='all-time gross sales', col=rgb(0.8,0.8,0.8))
text(rep(1.35,5), f, labels=c('minimum', 'lower hinge', 'median', 'upper hinge', 'maximum'), cex=0.6)
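For the several-distributions case, here is a sketch using InsectSprays, a dataset that ships with base R. It isn't from the book; I picked it only because it has one numeric column and one grouping factor.

```r
# InsectSprays ships with base R: insect counts for six sprays (A to F).
# The formula count ~ spray draws one box per spray, side by side.
b <- boxplot(count ~ spray, data = InsectSprays,
             xlab = 'spray', ylab = 'insect count',
             col = rgb(0.8, 0.8, 0.8))

# boxplot() invisibly returns the numbers it drew; each column of
# b$stats holds the five values for one box (whisker ends, hinges, median).
b$names
b$stats
```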

Wednesday, April 21, 2010

Analytics vs Transaction processing

Analytics is driving developments in software engineering these days the same way transaction processing did in the '70s and '80s. Machine learning and data mining are being applied to fields as diverse as social networks, business intelligence and scientific computing, often on big data sets with graph topologies and often in the cloud. These applications are giving rise to new software architectures made from ingredients like map-reduce and NoSQL.

Relational databases are probably the best example of real engineering in software - predictable performance based solidly on theory. But all tools are shaped by the problem they were designed to solve. For relational databases, that problem was transaction processing, as exemplified by the ATM network and the airline reservation system. Typically in these types of apps, small amounts of data are relevant to any given transaction and the transactions are algorithmically uncomplicated. The challenge comes from supporting masses of concurrent updates.

In contrast, imagine the kinds of questions Walmart might want to ask its data.

Who are credit-worthy sports fans with high-end buying habits who haven't recently purchased big-screen TVs?
Find home owners with kids within an age range whose spending patterns are uncorrelated with the business cycle and who have a history of responding to promotions.
What products should be in the seasonal section? Where in the store should that section be located?

The shapes of these questions are very different from traditional online transaction processing. Data for analytics is written once, updated only additively, and mined for patterns. And flexibility is a big deal as new types of data are required. As problem and solution become increasingly mismatched, friction increases. So, it's good that people are experimenting with alternatives.

If OLTP shaped relational databases, it might not be stretching too far to say that object-oriented programming was shaped by graphical user interfaces. At least, the design patterns book is chock-full of GUI examples. So, it's interesting to note the success of the functional-programming inspired map-reduce pattern for distributed computing, as part of Google's search. And search is just a query on unstructured data. These days, map-reduce is being applied to all sorts of problems, particularly in machine learning.
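To make the map-reduce pattern concrete, here is a toy word count using base R's Map and Reduce. It's a single-machine sketch of the idea, not Google's implementation; the documents and the helper names (sum_by_key, map_counts, merge_counts) are made up for the example.

```r
docs <- c("the quick brown fox", "the lazy dog", "the fox")

# Sum the values that share a key; returns a named numeric vector.
sum_by_key <- function(values, keys) {
  vapply(split(values, keys), sum, numeric(1))
}

# Map step: each document is processed independently and emits word counts.
map_counts <- function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  sum_by_key(rep(1, length(words)), words)
}

# Reduce step: combine two partial results by summing counts per word.
merge_counts <- function(a, b) {
  combined <- c(a, b)
  sum_by_key(combined, names(combined))
}

partial <- Map(map_counts, docs)           # in a real system, spread over machines
word_counts <- Reduce(merge_counts, partial)
```

Because the map step touches one document at a time and the reduce step only needs to merge two partial results, both can be distributed, which is what makes the pattern attractive for big data sets.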

In scientific computing, the data deluge is giving rise to data-driven science, as more data becomes available for analysis and more research questions are addressed by machine learning. Even computer science is changing due to big data.

If you like programming with linked data, these are interesting times. There's a lot of creativity swirling around the intersection of distributed computing and storage, big data, machine learning and data mining. Someone once said, “Shape your tools or be shaped by them.” So, after years of being shaped by the transaction processing toolkit, it's refreshing to see a new generation of software tools being shaped by analytics.

Wednesday, April 14, 2010

HTML CSS and JavaScript References

Here's a place for web- and HTML-related reference material.

CSS margin and padding

  • top, right, bottom, left

HTML Entities

Result  Description         Entity Name  Entity Number
        non-breaking space  &nbsp;       &#160;
<       less than           &lt;         &#60;
>       greater than        &gt;         &#62;
&       ampersand           &amp;        &#38;
"       quotation mark      &quot;       &#34;
'       apostrophe          &apos;       &#39;
“       left double quote   &ldquo;      &#8220; (legacy: &#147;)
”       right double quote  &rdquo;      &#8221; (legacy: &#148;)
×       multiplication      &times;      &#215;
÷       division            &divide;     &#247;
©       copyright           &copy;       &#169;

HTML template

<!DOCTYPE html>
<html>

<head>

<title></title>

<link rel="stylesheet" type="text/css" href="style.css" />

</head>

<body>

<h1>Example Page</h1>
<p></p>

</body>
</html>

Style

<link rel="stylesheet" type="text/css" href="style.css" />
<style type="text/css">
.myborder
{
border:1px solid black;
}
</style>

Table

<h2>Table</h2>

<table>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</table>

List

<h2>List</h2>

<ul>
<li><a href=""></a></li>
<li><a href=""></a></li>
</ul>

Thursday, April 01, 2010

Computational Biology Conferences

ISMB 2010
