Saturday, December 27, 2008

Functional Programming

I've harbored a secret desire to learn Haskell for a few years now. Simon Peyton-Jones is one of the key people behind Haskell. His web site at MSR has tons of papers, a tutorial on concurrent programming in Haskell, and a video lecture of A taste of Haskell. There's also a Simon Peyton-Jones podcast at SE-Radio.

What is Haskell

Haskell is a programming language that is

  • purely functional
  • lazy
  • higher order
  • strongly typed
  • general purpose

Why should I care?

Functional programming will make you think differently about programming

  • Mainstream languages are all about state
  • Functional programming is all about values

Whether or not you drink the Haskell Kool-Aid, you'll be a better programmer in whatever language you regularly use
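
The state-versus-values contrast shows up even outside Haskell. Here's a minimal sketch in Python, not Haskell, with function names I made up for illustration: the first function changes the list it's handed, the second returns a new value and leaves its input alone.

```python
# State: the function changes the list it was given.
def add_tax_in_place(prices, rate):
    for i in range(len(prices)):
        prices[i] = round(prices[i] * (1 + rate), 2)


# Values: the function builds a new tuple and changes nothing.
def with_tax(prices, rate):
    return tuple(round(p * (1 + rate), 2) for p in prices)


original = (10.0, 20.0)
taxed = with_tax(original, 0.1)   # original is untouched

mutable = [10.0, 20.0]
add_tax_in_place(mutable, 0.1)    # mutable now holds the taxed prices
```

In the second style, anything holding a reference to the original still sees the original, which is most of what "all about values" buys you.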

I should read a Haskell book or two, and, in related functional goodness, I keep reading how great Practical Common Lisp is. I also need to fulfill my quest to finish SICP. I've read the first three chapters twice, doing the examples once in Scheme and again in OCaml. I've read chapter 4 on interpreters. I need to work through the examples in that chapter and take in the fifth and final chapter.

Sunday, December 21, 2008

Principle of least surprise my ass

Let's start off by saying I'm not anti-Ruby. I like Ruby. Ruby is cool. Matz is cool. But, a while back I was wondering, What is a Ruby code block? My feeble curiosity has been revealed for the half-assed dilettantery it is by Paul Cantrell. Mr. Cantrell chases this question down, grabs it by the scruff of the neck, and wrings it out like a bulldog with a new toy. He also rocks on the piano, by the way.

So in fact, there are no less than seven -- count 'em, SEVEN -- different closure-like constructs in Ruby:
  1. block (implicitly passed, called with yield)
  2. block (&b => f(&b) => yield)
  3. block (&b => b.call)
  4. Proc.new
  5. proc
  6. lambda
  7. method

This is quite a dizzying array of syntactic options, with subtle semantic differences that are not at all obvious, and riddled with minor special cases. It's like a big bear trap for programmers who expect the language to just work. Why are things this way? Because Ruby is:

  1. designed by implementation, and
  2. defined by implementation.

Again, neither I nor Mr. P.C. is bashing Ruby. He shows how to pull off some tasty functional goodness like transparent lazy lists later in the article. Thanks to railspikes for the link.

Saturday, December 20, 2008

Random Introduction to Bioinformatics

I was asked recently to recommend some introductory reading material about bioinformatics, which made me realize I haven't read enough myself. Here's what I came up with plus a few additions.

If you're into stats, both of these are highly regarded, but miles over my head.

Papers

Dr. Larry Ruzzo at UW teaches a Computational Biology course. Some of the links above are from his reading list, particularly the Sean Eddy Primer articles from Nature Biotechnology.

In winter of 2008, some UW CS grad students held a seminar course on data management issues in life sciences. In case that link doesn't stay up forever, here's some of the reading list:

Intro to Biology

Overview on biological data integration

Specific tools and techniques

Also on the subject of data: Dynamic Fusion of Web Data.

Books I wanna read

Finally, here are some books that I haven't read, will probably never get the time to read, but I wish I would read.

Friday, December 12, 2008

Educating Software Developers

Slashdot linked to Bjarne Stroustrup on Educating Software Developers, which follows up on an earlier article, The 'Anti-Java' Professor and the Jobless Programmers. The Anti-Java professor is Robert Dewar at NYU, who coauthored a short paper, Computer Science Education: Where Are the Software Engineers of Tomorrow? They contend that computer science curricula have been dumbed down to counter falling enrollment post-dot-com-crash and partially blame Java, which fosters reliance on libraries and garbage collection. But not all of their critique can be written off as language bigotry. The result?

We are training easily replaceable professionals.

Dewar advocates:

  • mentoring
  • reading code
  • working in groups
  • learning to reuse code

Those sound like solid points to me. One thing the field of medicine really gets right is an emphasis on mentoring. Mentoring is the heart of residency, which depending on specialty can last from 3 to 7 years. By the time a physician graduates from residency, they will have performed hundreds of procedures and seen thousands of patients under the guidance of an attending physician. I've often wished there was more of this in the computing field.

Over the years, I've accumulated a list of topics I wish I'd been exposed to as a CS undergrad.

  • source control
  • command line foo
  • linking and dependency management
  • security
  • software design: interfaces, refactoring, patterns, components, APIs
  • software architecture: design with large components - app servers, databases, message queues, transactions, etc.
  • data modeling: schema design and not just relational DBs. Hierarchical (XML) and graph (OO, RDF) representation
  • models of computation
    • imperative
    • OO
    • functional
    • logical
    • pipes and filters
    • and how they're related.
  • scientific/engineering computing: MATLAB, R
  • probability and statistics, statistical computing, analytics
  • writing

Of course, then my undergraduate degree would have taken 7 years... On second thought, only my Dad would have complained.

What do you wish you'd learned in college? Post a comment!

Tuesday, December 09, 2008

Firefox and bioinformatics

I wondered who else might be working on bioinformatics-related extensions for Firefox besides Firegoose. One interesting project is iHOPerator, which builds on Greasemonkey. And, there's a hint of something to come here.

It seems like there was a flurry of interest around 2005, in the early days of AJAX and mash-ups, which produced biobar along with two now-dead projects - bioFox and NCBI Search Toolbar. Back in those days, Jon Udell asked, How do you design a remixable Web application? Nifty developments like the REST API in EMBL's STRING 8.0 are starting to provide answers.

Bioinformatics as a Queryable Knowledge Map: the Pygr Project

Pygr is a hypergraph database in Python with applications in bioinformatics, written by Christopher Lee, a faculty member at UCLA. There's a 30-minute video of a talk about Pygr and a bunch of other resources on the Lee Lab website and Lee's Thinking Bioinformatics blog.

Thesis: Hypergraphs are a general model for bioinformatics and Python’s core models are already a good model of Bioinformatics Data
  • Sequence: protein and nucleic acid sequences 
  • Mapping / Graphs: alignment, annotation 
  • Attributes: schema, i.e. relations between data 
  • Namespace (import): the ontology of all bioinformatics data 
Pygr aims to show that these Pythonic patterns are a general and scalable solution for bioinformatics.
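
To make the thesis concrete for myself, here's a rough sketch in plain Python. It does not use Pygr or its API; the sequence, the gene name, and the Gene class are all invented for illustration. It only shows how far the built-in sequence, mapping, and attribute protocols go on their own.

```python
# Sequence: a Python string already supports slicing and length.
seq = "ATGGCCATTGTAATGGGCCGC"
codon = seq[0:3]  # the first three bases

# Mapping: a dict relates an interval (start, stop) to an annotation.
annotation = {(0, 21): "hypothetical_gene_1"}

# Graph: a dict of dicts, node -> {neighbor: edge info}.
# Here one edge aligns an interval of seq to an interval of another sequence.
alignment = {("seq", 0, 3): {("other", 5, 8): {"identity": 1.0}}}


# Attributes: relations between objects are ordinary attribute access.
class Gene:
    def __init__(self, name, sequence, start, stop):
        self.name = name
        self.sequence = sequence
        self.start = start
        self.stop = stop

    @property
    def bases(self):
        # Follow the gene -> sequence relation and slice out the interval.
        return self.sequence[self.start:self.stop]


gene = Gene("hypothetical_gene_1", seq, 0, 21)
```

The fourth item, namespace, would be plain old import: data sets named and loaded like modules. I've left that out since it depends on how the data is actually stored.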

The general idea is not entirely different from the data types behind Gaggle, especially in the emphasis on basic data structures without a heavy semantic component.

Dr. Lee is also writing a textbook on probabilistic inference.

Saturday, December 06, 2008

Dynamic Fusion of Web Data

I happened across a very cool project on web data integration at the University of Leipzig. Their paper Dynamic Fusion of Web Data is worth a look. They're working towards a theory of on-the-fly data integration for mashup applications that they refer to as dynamic data fusion. Data integration in mashups is dynamic in that it occurs at runtime. This provides for a pay-as-you-go model, rather than a large up-front semantic mapping task that limits the scalability of traditional data integration methods like data warehouses.

They describe mashups as workflow-like. Do they mean mashups are programmatic as opposed to declarative? In place of SQL, this group's iFuice system uses a scripting language with "set operations (e.g., union, intersection, and difference) and data transformation (e.g., fuse, aggregate) which can be used to post-process query results". Other key features are instance-level mapping and accommodation of structured and unstructured data.
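
I don't know iFuice's actual script syntax, so this is not it. But the flavor of post-processing query results with set operations and a fuse step can be sketched in Python. The two sources, their records, and the fuse function below are all hypothetical.

```python
# Two hypothetical query results, keyed by instance identifier.
source_a = {"p1": {"title": "Paper One"}, "p2": {"title": "Paper Two"}}
source_b = {"p2": {"citations": 12}, "p3": {"citations": 3}}

ids_a, ids_b = set(source_a), set(source_b)

in_either = ids_a | ids_b   # union
in_both = ids_a & ids_b     # intersection
only_in_a = ids_a - ids_b   # difference


def fuse(*sources):
    """Merge records that share an identifier; later sources win on conflicts."""
    fused = {}
    for source in sources:
        for key, record in source.items():
            fused.setdefault(key, {}).update(record)
    return fused


fused = fuse(source_a, source_b)
```

Note that everything here works on instances and their identifiers. Nothing requires agreeing on a schema up front, which is the pay-as-you-go part.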

This definitely gets at what Firegoose is good for - using the web as a channel for structured data - an approach that does for data integration what loose coupling does for software. Firegoose, part of the Gaggle framework, is a toolbar for Firefox that allows data to be exchanged between desktop software and the web. Firegoose can read microformats, call web services, query databases, or even perform nasty dirty screen scraping. Unlike a mashup, data integration in Firegoose and Gaggle requires user participation, although the user never deals with schemas, only instances of the Gaggle data types - mainly lists of identifiers, matrices of numeric data, networks, and tuples. The identifiers serve in a role somewhat analogous to primary keys.

More papers in a similar vein

Tuesday, December 02, 2008

Browsing genomes

I may as well come clean and admit that I'm developing a genome browser. What? Another genome browser? Why? You may well ask these questions. Well, it's a long story. But here is a completely non-exhaustive list of existing genome browsers.

Note: updated in Sept. 2009 to reflect the fact that everyone and their uncle built a genome browser this past couple of years. See Brother, can you spare a genome browser?

Note: updated again in May of 2010 and again in Feb 2011 to add Savant.

Monday, December 01, 2008

UCSC Genome Browser

A while back, I wrote a little hack to download and parse genome data from NCBI, but was flummoxed by NCBI's format for eukaryotes. A couple of local bioinformatics gurus directed me to UCSC as an alternate data source. UCSC's Genome Browser provides a nice interface to its underlying data through a Table Browser. The main genome browser has data for eukaryotes, while archaea (and other prokaryotes) are in a separate project. The Table Browser for the archaeal genome browser is a little tricky to find, but it's there.
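
For what it's worth, parsing a tab-delimited table export takes only a few lines of Python. This is a sketch, not my actual hack: it assumes a header line prefixed with '#', and the column names and rows are made up rather than taken from any real UCSC table.

```python
import csv
import io

# Made-up sample in a tab-separated layout: a header line starting
# with '#', then one row per feature. Not a real UCSC schema.
sample = (
    "#name\tchrom\tstrand\tstart\tend\n"
    "geneA\tchr\t+\t100\t400\n"
    "geneB\tchr\t-\t950\t1500\n"
)


def parse_table(text):
    """Return a list of dicts, one per row, with coordinates as ints."""
    lines = io.StringIO(text.lstrip("#"))  # drop the leading '#' of the header
    rows = []
    for row in csv.DictReader(lines, delimiter="\t"):
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


genes = parse_table(sample)
minus_strand = [g["name"] for g in genes if g["strand"] == "-"]
```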