Sunday, December 28, 2008
Saturday, December 27, 2008
I've harbored a secret desire to learn Haskell for a few years now. Simon Peyton Jones is one of the key people behind Haskell. His web site at MSR has tons of papers, a tutorial on concurrent programming in Haskell, and a video lecture of A taste of Haskell. There's also a Simon Peyton Jones podcast at SE-Radio.
What is Haskell
Haskell is a programming language that is
- purely functional
- higher order
- strongly typed
- general purpose
Why should I care?
Functional programming will make you think differently about programming
- Mainstream languages are all about state
- Functional programming is all about values
Whether or not you drink the Haskell Kool-Aid, you'll be a better programmer in whatever language you regularly use.
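To make the state-versus-values contrast a little more concrete, here is a minimal sketch. It uses Python as a stand-in, which is neither pure nor statically typed the way Haskell is, so treat it as an illustration of the mindset only. The function names are made up for this example.

```python
from functools import reduce

# State-centric style: a variable is updated in place, step by step.
def total_length_stateful(sequences):
    total = 0
    for seq in sequences:
        total += len(seq)  # mutates local state on every iteration
    return total

# Value-centric style: the result is one expression built from other values.
def total_length_functional(sequences):
    return reduce(lambda acc, n: acc + n, map(len, sequences), 0)

# Higher order: a function that takes functions and returns a new function.
def compose(f, g):
    return lambda x: f(g(x))

# Built by combining two existing functions, with no new loop or variable.
shout = compose(str.upper, str.strip)
```

Both `total_length_*` functions return the same number. The difference is that the second never reassigns anything, which is roughly the habit Haskell enforces everywhere.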
I should read a Haskell book or two, and, in related functional goodness, I keep reading how great Practical Common Lisp is. I also need to fulfill my quest to finish SICP. I've read the first three chapters twice, doing the examples once in Scheme and again in OCaml. I've read chapter 4 on interpreters. I need to work through the examples in that chapter and take in the final fifth chapter.
Sunday, December 21, 2008
Let's start off by saying I'm not anti-Ruby. I like Ruby. Ruby is cool. Matz is cool. But, a while back I was wondering, What is a Ruby code block? My feeble curiosity has been revealed for the half-assed dilettantery it is by Paul Cantrell. Mr. Cantrell chases this question down, grabs it by the scruff of the neck, and wrings it out like a bulldog with a new toy. He also rocks on the piano, by the way.
So in fact, there are no less than seven -- count 'em, SEVEN -- different closure-like constructs in Ruby:
- block (implicitly passed, called with yield)
- block (&b => f(&b) => yield)
- block (&b => b.call)
This is quite a dizzying array of syntactic options, with subtle semantic differences that are not at all obvious, and riddled with minor special cases. It's like a big bear trap for programmers who expect the language to just work. Why are things this way? Because Ruby is:
- designed by implementation, and
- defined by implementation.
Again, neither I nor Mr. P.C. is bashing Ruby. He shows how to pull off some tasty functional goodness like transparent lazy lists later in the article. Thanks to railspikes for the link.
Saturday, December 20, 2008
- An Introduction to Bioinformatics Algorithms, Jones and Pevzner: a fun and easy read, if you already have a CS background.
- Genetics: From Genes to Genomes
- Algorithms on Strings, Trees and Sequences, by Daniel Gusfield: Who doesn't love string algorithms?
If you're into stats, both of these are highly regarded, but miles over my head.
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin and Eddy
- Statistical Methods in Bioinformatics: An Introduction by Warren J. Ewens and Gregory Grant
- Uri Alon's Network motifs: theory and experimental approaches
- Creating a bioinformatics nation by Lincoln Stein is entertaining reading and may let you know the kind of morass you're getting yourself into.
- Integrating biological databases, also by Lincoln Stein
- An overview of sequence comparison algorithms in molecular biology, Tech. Rep. TR-91-29, Dept. of Computer Science, Univ. of Arizona. Eugene Myers
- SR Eddy, "What is Bayesian statistics?" Nat. Biotechnol., 22, #9 (2004) 1177-8.
- SR Eddy, "What is a hidden Markov model?" Nat. Biotechnol., 22, #10 (2004) 1315-6.
- Foundations for engineering biology, Drew Endy
- Engineering life: Building a fab for biology, from Scientific American, 2006
- Tangentially related and well worth reading is: Software Design Patterns for Information Visualization by Jeffrey Heer
- Executable Cell Biology
In winter of 2008, some UW CS grad students held a seminar course on data management issues in life sciences. In case that link doesn't stay up forever, here's some of the reading list:
Intro to Biology
- Bioinformatics - An Introduction for Computer Scientists
- Life and Its Molecules: A Brief Introduction
Overview on biological data integration
- Data integration and genomic medicine
- Addressing the problems with life-science databases for traditional uses and systems biology
- Integration of biological sources: current systems and challenges ahead
Specific tools and techniques
- BioGuideSRS: Querying Multiple Sources with a user-centric perspective
- Path-based systems to guide life scientists in the maze of biological data sources
- (Almost) Hands-Off Information Integration for the Life Sciences
- Gaggle: (cheers to us!) Gaggle paper and Firegoose paper.
- Bioverse: functional, structural and contextual annotation of proteins and proteomes
- Computational representation of biological systems
- The BioMediator system as a tool for integrating biologic databases on the Web
Also on the subject of data: Dynamic Fusion of Web Data.
Books I wanna read
Finally, here are some books that I haven't read, will probably never get the time to read, but I wish I would read.
- An Introduction to Systems Biology: Design Principles of Biological Circuits by Uri Alon
- Systems Biology: Properties of Reconstructed Networks by Bernhard O. Palsson
- Dynamic Models in Biology by Stephen P. Ellner, John Guckenheimer
- Evolutionary Dynamics: Exploring the Equations of Life by Martin A. Nowak
- System Modeling in Cellular Biology: From Concepts to Nuts and Bolts by Zoltan Szallasi
- The Music of Life: Biology beyond the Genome by Denis Noble
- Bioinformatics and Functional Genomics By Jonathan Pevsner
Friday, December 12, 2008
Slashdot linked to Bjarne Stroustrup on Educating Software Developers which follows up on an earlier article, The 'Anti-Java' Professor and the Jobless Programmers. The Anti-Java professor is Robert Dewar at NYU, who coauthored a short paper, Computer Science Education: Where Are the Software Engineers of Tomorrow? They contend that computer science curricula have been dumbed down to counter falling enrollment post-dot-com-crash and partially blame Java, which fosters reliance on libraries and garbage collection. But, not all of their critique can be written off as language bigotry. The result?
We are training easily replaceable professionals.
- reading code
- working in groups
- learning to reuse code
Those sound like solid points to me. One thing the field of medicine really gets right is an emphasis on mentoring. Mentoring is the heart of residency, which depending on specialty can last from 3 to 7 years. By the time a physician graduates from residency, they will have performed hundreds of procedures and seen thousands of patients under the guidance of an attending physician. I've often wished there was more of this in the computing field.
Over the years, I've accumulated a list of topics I wish I'd been exposed to as a CS undergrad.
- source control
- command line foo
- linking and dependency management
- software design: interfaces, refactoring, patterns, components, APIs
- software architecture: design with large components - app servers, databases, message queues, transactions, etc.
- data modeling: schema design and not just relational DBs. Hierarchical (XML) and graph (OO, RDF) representation
- models of computation
- pipes and filters
- and how they're related.
- scientific/engineering computing: MATLAB, R
- probability and statistics, statistical computing, analytics
Of course, then my undergraduate degree would have taken 7 years... On second thought, only my Dad would have complained.
What do you wish you'd learned in college? Post a comment!
Tuesday, December 09, 2008
I wondered who else might be working on bioinformatics related extensions for Firefox besides Firegoose. One interesting project is iHOPerator, which builds on Greasemonkey. And, there's a hint of something to come here.
It seems like there was a flurry of interest around 2005, in the early days of AJAX and mash-ups, which produced biobar along with two now-dead projects - bioFox and NCBI Search Toolbar. Back in those days, Jon Udell asked, How do you design a remixable Web application? Nifty developments like the REST API in EMBL's STRING 8.0 are starting to provide answers.
Pygr is a hypergraph database in Python with applications in bioinformatics written by Christopher Lee, a faculty member at UCLA. There's a 30-minute video of a talk about Pygr and a bunch of other resources on the Lee Lab website and Lee's thinking bioinformatics blog.
Thesis: Hypergraphs are a general model for bioinformatics and Python’s core models are already a good model of Bioinformatics Data
Pygr aims to show that these Pythonic patterns are a general and scalable solution for bioinformatics.
- Sequence: protein and nucleic acid sequences
- Mapping / Graphs: alignment, annotation
- Attributes: schema, i.e. relations between data
- Namespace (import): the ontology of all bioinformatics data
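As a rough illustration of the first three patterns in that list, here is what they look like with nothing but core Python. This is not Pygr's API. Every name below (`genome`, `alignment`, `Annotation`, the interval labels, the identity score) is hypothetical and chosen only for the example.

```python
# Plain-Python illustration only; none of this is Pygr's actual API.

# Sequence: a string already supports slicing, much like a sequence interval.
genome = "ATGGCGTACGTTAGC"
exon = genome[3:9]

# Mapping / graph: a dict of dicts maps a node to its neighbors, with edge data.
# Here a (made-up) alignment relates an interval of one sequence to another.
alignment = {
    "seqA[0:6]": {"seqB[10:16]": {"identity": 0.83}},
}

# Attributes: ordinary object attributes express relations between data.
class Annotation:
    def __init__(self, name, start, stop):
        self.name = name
        self.start = start
        self.stop = stop

    def sequence(self, source):
        # Slice the annotated interval out of its source sequence.
        return source[self.start:self.stop]

gene = Annotation("hypothetical_gene", 3, 9)
```

The fourth pattern, namespace, corresponds to Python's own import mechanism and is left out here because it needs more than a single file to show.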
The general idea is not entirely different from the data types behind Gaggle, especially in the emphasis on basic data structures without a heavy semantic component.
Dr. Lee is also writing a textbook on probabilistic inference.
Saturday, December 06, 2008
I happened across a very cool project on web data integration at the University of Leipzig. Their paper Dynamic Fusion of Web Data is worth a look. They're working towards a theory of on-the-fly data integration for mashup applications that they refer to as dynamic data fusion. Data integration in mashups is dynamic in that it occurs at runtime. This provides for a pay-as-you-go model, rather than a large up-front semantic mapping task that limits the scalability of traditional data integration methods like data warehouses.
They describe mashups as workflow-like. Do they mean mashups are programmatic as opposed to declarative? In place of SQL, this group's iFuice system uses a scripting language with "set operations (e.g., union, intersection, and difference) and data transformation (e.g., fuse, aggregate) which can be used to post-process query results". Other key features are instance-level mapping and accommodation of structured and unstructured data.
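To get a feel for what scripted, instance-level post-processing might look like, here is a small Python sketch of the operations named in that quote. It is not iFuice syntax, which I haven't reproduced here. The two sources, their identifiers, and the `fuse` helper are all invented for illustration.

```python
# Hypothetical query results from two sources, keyed by a shared identifier.
source_a = {"P1": {"title": "Paper one"}, "P2": {"title": "Paper two"}}
source_b = {"P2": {"citations": 12}, "P3": {"citations": 4}}

ids_a, ids_b = set(source_a), set(source_b)

# Set operations on the instances returned by each query.
in_both = ids_a & ids_b      # intersection
in_either = ids_a | ids_b    # union
only_in_a = ids_a - ids_b    # difference

def fuse(left, right):
    """Merge the attributes of instances that share an identifier."""
    return {key: {**left.get(key, {}), **right.get(key, {})}
            for key in set(left) | set(right)}

fused = fuse(source_a, source_b)

# Aggregate: collapse the fused records into a single value.
total_citations = sum(rec.get("citations", 0) for rec in fused.values())
```

Note that no schema is declared anywhere. The mapping between sources exists only as matching identifiers on individual instances, which is the pay-as-you-go idea in miniature.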
This definitely gets at what Firegoose is good for - using the web as a channel for structured data - an approach that does for data integration what loose coupling does for software. Firegoose, part of the Gaggle framework, is a toolbar for Firefox that allows data to be exchanged between desktop software and the web. Firegoose can read microformats, call web services, query databases, or even perform nasty dirty screen scraping. Unlike a mashup, data integration in Firegoose and Gaggle requires user participation, although the user never deals with schemas, only instances of the Gaggle data types - mainly lists of identifiers, matrices of numeric data, networks, and tuples. The identifiers serve in a role somewhat analogous to primary keys.
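Here is a loose Python sketch of how identifiers can play that primary-key role across simple data types. These classes are my own simplification for the sake of the example, not Gaggle's actual classes or wire format, and the gene names are placeholders.

```python
from dataclasses import dataclass, field

# Simplified, hypothetical stand-ins for three of the data types.
@dataclass
class NameList:
    names: list

@dataclass
class Matrix:
    row_names: list
    column_names: list
    values: list  # one list of numbers per row

@dataclass
class Network:
    edges: list = field(default_factory=list)  # (source, target) pairs

def rows_for(name_list, matrix):
    """Select the matrix rows whose identifier appears in the name list."""
    wanted = set(name_list.names)
    return {name: row
            for name, row in zip(matrix.row_names, matrix.values)
            if name in wanted}

def neighbors_of(name_list, network):
    """Collect nodes linked to any identifier in the name list."""
    wanted = set(name_list.names)
    found = set()
    for source, target in network.edges:
        if source in wanted:
            found.add(target)
        if target in wanted:
            found.add(source)
    return found - wanted

selection = NameList(["gene2"])
expression = Matrix(["gene1", "gene2"], ["t0", "t1"], [[0.1, 0.2], [0.5, 1.5]])
interactions = Network([("gene1", "gene2"), ("gene2", "gene3")])

selected_rows = rows_for(selection, expression)
linked = neighbors_of(selection, interactions)
```

The user only ever passes around instances like `selection`. The join between the matrix and the network happens through the shared identifiers, with no schema mapping in sight.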
More papers in a similar vein
Tuesday, December 02, 2008
I may as well come clean and admit that I'm developing a genome browser. What? Another genome browser? Why? You may well ask these questions. Well, it's a long story. But here is a completely non-exhaustive list of existing genome browsers.
- The classics: UCSC Genome Browser Home and paper and its microbial sibling.
- Argo, a Java rich-client genome browser built at the Broad Institute.
- Integrative Genomics Viewer also from Broad (see press release).
- Affymetrix spun out its Integrated Genome Browser into an open source project, along with a library of re-usable components called GenoViz.
- x:map, an AJAX genome browser based on the Google Maps API.
- The biology of extremophiles lab at the University of Paris Sud has a nice little web-based browser for Sulfolobus.
- NCBI has a new AJAX tool called Sequence Viewer
- Flash-based OmicBrowse is apparently big in Japan.
- MochiView is really nice. Read the paper in BMC Biology.
- GenomeGraphs: integrated genomic data visualization with R
- Visualization guru Ben Fry (of Processing fame) wrote at least two: cd36 browser and a handheld genome browser
- A guy who calls himself Saaien Tist implemented a circular genome browser in ruby-processing
- Back in 2002, some Canucks built BioViz, an SVG-based genome browser
- The Savant Genome Browser is a desktop visualization tool for genomic data. It was primarily developed for visualizing high throughput (aka next generation) sequencing data... Savant comes out of the Computational Biology Lab at the University of Toronto - also home of Cytoscape Web.
Note: updated in Sept. 2009 to reflect the fact that everyone and their uncle built a genome browser this past couple of years. See Brother, can you spare a genome browser?
Note: updated again in May of 2010 and again in Feb 2011 to add Savant.