Saturday, November 22, 2014

Haskell class, so far

Well, I'm about 5 weeks into Introduction to Functional Programming, a.k.a. FP101x, an online class taught in Haskell by Eric Meijer. The class itself is a couple weeks ahead of that; I'm lagging a bit. So, how is it so far, you ask?

The first 4 weeks covered basic functional concepts and how to express them in Haskell, closely following chapters 1-7 of the book, Graham Hutton's Programming in Haskell:

  • Defining and applying functions
  • Haskell's type system
    • parametric types
    • type classes
    • type signatures of curried functions
  • pattern matching
  • list comprehensions
  • recursion
  • higher-order functions

Haskell's hierarchy of type classes is elegant, but some obvious things seem to be missing. For example, you can't show a function. But, it would be really helpful to show something like a docstring, or at least the function's type signature. Also machine-word sized Int's don't automatically promote, so if n is an Int, n/5 produces a type error.

Most of the concepts were familiar already from other functional languages, Scheme via SICP, OCAML via Dan Grossman's programming languages class, and Clojure via The Joy of Clojure. So, this early part was mostly a matter of learning Haskell's syntax.

Some nifty examples

  • a recursive definition of factorial:

    factorial :: Integer -> Integer
    factorial 0 = 1
    factorial n = n * (factorial (n-1))
  • sum of the first 8 powers of 2:

    sum (map (2^) [0..7])
  • a recursive definition of map:

    map :: (a -> b) -> [a] -> [b]
    map f [] = []
    map f (x:xs) = f x : map f xs
  • get all adjacent pairs of elements from a list:

    pairs :: [a] -> [(a,a)]
    pairs xs = zip xs (tail xs)
  • check if a list of elements that can be ordered is sorted by confirming that each pair of elements is ordered:

    sorted :: Ord a => [a] -> Bool
    sorted xs = and [x <= y |(x,y) <- pairs xs]

Haskell attains its sparse beauty by leaving a lot implied. One thing I figured out during my brief time with OCAML also seems to apply to Haskell. Although these languages lack the forest of parentheses you'll encounter in Lispy languages, it's not that the parentheses aren't there; you just can't see them. A key to reading Haskell is understanding the rules of precedence, associativity and fixity that imply the missing parentheses.

Pre- cedence Left associative Non- associative Right associative

9

!!

.

8

^,^^,**

7

*, /, `div`, `mod`, `rem`, `quot`

6

+,-

5

:,++

4

==, /=, <, <=, >, >=, `elem`, `notElem`

3

&&

2

||

1

>>, >>=

0

$, $!, `seq`

Another key is reading type signatures of curried functions, as currying is the default in Haskell and is relied upon extensively in composing functions, particularly in the extra terse "point-free" style.

Currently, I'm trying to choke down Graham Hutton's Addendum on Monads. If I end up understanding that, it'll get me a code-monkey merit badge, for sure.

Tuesday, November 11, 2014

The DREAM / RECOMB Conference 2014

The RECOMB/ISCB Conference on Regulatory and Systems Genomics, with DREAM Challenges and Cytoscape Workshops is running this week in San Diego.

A bunch of us from Sage Bionetworks are here to connect with the DREAM community. In introductory remarks, Stephen Friend framed the challenges as piloting new modes of collaboration and engagement addressing multidimensional problems based on the idea that open innovation will trump closed silos.

Lincoln Stein: The Future of Genomic Databases

I first heard Lincoln Stein speak at an O'Reilly conference in 2002, on building a bioinformatics nation. The same themes of openness and integration reappeared in Stein's talk on The Future of Genomic Databases.

Stein asks, "Open Data + open source = reproducible science?" Not exactly. Stein presents some emerging solutions to the remaining obstacles: big data sets, complex workflows, unportable code and data access restrictions.

Cloud computing, specifically colocation of data and compute, enables handling big data. Containers (ie Docker) address the problem of code portability. The Global Alliance is working towards providing APIs both to encapsulate technical complexity and to provide a control point at which to enforce restrictions.

In case we're wondering what to do with all the machine cycles made available by Amazon and Google, bioinformatics workflows are growing in complexity. Workflow managers like Seqware and Galaxy provide a formalized description of multistep processes and manage tools and their dependencies.

Legal restrictions hinder data integration. But, donors want their samples to contribute to research. Licensure for data access combined with uniform consent could reduce the friction resulting in a streamlined data access process. On the other hand, technical solutions involve homomorphic encryption and agent based federated queries.

As a parting thought, Stein notes that digital infrastructure enables experiments in incentive structures and economic models, citing micropayments, ratings, and challenges.

Andrea Califano

Andrea Califano spoke on the genotype to phenotype linkage in cancer. Thinking of the cell as an integrator of signals, Califo's group traces from gene or protein expression signatures of cell states (normal, neoplastic, metastatic) back through the network to the master regulators responsible for that signature. One related paper is Identification of Causal Genetic Drivers of Human Disease through Systems-Level Analysis of Regulatory Networks.

Paul Boutros Somatic Mutation Calling Challenge

Paul Boutros presented the Somatic Mutation Calling Challenge (SMC-DNA). He announced the intention for the SMC challenges to become a living benchmark, an objective standard against which future methods will be tested.

Paul also crowned the Broad Institute's MuTect (single nucleotide) and novoBreak (structural variants) by Ken Chen's lab at MD Anderson the winners of the synthetic tumor phase of SMC-DNA. The plan is to announce winners on real tumor data in February after experimental validation.

The Winners

The SMC challenge is a bit unique for DREAM in its level of specialization. In the other challenge, a couple of methods were highlighted: Gaussian process regression and dictionary learning for sparse representation.

But, increasingly, the main differentiator is application of biological domain knowledge, especially with respect to selecting and processing features. Li Liu of Arizona State's Biodesign Institute, for example, won part of the Accute Myoloid Leukemia challenge by weighting proteins based on their evolutionary conservation.

Another theme is that genetic features seem to have poor signal compared to more downstream features, gene expression or clinical variables. Peddinti Gopalacharyulu, a top performer in the Gene Essentiality Challenge, commented that perhaps the way to use genetics is to extract the component of gene expression that is not explained by genetic features.

DREAM 9.5

Two of the Dream 9.5 challenges are follow-ups to the Somatic Mutation Calling challenge from the 8.5 round. The SMC empire expands into RNA and tumor heterogeneity. In the olfaction challenge, the goal is to predict, from molecular features, odor as described by human subjects. The Prostate cancer challenge asks participants to classify patients according to survival using data sourced from the comparator arms of clinical trials.

For the DREAM 10 round, there's an imaging challenge in the works and a sequel to the ALS challenge challenge from DREAM 7.

On to RECOMB

That's just the DREAM part of the meeting, or, really, the subset that fit into my brain. As an added bonus, there were several representatives from Cytoscape-related projects and some conversation about the Global Alliance for Genomics and Health.