While flipping through a recent issue of Nature, I came across an article discussing the finishing of the human genome sequence. The public version of the project began in 1990 and cost 3 billion dollars to ‘complete’ (Celera’s private initiative cost about a tenth of that, though it did benefit from access to data from the public effort). Almost ten years after the draft sequence was released in 2001, the project is still not finished. There were major updates to the draft sequence in 2003, 2004, and 2006, when, in an event consistent with the eschatological beliefs of the evangelical head of the project, Francis Collins, the finished, annotated sequence of the final chromosome, chromosome 1, was published.

The annotation work will keep researchers busy as long as there are humans capable of doing advanced science. The sequencing work may appear straightforward in comparison to the functional annotation, but it too will not be finished any time soon. The biggest issue is gaps in the genomic sequence, caused by long stretches of repetitive DNA that are difficult to sequence and assemble. It’s a technically tough problem to solve. New methodologies, like next-generation sequencing, make it possible to fill in some of the gaps, but it is likely that any truly finished version of the genome will require a technology not yet invented. The following diagram shows the extent of the problem areas throughout the genome:

Part of the problem is also hinted at by the repeated reference to ‘the genome’ sequence. In fact, it’s not a single genome that’s been sequenced. Rather it’s an amalgam of 4 genomes (2 men and 2 women) selected from 20 candidate donors. The author writes:

“It was put together this way to maintain anonymity for those who contributed the DNA and to ensure that the sequence represented all humanity – ‘our shared inheritance’, as then-head of the project, Francis Collins, said.”
Maintaining sample anonymity this way makes sense; it’s also how HapMap does it. But, to me, it would have been wiser to limit the possibility of any confounding sequence variability by going with a single sample. The problem, perhaps, was that only a male sample would have the full complement of autosomal and sex chromosomes. Now, maybe there were good scientific reasons for choosing this particular sampling strategy, but it reminds me of that ‘Bad Idea Jeans’ commercial from SNL:
“I thought, hey, no one else has sequenced the genome of a free-living organism, let alone something as complicated as a human, so why not add some potentially complicating factors?”
Of course, what they were more likely thinking was:
“Individuals and human groups are nearly the same genetically, so it should make little difference if we used a mixed sample.”
If the only potentially confounding variation came from single-nucleotide polymorphisms (SNPs), then there might not be a big problem. But extensive structural variations (in particular, copy number variants, or CNVs), whose existence should not have surprised anyone whose name wasn’t Lewontin or Gould, result in more complex differences between genomes and are causing problems in creating a finished assembly. The two pitch lines for ‘Bad Idea Genomics’ are silly. The premise upon which the second is based, though, is no laughing matter. And it’s everywhere. Let’s hope the Acme Anti-Racist Corporation that uses this premise in selling its genocidal products goes belly up.