IMPROVING SEQUENCE ASSEMBLIES USING HIGH-QUALITY OVERLAPS

Mike Roberts, James Yorke, Brian Hunt, Wayne Hayes, Cevat Ustun, Aleksey Zimin
      University of Maryland,
                    and
    Paul Havlak, Baylor College of Medicine.

Finishing a genome costs about as much as the initial assembly, with
most of that cost directed towards filling gaps (Celniker et al, Genome
Biology 2002-03-12).  Since initial assemblies typically get 95-99%
of the sequence, any improvement in quality and amount of sequence to
bring us closer to 100%, no matter how small, translates into an
enormous cost savings for the finishing step.

Recall that one of the first steps in genome sequence assembly is
determining which reads overlap.  In this talk we will present recent
results from a collaboration between the University of Maryland and
the Baylor College of Medicine which measures the effect on assembly of
various techniques for computing overlaps, while the remainder of the
assembly process remains unchanged. The efficacy of some of the Maryland
techniques have already been demonstrated last year in collaboration
with Celera Genomics in their assembly of Drosophila melanogaster; here
we study their effect on the assembly of the genome of Rattus norvegicus.
As a basis for comparison, we test our assemblies against a small amount
of independently finished sequence which exists for R. norvegicus.

The Atlas assembly at Baylor has already produced a high-quality draft
sequence for R. norvegicus.  Nonetheless, this still leaves some five
percent of the mapped scaffolds in gaps.  We find that when the set of
overlaps are more carefully selected before being fed to Atlas, the
quality of the scaffolds improves over the already high quality assembly.
Specifically, the total amount of sequence produced, correctness of
individual bases, and contig length improve.

Read Extension.
Trimmed reads have far fewer bases than untrimmed reads.  Making use
of some of the low quality region is of considerable value since the U.S.
government alone spends roughly $100 million generating these sequences
annually. We use multi-read-comparison based error correction to generate
a consensus sequence across long stretches of low-quality bases.  We find
that several moderately low-quality overlapping sequences can give us as
much information as a single high-quality sequence.