CSC2431: Topics in Computational Biology
Analysis of Next Generation Sequencing Data

Winter 2008

Classes: W 11-1 in Bahen 025
Instructor: Michael Brudno
Office: Pratt (PT) 286C & CCBR 604
Office Hours: By appointment

Announcements.
General information.
Topics & Reading
A guideline for writing paper summaries

Announcements

Note that the April 2nd class has been rescheduled for April 4th (same time, same room: 11am BA 025)
We have a new room -- B025. Hopefully it will fit all of us.
1/26 -- Readings for this week are finally up.
1/22 -- We now have a Google group: uoft-csc2431. Please sign up for it. Still no word on a room change. Additionally, I've put together a very brief guide on what I expect from the paper summaries.
1/19 -- Reading for Jan 23 is posted below. Watch this space for a room change announcement (requested, but no word yet).

Next Generation Sequencing (NGS) technologies, such as Illumina/Solexa, AB SOLiD, and 454 Pyrosequencing, are revolutionizing the acquisition of genomics data. These platforms offer much lower costs and faster data acquisition, but the reads they produce are much shorter: from 500-1000 base pairs with classical sequencing down to as little as 25 base pairs per read. At the same time, the methodologies offer several important advantages, for example the ability to acquire paired reads on a very large scale.

The development of NGS is forcing a reconsideration of the computational methods used for genome analysis, with the problems of read mapping and genome assembly becoming much more complex. At the same time, NGS is enabling the development of methods for problems that genome sequencing did not previously address, such as the prediction of structural or copy number polymorphisms. NGS data has a very different error model, requiring modifications to classical algorithms, and the sheer size of the data requires efficient algorithms, appropriate hardware, and efficient implementations. In this class we will explore the features of NGS data that make it different from classical sequencing data, and try to determine which methods could address some of these differences. Because of the novelty of the data and of the problems, the emphasis will be on discovering the right solutions, rather than just learning about them.
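One way to see why shorter reads make read mapping harder is a toy exact-match mapper. The sketch below is only an illustration, not a method from the course readings: the reference string, the reads, and the function names `build_index` and `map_read` are all made up for this example. It indexes every substring of a fixed length in a small reference containing a repeat, then looks up a short read and a longer read that starts at the same position.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read: str, index: dict[str, list[int]]) -> list[int]:
    """Return all exact-match positions of a read (read length must equal k)."""
    return index.get(read, [])


# Hypothetical toy reference: the segment GATTACA occurs twice,
# with different bases following each copy.
reference = "CCGATTACATTGGGATTACAGGC"

short_read = "GATTACA"    # 7 bases, lies entirely inside the repeat
long_read = "GATTACATT"   # 9 bases, extends past the first copy of the repeat

short_hits = map_read(short_read, build_index(reference, len(short_read)))
long_hits = map_read(long_read, build_index(reference, len(long_read)))

print("short read maps to", short_hits)
print("long read maps to", long_hits)
```

The short read falls entirely within the repeated segment and so maps to both copies, while the longer read reaches into unique flanking sequence and maps to one place. Real mappers must also tolerate sequencing errors and genuine variation, which this exact-match sketch ignores.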

The prerequisite is CSC 2417 -- Algorithms for Genome Analysis, or permission of the instructor. Permission will be given if you have a basic knowledge of molecular biology (transcription, etc.), a strong background in algorithms (at least CSC 373 level), and familiarity with basic probability theory.

Grading:
The basic requirements for the class will be a course project (60% of the grade), paper presentations and participation (20% of the grade) and written paper summaries (20% of the grade).

Syllabus & Readings

January 9 -- Organizational Meeting
January 16th -- Next Generation Sequencing Platforms. Presenter: Michael Brudno
Reading: Nature Methods -- Method of the Year 2007. pp 11-21.
Background:
DNA Sequencing
Shotgun sequencing
January 23rd -- Mapping Reads, Long & Short
Q-gram filtering techniques Presenter: Joe Whitney
A review of spaced seeds Presenter: Michael Brudno
Note: this is rather long. You only need to read it through Section 4.3, and also Section 4.5 (up to, but not including 4.5.1)
January 30th -- Detecting Variation with Short Reads
PolyBayes. You may also be interested in pyroBayes, however it is NOT one of the assigned readings for the class. Presenter: Ruslan Salakhutdinov Slides
SHRiMP (pre-print, only from on-campus) Presenter: Steve Rumble Slides
February 6th -- Structural Variation with Clone-end data
Structural variation in the Human Genome Presenter: Hilal Kosucu Slides
Robust method for detecting Structural Variations Presenter: Seunghak Lee Slides
Another paper that may be of interest (but is not required) is Korbel et al.
February 13th -- Finding Copy Number Variation with NGS Data
Microarray methods for Copy Number Variants. You may want to refer to this paper for an overview of the field. Presenter: Justin Ho Slides
Ab initio assembly with Short Reads. Concentrate on the copy count prediction part (ignore the matepair stuff if you wish). Presenter: Lucas Lochovsky Slides
February 20th -- No class - reading week
February 27th -- Specialized Hardware for Read Mapping and other applications
ClawHMMER. You may want to read the original HMMER user guide if you need more background. Presenter: Tim Smith Slides
MUMmerGPU Presenter: Steve Rumble Slides
March 5th -- Genome assembly algorithms
Arachne Assembler Presenter: Ilya Sutskever Slides
Genome co-assembly Presenter: Vlad Yanovsky Slides
March 12th -- Assembly Algorithms for NGS Data
Note: these papers are only accessible from on campus.
Velvet Assembler Presenter: Seunghak Lee Slides
Allpaths Assembler Presenter: Tim Smith Slides
March 19th -- More Assembly & ChIP-Seq
Edena assembler Presenter: Lucas Lochovsky Slides
ChIP-Seq: Johnson et al. Presenter: Leo Li Slides
March 26th -- Transcriptome Profiling
Gene Expression Profiling Presenter: Russ Salakhutdinov Slides
De novo transcriptome sequencing Presenter: Ilya Sutskever Slides
April 4th NOTE UNUSUAL DATE -- microRNAs & Data Storage Issues
microRNAs & NGS Presenter: Hilal Kosucu
Compressed Sequence Alignment Presenter: Vlad Yanovsky
April 9th -- HCI & Other
Hawkeye Assembly Viewer Presenter: Justin Ho
Variation discovery with NGS Presenter: Joe Whitney

Writing paper summaries

Each person taking the class for credit is responsible for submitting a one-page summary of *at least two* of the assigned papers before every class. The system for grading them will be a simple check-off, so no need to sweat too much. From the writeup I am looking for evidence that you read the papers and thought about them. Some evidence of this would be discussing (1) the weaknesses of the paper (the strengths are in the abstract :)) and (2) if the method is not directly applicable to NGS, how it could be used there. The writeup need not be long or thoroughly polished; it is supposed to be evidence that you've done the work, not work in itself. If you are presenting a paper, you are exempt from doing a writeup that week.

The whole point of the paper summaries is to make sure that you've read the papers before coming to class. However, I will allow you to hand in no more than two summaries up to 2 days late (by Friday of the same week).

Administrative details:

The class will satisfy the 2c breadth.
