Current approach:
- Clean reads: discard reads with poor-quality regions and remove any adapters that can be found. Especially if coverage is high, reads should be ULTRA clean. A coverage of ~20x with very clean reads will produce a better assembly than a coverage of 40x with poor-quality reads. Coverage above 60x seems to confuse assemblers and represents a lost opportunity to clean your reads better.
- Cleaning Illumina PE reads consists of these steps, implemented in my Python script "QualityScreen_Illumina.py":
- Steps borrowed from Pathogens: Genes and Genomes
- Trim reads for quality ---> discard poor-quality reads instead; see -q and -m
- Eliminate singletons ---> see -i and -d
- Throw out reads with N ---> trim leading/trailing Ns, and trim to remove internal Ns if possible; otherwise the read is discarded
- Throw out reads with low average quality ---> see #1
- As well as a number of home-grown ideas:
- Hardclipping - allows user to indicate clip points which will be applied to all reads
- Retain SE - keeps good single-end reads in order to improve coverage if needed
- Discard TruSeq reads - finds reads with at least 8 bases of TruSeq adapter and discards them, using the fast fuzzy regular-expression library "tre" to allow one mismatch/indel in the match.
- quality types - supports input of old and new Illumina quality types, see -t
- read-id reformatting - reformats read IDs so that they are compatible with Newbler 2.6, but retains information from the original read name.
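For illustration, here is a stripped-down sketch of a few of the screening steps above. This is not the actual QualityScreen_Illumina.py; the thresholds and function names are my own assumptions, and the one-mismatch adapter scan naively mimics in pure Python what "tre" does much faster:

```python
# Illustrative sketch of read screening (NOT QualityScreen_Illumina.py).
# Thresholds below are hypothetical stand-ins for -q/-m style options.

TRUSEQ = "AGATCGGAAGAGC"  # common TruSeq adapter prefix

def mean_qual(quals, offset=33):
    """Average Phred quality of a quality string (Sanger offset 33)."""
    return sum(ord(q) - offset for q in quals) / len(quals)

def trim_ns(seq, qual):
    """Trim leading and trailing Ns from a read and its qualities."""
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    return seq[start:end], qual[start:end]

def has_adapter(seq, adapter=TRUSEQ, min_len=8, max_mismatch=1):
    """Scan for >= min_len bases of adapter, allowing one mismatch
    (a naive stand-in for tre's approximate regex matching)."""
    probe = adapter[:min_len]
    for i in range(len(seq) - min_len + 1):
        mism = sum(a != b for a, b in zip(seq[i:i + min_len], probe))
        if mism <= max_mismatch:
            return True
    return False

def keep_read(seq, qual, min_q=20, min_len=50):
    """Discard instead of trimming: N-trim, then reject short reads,
    reads with internal Ns, low mean quality, or adapter hits."""
    seq, qual = trim_ns(seq, qual)
    if len(seq) < min_len or "N" in seq:
        return False
    if mean_qual(qual) < min_q or has_adapter(seq):
        return False
    return True
```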
- Use Newbler to assemble 454 + Illumina PE reads in a hybrid assembly. Why Newbler?
- Newbler seems to handle homopolymers better than other assemblers.
- Newbler makes an attempt at scaffolding contigs with short-insert PE reads which often results in longer sequences.
- If the data is carefully cleaned, Newbler is much faster than many other assemblers.
- Newbler reliably creates good assemblies while other assemblers sometimes do bizarre things.
- Order and orient contigs using optical mapping data if available. If not, alignments against closely related genomes are the only real option.
Problems with the current approach:
- Newbler often reports near-100% read incorporation, with very few multiply mapped reads. Despite this, there are a number of odd things about Newbler:
- Unless the -rip switch is used, Newbler will split reads across contigs. Details of how reads were split are reported in 454ReadStatus.txt. In some assemblies I have observed 2000+ reads being split between the same pair of contigs (suggesting a repeat because coverage was ~80x), yet for some reason Newbler keeps the contigs split.
- Although PE reads were used, Newbler often reports <40% of PE reads assembling into the same contig. It isn't clear whether this is due to an error in Newbler's code (perhaps they divide the number of properly mapped PE reads by the sum of forward and reverse reads?) or because Newbler is just really poor at using PE information.
- Newbler reports information about "edges" between contigs in the 454ContigGraph.txt file. In some cases I've found a single link between a set of 3 contigs (i.e. a---b---c) but these contigs didn't end up in the same scaffold, or even better in the same contig.
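One way to quantify the split-read behavior described above is to tally, per contig pair, the reads whose 5' and 3' halves landed in different contigs. This is a hedged sketch: the tab-separated column layout of 454ReadStatus.txt is assumed here from Newbler 2.x output and should be checked against your version before trusting the indices:

```python
# Count reads split between pairs of contigs, from 454ReadStatus.txt.
# Assumed columns (verify against your Newbler version!):
#   accno, status, 5' contig, 5' pos, 5' strand, 3' contig, 3' pos, 3' strand
from collections import Counter
import csv

def split_read_pairs(path):
    """Return a Counter mapping (contigA, contigB) -> number of
    assembled reads whose two halves were placed in different contigs."""
    pairs = Counter()
    with open(path) as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip the header line
        for row in reader:
            # Singletons/repeats have fewer columns; only "Assembled"
            # rows carry both 5' and 3' placements.
            if len(row) < 8 or row[1] != "Assembled":
                continue
            c5, c3 = row[2], row[5]
            if c5 != c3:  # read split across two contigs
                pairs[tuple(sorted((c5, c3)))] += 1
    return pairs
```

Contig pairs with counts in the thousands (like the 2000+ case above) are good candidates for collapsed repeats or joins Newbler declined to make.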
The current approach isn't too bad. Contigs are often in the 500 kb+ range, sometimes larger than 1 Mb. However, I am constantly experimenting with alternatives in hopes of getting better results.
New approach:
- Clean reads as before.
- Use the package Quake to fix remaining errors in Illumina PE reads.
- Assemble Illumina PE reads into long SE contigs using GapFiller (note that there is another interesting software package called GapFiller from a company called BASECLEAR... as if NGS wasn't complex enough, now we have multiple tools with the same name!!)
- Assemble Illumina contigs with 454 reads in Newbler.
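For intuition, the core idea of collapsing a PE pair into one longer SE sequence, when the insert is short enough that the two reads' 3' ends overlap, can be sketched like this. This is only the concept, not GapFiller's actual algorithm:

```python
# Toy illustration: merge a PE pair into one longer sequence when the
# forward read and the reverse-complemented mate share an exact 3'
# overlap. (Conceptual only -- not GapFiller's algorithm.)

def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(comp)[::-1]

def merge_pair(r1, r2, min_overlap=8):
    """Merge r1 with the reverse complement of r2 on the longest exact
    overlap of at least min_overlap bases; return None if none found."""
    r2rc = revcomp(r2)
    for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        if r1[-olen:] == r2rc[:olen]:
            return r1 + r2rc[olen:]
    return None  # no confident overlap; keep the pair unmerged

# A made-up 24 bp fragment sequenced as two 16 bp reads with 8 bp overlap:
r1 = "ACGTACGGAATTCCGG"
r2 = revcomp("AATTCCGGTTGGAACC")
print(merge_pair(r1, r2))  # ACGTACGGAATTCCGGTTGGAACC
```

Real merging tools score mismatches against base qualities instead of demanding an exact overlap, but the payoff is the same: longer effective reads going into the 454 + Illumina hybrid assembly.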
- Quake: /bio/software/HT_sequence/Quake/Quake/bin/quake.py -f reads -k 16
Quake complains that there is not enough coverage. I can't find anything in the documentation that describes the minimum coverage. I should have ~43x coverage with this dataset, but apparently that isn't enough? Despite this, I was able to correct the reads using:
- /bio/software/HT_sequence/Quake/Quake/bin/correct -f reads -k 16 -c 1.0 -m reads.qcts -p 4
Which produced the following results for read 1:
Validated: 878625
Corrected: 26251
Trimmed: 135749
Trimmed only: 132454
Removed: 1717
And for read 2:
Validated: 768091
Corrected: 38010
Trimmed: 239833
Trimmed only: 231000
Removed: 1946
I then attempted to assemble with the GapFiller/IGAassembler:
- IGAassembler --short-1 YeastMiSeq_PE1.cor.fastq --short-2 YeastMiSeq_PE2.cor.fastq --max-length 1500 --read-length 250 --output igaresults --statistics igastats --short-ins 600 --short-var 210
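Before getting to the crash: the ~43x coverage figure above can be sanity-checked against the numbers in this section. The genome size is an assumption (~12.1 Mb for S. cerevisiae); the read length comes from --read-length 250 and the read count from the "total number of reads" line in the IGAassembler log below:

```python
# Back-of-the-envelope coverage estimate. Genome size is an assumption
# (~12.1 Mb S. cerevisiae); read length and count are from the command
# line and log output in this section.
genome_size = 12_100_000   # bp, approximate yeast genome (assumed)
read_length = 250          # bp, from --read-length 250
n_reads = 2_072_262        # "total number of reads is 2072262"

coverage = n_reads * read_length / genome_size
print(f"~{coverage:.0f}x coverage")  # ~43x coverage
```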
But running it results in an ugly software crash:
Compute number of Readsto be continued.....
initialization: VM: 23872; RSS: 1684
....................short as seed: 1
total number of reads is 2072262
total number of seeds is 0
time needed for read sequences: 33.31 s
reads memorized in readsMultiTmp VM: 453504; RSS: 430788
Hash created VM: 616184; RSS: 593676
HASH->readsMulti populated VM: 616184; RSS: 593676
readMultiTMP deleted VM: 616184; RSS: 593676
HASH populated: VM: 616184; RSS: 593676
time needed to populate HASH: 36.24 s
*** glibc detected *** IGAassembler: double free or corruption (out): 0x0000000001e02920 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7ec96)[0x7f9ce9444c96]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSsD1Ev+0x23)[0x7f9ce9f50c13]
IGAassembler[0x40fc2f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f9ce93e776d]
IGAassembler[0x411bf1]
======= Memory map: ========
00400000-00449000 r-xp 00000000 08:06 14419938 /bio/local/bin/IGAassembler
00648000-00649000 r--p 00048000 08:06 14419938 /bio/local/bin/IGAassembler
00649000-0064a000 rw-p 00049000 08:06 14419938 /bio/local/bin/IGAassembler
........
7fff165ff000-7fff16600000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted (core dumped)