ASAP pipeline

Description of Method

The metagenomic data were analyzed using an internal pipeline, the Automatic Read-based Metagenomic Analysis Pipeline (ARMAP, version 1.6). The Illumina HiSeq 2000 FASTQ files were first evaluated with FastQC to check the quality profile, duplication rates and contamination rates. CD-HIT [1] was used to remove duplicate reads with an identity cutoff of 100%. NGS QC Toolkit (version 2.3.3) [2] was used for quality trimming and filtering. Poor-quality bases (quality score <20) were trimmed from the 3' end until the first base with a quality score >=20 was reached. Trimmed reads with a length of >120 were further filtered with an average quality score cutoff of 20. Reads with >1 ambiguous base were removed. High-quality reads were then converted to FASTA format and split into multiple partitions prior to a DIAMOND [3] search (BLASTx mode) against the NR database (Jan 2016) with an E-value cutoff of 1e-5, a coverage cutoff of 0.5 and a maximum of 50 target sequences. The DIAMOND search is compute-intensive and was run on a supercomputer with >200 nodes (~20 CPUs per node). The BLASTx outputs were submitted to MEGAN6 (Ultimate Edition, version 6.6) [4] for taxonomic classification and functional profiling with the following parameters: top percent of hits 10%, minimum score 50 and minimum support 1. The functional profiles for SEED Subsystems (3 levels), KEGG (4 levels), eggNOG (3 levels) and InterPro2GO (3 levels) were exported, and relative abundances were calculated by dividing the number of reads assigned to each function by the total number of high-quality reads. Metaxa2 [5] was used to identify 16S and 18S rRNA genes in the metagenomic reads for taxonomic classification.
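
The trimming, filtering and normalization rules described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the pipeline's own code (the pipeline relies on NGS QC Toolkit and MEGAN6 for these steps). All function and constant names are hypothetical. The sketch assumes that quality values are already decoded Phred scores, that the length rule means "keep reads longer than 120 bases", and that only N counts as an ambiguous base.

```python
from typing import Dict, List, Tuple

# Thresholds taken from the method description; names are hypothetical.
MIN_BASE_QUALITY = 20   # bases below this are trimmed from the 3' end
MIN_LENGTH = 120        # assumed reading: keep reads longer than 120 bases
MIN_MEAN_QUALITY = 20   # average quality cutoff after trimming
MAX_AMBIGUOUS = 1       # reads with more than one N are removed


def trim_3prime(seq: str, quals: List[int]) -> Tuple[str, List[int]]:
    """Trim low-quality bases from the 3' end until a base with quality >= 20."""
    end = len(quals)
    while end > 0 and quals[end - 1] < MIN_BASE_QUALITY:
        end -= 1
    return seq[:end], quals[:end]


def passes_filters(seq: str, quals: List[int]) -> bool:
    """Apply the length, mean-quality and ambiguous-base filters to a trimmed read."""
    if len(seq) <= MIN_LENGTH:
        return False
    if sum(quals) / len(quals) < MIN_MEAN_QUALITY:
        return False
    if seq.upper().count("N") > MAX_AMBIGUOUS:
        return False
    return True


def relative_abundance(counts: Dict[str, int], total_hq_reads: int) -> Dict[str, float]:
    """Divide the read count of each function by the total number of high-quality reads."""
    if total_hq_reads <= 0:
        raise ValueError("total_hq_reads must be positive")
    return {name: n / total_hq_reads for name, n in counts.items()}
```

For example, a read whose last two bases score 10 and 5 loses those two bases in trim_3prime, and a functional category with 25 assigned reads out of 100 high-quality reads has a relative abundance of 0.25. Note that the denominator is the total number of high-quality reads, not only the reads that received a functional assignment.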

1. Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658-1659.
2. Patel RK, Jain M: NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012, 7(2):e30619.
3. Buchfink B, Xie C, Huson DH: Fast and sensitive protein alignment using DIAMOND. Nat Methods 2015, 12(1):59-60.
4. Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res 2007, 17(3):377-386.
5. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Mol Ecol Resour 2015, 15(6):1403-1414.