Metagenomic Species (MGS) and Co-Abundance gene Groups (CAGs) of the Human Gut
This page contains data that supports the article:
Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes
Abstract
Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples.
H Bjørn Nielsen, Mathieu Almeida, Agnieszka Sierakowska Juncker, Simon Rasmussen, Junhua Li, Shinichi Sunagawa, Damian R Plichta,
Laurent Gautier, Anders G Pedersen, Emmanuelle Le Chatelier, Eric Pelletier, Ida Bonde, Trine Nielsen, Chaysavanh Manichanh,
Manimozhiyan Arumugam, Jean-Michel Batto, Marcelo B Quintanilha dos Santos, Nikolaj Blom, Natalia Borruel, Kristoffer S Burgdorf,
Fouad Boumezbeur, Francesc Casellas, Joël Doré, Piotr Dworzynski, Francisco Guarner, Torben Hansen, Falk Hildebrand, Rolf S Kaas,
Sean Kennedy, Karsten Kristiansen, Jens Roat Kultima, Pierre Leonard, Florence Levenez, Ole Lund, Bouziane Moumen, Denis Le Paslier,
Nicolas Pons, Oluf Pedersen, Edi Prifti, Junjie Qin, Jeroen Raes, Søren Sørensen, Julien Tap, Sebastian Tims, David W Ussery,
Takuji Yamada, Pierre Renault, Thomas Sicheritz-Ponten, Peer Bork, Jun Wang, Søren Brunak & S Dusko Ehrlich
Nature Biotechnology (2014) doi:10.1038/nbt.2939.
See also News and Views by Eran Mick & Rotem Sorek
For the mouse metagenomic species see here
Supplementary information at Nature Biotechnology
Data for Fig. 2B
R object: list
MGS and CAGs
R object describing the CAGs and MGS
This file contains an R object that describes all the CAGs and the MGS.
The object is a list containing the elements:
gene_sets: vectors of gene identifiers [list]
profiles_11M: median abundance profiles for each CAG based on counts from the 11M sequence reads per sample [matrix]
attributes: a table containing attributes for the CAGs, incl. size, number of sample-wise observations, internal correlation to profile, GC content and more [data.frame]
bestTax: the most specific taxonomy for each MGS [table]
taxonomy: taxonomy statistics at species, genus and phylum level [list of data.frames]
sampleInfo: sample information, incl. health status [data.frame]
phagelike: boolean identifying the phage-like CAGs [vector]
MGS: boolean identifying the metagenomic species [vector]
Co-Abundance gene Groups (CAG) table
Table linking CAG and MGS identifiers to the catalog genes
CAG abundance profiles (downsized to 11M reads)
Abundance profiles of the CAGs and MGS across the 396 stool samples. The gene abundances were based on Illumina sequence data downsized to 11M sequence reads per sample.
In this matrix, 'NA' indicates that the sample did not contain 11M sequence reads and that the presence of genes could therefore not be determined with a sensitivity comparable to that of the other samples.
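The 'NA' convention above matters whenever a profile is summarised: samples marked 'NA' should be left out rather than counted as zero. The following is a minimal sketch, assuming a profile has already been loaded into a Python list with `math.nan` standing in for 'NA'. The function names and the example values are hypothetical and are not part of the published data.

```python
import math

NA = math.nan  # assumed stand-in for the 'NA' entries of the matrix


def usable_values(profile):
    """Keep only abundances from samples that reached the common read depth."""
    return [v for v in profile if not math.isnan(v)]


def detection_rate(profile):
    """Fraction of usable samples in which the CAG was observed (abundance > 0)."""
    values = usable_values(profile)
    if not values:
        return math.nan  # no usable sample, so the rate is undefined
    return sum(v > 0 for v in values) / len(values)


# Hypothetical profile of one CAG across five samples; the third sample is NA.
example_profile = [0.0, 12.5, NA, 3.0, 0.0]
print(usable_values(example_profile))   # [0.0, 12.5, 3.0, 0.0]
print(detection_rate(example_profile))  # 0.5
```

Treating 'NA' as zero instead would count an undersequenced sample as a true absence and bias the detection rate downwards.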
CAG attributes
Table describing the CAGs and MGS in terms of number of genes, sample-wise observations, and median gene-wise Pearson correlation coefficient to the CAG median abundance profile across the MetaHit cohort and across the independent SG2 cohort
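The correlation attribute described above can be illustrated with a small worked example: take the sample-wise median over all genes of a CAG as its abundance profile, correlate each gene with that profile, and report the median of those correlations. This is a sketch under stated assumptions, not the authors' code. The input layout (one list of abundances per gene, no 'NA' values) and all function names are hypothetical, and restricting the calculation to one cohort is not modelled.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def median_profile(gene_profiles):
    """Sample-wise median abundance over all genes of one CAG."""
    return [median(column) for column in zip(*gene_profiles)]


def median_correlation_to_profile(gene_profiles):
    """Median of the gene-wise correlations to the CAG median profile."""
    profile = median_profile(gene_profiles)
    return median(pearson(gene, profile) for gene in gene_profiles)


# Hypothetical CAG with three genes measured in four samples.
genes = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, 2, 3, 5],
]
print(median_profile(genes))  # [1, 2, 3, 5]
print(round(median_correlation_to_profile(genes), 3))
```

A value close to 1 indicates that the genes of the group rise and fall together across samples, which is the property the co-abundance grouping relies on.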
allCAGs.fna.tar.gz (379MB)
Nucleic acid sequences for the genes in each CAG (incl. the MGS)
allCAGs.faa.tar.gz (264MB)
Amino acid sequences for the genes in each CAG (incl. the MGS)
MGS augmented genome assemblies
These can also be found at EBI (http://www.ebi.ac.uk/ena/) under accession IDs PRJEB674 to PRJEB1046
mgs_assembled_scaffolds.fa.gz (269MB)
Scaffolds from the MGS augmented assemblies
mgs_assembled_scaffolds.fa.annotation
Annotation for the MGS augmented assemblies
Software
Supplementary Software
Canopy clustering algorithm implemented in C
Please find a demo input abundance matrix here (this file needs to be unzipped before use).
Furthermore, please find a description of the post-clustering filters that we recommend here.
See also:
The canopy algorithm repository on Bitbucket, for updated versions and more.
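To make the idea of binning co-abundant genes concrete, the sketch below groups genes whose abundance profiles correlate strongly with a seed profile, then re-centres the group on its median profile until membership stops changing. It is a simplified illustration written for this page, not the C implementation offered above. The 0.9 correlation threshold, the iteration limit, the seed choice and all names are assumptions, and the recommended post-clustering filters are not applied.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan  # undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def canopy_cluster(profiles, min_corr=0.9, max_iter=10):
    """Group genes (dict: gene id -> abundance list) into co-abundance canopies."""
    remaining = dict(profiles)
    canopies = []
    while remaining:
        seed = next(iter(remaining))   # assumed seed choice: first unassigned gene
        centre = remaining[seed]
        members = [seed]
        for _ in range(max_iter):
            # The seed always stays in its own canopy, so every pass makes progress.
            candidates = [g for g, p in remaining.items()
                          if g == seed or pearson(p, centre) >= min_corr]
            if candidates == members:
                break                  # membership is stable
            members = candidates
            # Re-centre the canopy on the sample-wise median of its members.
            centre = [median(col) for col in zip(*(remaining[g] for g in members))]
        canopies.append(members)
        for g in members:
            del remaining[g]
    return canopies


# Hypothetical abundance profiles of five genes across five samples.
example_profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 1, 4, 1, 3],
    "geneD": [10, 2, 8, 2, 6],
    "geneE": [1.1, 2.0, 2.9, 4.2, 5.0],
}
print(canopy_cluster(example_profiles))
# [['geneA', 'geneB', 'geneE'], ['geneC', 'geneD']]
```

Correlation is used rather than a distance on raw abundances because genes from the same genome are expected to vary proportionally across samples even when their absolute counts differ.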
3.9 M Human Gut Gene Catalogue
MetaHit 3.9 M reference gene catalog (fasta, 847 MB)
Sequences of the non-redundant catalog of 3,871,657 genes (fasta)
De-novo assembled contigs (fasta, 1.9 GB)
The contig sequences from which the gene set originated. Furthermore, a reference table linking gene IDs to the contig IDs can be found here
Count matrix for the 3.9 M reference gene catalog (csv, 300 MB)
A 3.9M gene vs 393 samples sequence-read count-matrix, using 11M sequence reads per sample.
Three samples (O2.UC14.0, O2.UC21.0, V1.UC3.2) had fewer than 11M sequence reads and are therefore not included.
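Downsizing to a common depth, and excluding samples that fall short of it, can be sketched as follows. The page does not say how the downsizing was performed. Random sampling of reads without replacement is an assumption here, as are the function name, the toy depth and the example counts. The `counts` argument of `random.sample` requires Python 3.9 or later.

```python
import random
from collections import Counter


def downsize_counts(gene_counts, depth, seed=0):
    """Randomly draw `depth` reads from a sample without replacement.

    gene_counts maps gene id -> read count. Returns a new mapping whose
    counts sum to `depth`, or None if the sample has fewer than `depth` reads.
    """
    genes = list(gene_counts)
    counts = [gene_counts[g] for g in genes]
    if sum(counts) < depth:
        return None  # too shallow: the sample would be excluded
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    drawn = rng.sample(genes, depth, counts=counts)
    tally = Counter(drawn)
    return {g: tally.get(g, 0) for g in genes}


# Hypothetical sample with 100 reads, downsized to a toy depth of 20.
sample = {"gene1": 60, "gene2": 30, "gene3": 10}
downsized = downsize_counts(sample, depth=20)
print(sum(downsized.values()))             # 20
print(downsize_counts(sample, depth=500))  # None
```

Drawing without replacement means no gene can receive more reads than it had in the original sample, so the downsized counts remain a subset of the observed data.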
KEGG annotation of the gene catalog
R object: data.frame
eggNOG annotation of the gene catalog
R object: data.frame
Uniprot annotation of the gene catalog (125 MB)
R object: data.frame
SignalP predictions for the gene catalog
R object: data.frame containing SignalP v. 4.0, gram-negative, predictions.
Transmembrane predictions for the gene catalog
R object: data.frame.
Raw sequence data
Sequence data were deposited at EBI with the accession code
ERP002061
A table linking the MetaHIT Sample IDs to the EBI accessions can be found here:
MetaHitIDs2EBIacc.txt