Metagenomic Species (MGS) and Co-Abundance gene Groups (CAGs) of the Human Gut

This page contains data that supports the article:

    Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes

    Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples.

    H Bjørn Nielsen, Mathieu Almeida, Agnieszka Sierakowska Juncker, Simon Rasmussen, Junhua Li, Shinichi Sunagawa, Damian R Plichta, Laurent Gautier, Anders G Pedersen, Emmanuelle Le Chatelier, Eric Pelletier, Ida Bonde, Trine Nielsen, Chaysavanh Manichanh, Manimozhiyan Arumugam, Jean-Michel Batto, Marcelo B Quintanilha dos Santos, Nikolaj Blom, Natalia Borruel, Kristoffer S Burgdorf, Fouad Boumezbeur, Francesc Casellas, Joël Doré, Piotr Dworzynski, Francisco Guarner, Torben Hansen, Falk Hildebrand, Rolf S Kaas, Sean Kennedy, Karsten Kristiansen, Jens Roat Kultima, Pierre Leonard, Florence Levenez, Ole Lund, Bouziane Moumen, Denis Le Paslier, Nicolas Pons, Oluf Pedersen, Edi Prifti, Junjie Qin, Jeroen Raes, Søren Sørensen, Julien Tap, Sebastian Tims, David W Ussery, Takuji Yamada, Pierre Renault, Thomas Sicheritz-Ponten, Peer Bork, Jun Wang, Søren Brunak & S Dusko Ehrlich
    Nature Biotechnology (2014) doi:10.1038/nbt.2939.
    See also News and Views by Eran Mick & Rotem Sorek

For the mouse metagenomic species see here

Supplementary information at Nature Biotechnology

Data for Fig. 2B
R object: list

MGS and CAGs

R object describing the CAGs and MGS
This file contains an R object that describes all the CAGs and the MGS. The object is a list containing the elements:
    gene_sets: vectors of gene identifiers [list]
    profiles_11M: median abundance profiles for each CAG based on counts from the 11M sequence reads per sample [matrix]
    attributes: a table containing attributes for the CAGs, incl. size, number of sample-wise observations, internal correlation to profile, GC content and more [data.frame]
    bestTax: the most specific taxonomy for each MGS [table]
    taxonomy: taxonomy statistics at species, genus and phylum level [list of data.frames]
    sampleInfo: sample information, incl. health status [data.frame]
    phagelike: boolean identifying the phage-like CAGs [vector]
    MGS: boolean identifying the metagenomic species [vector]

Co-Abundance gene Groups (CAG) table
Table linking CAG and MGS identifiers to the catalog genes

CAG abundance profiles (downsized to 11M reads)
Abundance profiles of the CAGs and MGS across 396 stool samples. The gene abundances were based on Illumina sequence data downsized to 11M sequence reads per sample. In this matrix, 'NA' indicates that the sample did not contain 11M sequence reads and that the presence of genes consequently could not be determined with a sensitivity comparable to that of the other samples.
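As a minimal sketch of how the 'NA' entries described above might be handled, the Python snippet below reads a small tab-separated profile table, keeps 'NA' as a missing value rather than as zero abundance, and lists the samples that are 'NA' for every CAG. The tab-separated layout, the function names and the identifiers in the demo table are assumptions made for illustration only; they are not taken from the downloadable files.

```python
import csv
import io
import math


def read_profiles(handle):
    """Parse a tab-separated table: header row of sample IDs, one row per CAG.

    'NA' is stored as NaN so that it is never mistaken for zero abundance.
    """
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # first header cell is the row-name column
    profiles = {}
    for row in reader:
        profiles[row[0]] = [math.nan if v == "NA" else float(v) for v in row[1:]]
    return samples, profiles


def undersequenced_samples(samples, profiles):
    """Return the samples whose value is missing for every CAG."""
    return [
        sample
        for i, sample in enumerate(samples)
        if all(math.isnan(values[i]) for values in profiles.values())
    ]


# Hypothetical demo table (identifiers invented for this example).
demo = io.StringIO(
    "id\tS1\tS2\tS3\n"
    "CAG00001\t0.5\tNA\t0\n"
    "CAG00002\t0\tNA\t1.2\n"
)
samples, profiles = read_profiles(demo)
print(undersequenced_samples(samples, profiles))
```

Keeping the missing values as NaN makes it harder to accidentally count an under-sequenced sample as one in which the CAG is absent.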

CAG attributes
Table describing the CAGs and MGS in terms of number of genes, sample-wise observations, and median gene-wise Pearson correlation coefficient to the CAG median abundance profile, across the MetaHIT cohort and across the independent SG2 cohort.
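The correlation attribute above can be illustrated with a small Python sketch: the CAG median abundance profile is taken as the per-sample median over the member genes, and the attribute as the median of the gene-wise Pearson correlations to that profile. This is a reading of the description on this page under stated assumptions, not the code used to produce the table; the function names and the toy abundances are invented for the example, and missing values are assumed to have been removed beforehand.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant profile
    return sxy / sqrt(sxx * syy)


def cag_median_profile(gene_profiles):
    """Per-sample median abundance over all genes of one CAG."""
    return [median(column) for column in zip(*gene_profiles)]


def median_correlation_to_profile(gene_profiles):
    """Median of the gene-wise correlations to the CAG median profile."""
    profile = cag_median_profile(gene_profiles)
    return median(pearson(gene, profile) for gene in gene_profiles)


# Toy CAG with three genes measured in four samples (invented values).
toy_cag = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, 2, 3, 5],
]
print(cag_median_profile(toy_cag))
print(median_correlation_to_profile(toy_cag))
```

A value close to 1 indicates that most genes in the group follow the same abundance pattern across samples.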

allCAGs.fna.tar.gz (379MB)
Nucleic acid sequences for the genes in each CAG (incl. the MGS)

allCAGs.faa.tar.gz (264MB)
Amino acid sequences for the genes in each CAG (incl. the MGS)

MGS augmented genome assemblies

These can also be found at EBI (http://www.ebi.ac.uk/ena/) under accession IDs PRJEB674 to PRJEB1046
mgs_assembled_scaffolds.fa.gz (269MB)
Scaffolds from the MGS augmented assemblies

Annotation for the MGS augmented assemblies


Supplementary Software
Canopy clustering algorithm implemented in C
Please find a demo input abundance matrix here (this file needs to be unzipped before use).
Furthermore, please find a description of the post-clustering filters that we recommend here.
See also: canopy algorithm on bitbucket repository for updated versions and more.

3.9 M Human Gut Gene Catalogue

MetaHIT 3.9 M reference gene catalog (fasta, 847 MB)
Sequences of the non-redundant catalog of 3,871,657 genes (fasta)

De-novo assembled contigs (fasta, 1.9 GB)
The contig sequences from which the gene set originated. Furthermore, a reference table linking gene IDs to the contig IDs can be found here

Count matrix for the 3.9 M reference gene catalog (csv, 300 MB)
A sequence-read count matrix of 3.9M genes by 393 samples, based on 11M sequence reads per sample.
Three samples (O2.UC14.0, O2.UC21.0, V1.UC3.2) had fewer than 11M sequence reads and are therefore not included.
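To illustrate what downsizing to a fixed read depth can look like, the Python sketch below draws a fixed number of reads at random, without replacement, from the per-gene read counts of one sample, and returns None for a sample that has fewer reads than the target depth. This page does not specify how the downsizing was performed, so random subsampling without replacement is an assumption, as are the function name, the gene identifiers and the toy depth of 5 that stands in for 11M. The `counts` argument of `random.sample` requires Python 3.9 or later.

```python
import random
from collections import Counter


def downsize(counts, depth, seed=0):
    """Subsample one sample's reads to a fixed depth, without replacement.

    counts: dict mapping gene ID to read count for a single sample.
    Returns a dict of downsized counts, or None when the sample holds
    fewer than `depth` reads and therefore cannot be downsized.
    """
    if sum(counts.values()) < depth:
        return None
    genes = list(counts)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    drawn = rng.sample(genes, depth, counts=[counts[g] for g in genes])
    tally = Counter(drawn)
    return {gene: tally.get(gene, 0) for gene in genes}


# Toy sample with 10 reads, downsized to 5 (invented values).
deep_sample = {"geneA": 6, "geneB": 3, "geneC": 1}
shallow_sample = {"geneA": 2, "geneB": 1}

print(downsize(deep_sample, depth=5))
print(downsize(shallow_sample, depth=5))  # None: too few reads
```

Returning None instead of a partial result mirrors the way samples below the depth threshold are left out of the count matrix.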

KEGG annotation of the gene catalog
R object: data.frame

eggNOG annotation of the gene catalog
R object: data.frame

Uniprot annotation of the gene catalog (125 MB)
R object: data.frame

SignalP predictions for the gene catalog
R object: data.frame containing SignalP v. 4.0, gram-negative, predictions.

Transmembrane predictions for the gene catalog
R object: data.frame.

Raw sequence data

Sequence data were deposited at EBI with the accession code ERP002061
A table linking the MetaHIT Sample IDs to the EBI accessions can be found here: MetaHitIDs2EBIacc.txt