Metagenomic Species (MGS) and Co-Abundance gene Groups (CAGs) of the Human Gut

This page contains data that supports the article:

Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes

Abstract

Nature Biotechnology (2014) doi:10.1038/nbt.2939

News and Views by Eran Mick & Rotem Sorek

For the mouse metagenomic species see here

Supplementary information at Nature Biotechnology

Data for Fig. 2B
R object: list

MGS and CAGs

R object describing the CAGs and MGS
This file contains an R object that describes all the CAGs and the MGS. The object is a list containing the elements:

gene_sets:

profiles_11M:

attributes:

bestTax:

taxonomy:

sampleInfo:

phagelike:

MGS:

Co-Abundance gene Groups (CAG) table
Table linking CAG and MGS identifiers to the catalog genes

CAG abundance profiles (downsized to 11M reads)
Abundance profiles of the CAGs and MGS across 396 stool samples. Gene abundances were based on Illumina sequence data downsized to 11 M sequence reads per sample. In this matrix, 'NA' indicates that the sample did not contain 11 M sequence reads, so the presence of genes could not be determined with a sensitivity comparable to that of the other samples.
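
As an illustration of how the 'NA' entries might be handled downstream, the sketch below drops every sample that contains a missing value before averaging each profile. The in-memory layout (a dict mapping a CAG ID to a list of per-sample values, with None standing in for 'NA'), the function name drop_na_samples and the toy IDs and values are assumptions made for this example; they do not describe the layout of the R object.

```python
from statistics import mean


def drop_na_samples(profiles, sample_ids):
    """Remove every sample (column) in which any profile value is missing."""
    keep = [
        j for j in range(len(sample_ids))
        if all(row[j] is not None for row in profiles.values())
    ]
    kept_ids = [sample_ids[j] for j in keep]
    kept_profiles = {cag: [row[j] for j in keep] for cag, row in profiles.items()}
    return kept_profiles, kept_ids


# Toy data: hypothetical IDs and values; None stands in for 'NA'.
sample_ids = ["S1", "S2", "S3"]
profiles = {
    "CAG_A": [0.5, None, 1.5],
    "CAG_B": [0.0, None, 2.0],
}

clean_profiles, clean_ids = drop_na_samples(profiles, sample_ids)
mean_abundance = {cag: mean(row) for cag, row in clean_profiles.items()}
```

Dropping whole samples, rather than treating 'NA' as zero, keeps an under-sequenced sample from being read as a true absence.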

CAG attributes
Table describing the CAGs and MGS in terms of number of genes, sample-wise observations, and the median gene-wise Pearson correlation coefficient to the CAG median abundance profile, across the MetaHIT cohort and across the independent SG2 cohort.
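
To make the correlation attribute concrete, the sketch below computes a median gene-wise Pearson correlation coefficient against a median abundance profile for one toy gene group. It follows one plausible reading of the description (per-sample median across genes as the reference profile); the function names, the toy values and the choice to return 0.0 for a constant profile are assumptions, not the code used to produce the table.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def median_profile(gene_profiles):
    """Per-sample median abundance across all genes of one group."""
    return [median(column) for column in zip(*gene_profiles)]


def median_genewise_pcc(gene_profiles):
    """Median of each gene's correlation to the group's median profile."""
    reference = median_profile(gene_profiles)
    return median(pearson(gene, reference) for gene in gene_profiles)


# Toy group of three genes measured in four samples (made-up values).
toy_genes = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, 2, 3, 5],
]
toy_reference = median_profile(toy_genes)
toy_score = median_genewise_pcc(toy_genes)
```

A value close to 1 means that most genes in the group track the group's median profile closely across samples.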

allCAGs.fna.tar.gz (379MB)
Nucleic acid sequences for the genes in each CAG (incl. the MGS)

allCAGs.faa.tar.gz (264MB)
Amino acid sequences for the genes in each CAG (incl. the MGS)

MGS augmented genome assemblies

These can also be found at EBI (http://www.ebi.ac.uk/ena/) under accession IDs PRJEB674 to PRJEB1046.
mgs_assembled_scaffolds.fa.gz (269MB)
Scaffolds from the MGS augmented assemblies

mgs_assembled_scaffolds.fa.annotation
Annotation for the MGS augmented assemblies

Software

Supplementary Software
Canopy clustering algorithm implemented in C
Please find a demo input abundance matrix here (this file needs to be unzipped before use).
A description of the post-clustering filters that we recommend can be found here.
See also the canopy algorithm repository on Bitbucket for updated versions and more.
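
For readers unfamiliar with canopy-style clustering of abundance profiles, the sketch below shows the general idea in a heavily simplified, single-pass form: take a seed gene, collect every remaining gene whose profile correlates with the seed above a threshold, and repeat on what is left. This is not the Supplementary Software and makes no claim to match its behaviour; the function names, the 0.9 threshold and the toy profiles are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def simple_canopies(profiles, min_corr=0.9):
    """Greedy single-pass grouping of genes by correlation to a seed gene."""
    unassigned = list(profiles)  # gene IDs in input order
    canopies = []
    while unassigned:
        seed = unassigned[0]
        # The seed always joins its own canopy, so each pass removes
        # at least one gene and the loop terminates.
        members = [seed] + [
            gene for gene in unassigned[1:]
            if pearson(profiles[gene], profiles[seed]) >= min_corr
        ]
        canopies.append(members)
        assigned = set(members)
        unassigned = [gene for gene in unassigned if gene not in assigned]
    return canopies


# Toy abundance profiles (hypothetical gene IDs, four samples each).
toy_profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 2.5],
}
toy_canopies = simple_canopies(toy_profiles)
```

In this toy input, g1 and g2 rise together across the samples while g3 and g4 fall together, so the genes separate into two groups. The result of a greedy pass like this depends on the order in which seeds are visited.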

3.9 M Human Gut Gene Catalogue

MetaHIT 3.9 M reference gene catalog (fasta, 847 MB)
Sequences of the non-redundant catalog of 3,871,657 genes (fasta)
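
A quick sanity check after downloading is to count the records in the fasta file and compare the result with the 3,871,657 genes stated above. The sketch below is a minimal fasta reader written for that purpose; the function name read_fasta and the demo records are assumptions for illustration, and the header format of the real catalog is not described on this page.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Demo input with made-up records; a real file would be opened instead.
demo_lines = [">gene_1", "ATGC", "GGTA", ">gene_2", "TTGACA"]
records = list(read_fasta(demo_lines))
n_genes = len(records)
```

For the real catalog, an open file handle can be passed in place of demo_lines, and sum(1 for _ in read_fasta(handle)) gives the count without holding all sequences in memory.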

De-novo assembled contigs (fasta, 1.9 GB)
The contig sequences from which the gene set originated. Furthermore, a reference table linking gene IDs to the contig IDs can be found here

Count matrix for the 3.9 M reference gene catalog (csv, 300 MB)
A sequence-read count matrix of 3.9 M genes by 393 samples, using 11 M sequence reads per sample.
Three samples (O2.UC14.0, O2.UC21.0, V1.UC3.2) had fewer than 11 M sequence reads and are therefore not included.
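
The sketch below illustrates what downsizing a sample to a fixed read depth can look like, and why a sample with too few reads has to be left out. This page does not say how the downsizing was performed; random subsampling of reads without replacement is assumed here, and the function name, the seed parameter and the toy counts are hypothetical.

```python
import random
from collections import Counter


def downsize_counts(counts, depth, seed=0):
    """Subsample per-gene read counts to exactly `depth` reads.

    Returns None when the sample holds fewer than `depth` reads,
    because such a sample cannot be brought to the common depth.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    rng = random.Random(seed)
    # Expand to one entry per read, then draw without replacement.
    # This is simple but memory-hungry for millions of reads.
    reads = [gene for gene, n in counts.items() for _ in range(n)]
    sampled = Counter(rng.sample(reads, depth))
    return {gene: sampled.get(gene, 0) for gene in counts}


# Toy sample with 10 reads spread over three hypothetical genes.
toy_counts = {"gene_a": 6, "gene_b": 3, "gene_c": 1}
downsized = downsize_counts(toy_counts, depth=5)
too_shallow = downsize_counts(toy_counts, depth=50)
```

Bringing every sample to the same depth keeps detection sensitivity comparable between samples, which is why samples below the target depth are excluded rather than used as they are.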

KEGG annotation of the gene catalog
R object: data.frame

eggNOG annotation of the gene catalog
R object: data.frame

Uniprot annotation of the gene catalog (125 MB)
R object: data.frame

SignalP predictions for the gene catalog
R object: data.frame containing SignalP v. 4.0, gram-negative, predictions.

Transmembrane predictions for the gene catalog
R object: data.frame.

Raw sequence data

Sequence data were deposited at EBI with the accession code ERP002061
A table linking the MetaHIT Sample IDs to the EBI accessions can be found here: MetaHitIDs2EBIacc.txt