The Genome Structural Variation Consortium has conducted a CNV discovery project to identify common CNVs greater than 500 bp in size using array comparative genomic hybridization (array CGH) at tiling resolution on isothermal oligonucleotide arrays. We analyzed 20 female CEU (European ancestry) HapMap samples, 20 female YRI (African ancestry) HapMap samples and one Polymorphism Discovery Resource sample for CNVs using a set of NimbleGen CGH arrays that tile across the assayable portion of the genome, with approximately 42 million probes spread across twenty 2.1-million-probe (HD2) arrays. A single male CEU HapMap sample was used as the reference.
Samples used were: NA06985, NA07037, NA07045, NA11894, NA11931, NA11993, NA11995, NA12004, NA12006, NA12044, NA12156, NA12239, NA12287, NA12414, NA12489, NA12749, NA12776, NA12828, NA12878, NA15510, NA18502, NA18505, NA18508, NA18511, NA18517, NA18523, NA18858, NA18861, NA18907, NA18909, NA18916, NA19099, NA19108, NA19114, NA19129, NA19147, NA19190, NA19225, NA19240 and NA19257.
The reference sample was NA10851.
Normalization of the raw data was performed in three steps: 1) q-spline normalization was performed by NimbleGen; 2) GC effects were corrected by fitting a model with linear and quadratic effects of GC content to the log2 ratios, separately for each subarray; and 3) long-range spatial autocorrelation in the log2 ratios (the 'wave effect') was modeled and removed using the method described in Marioni et al. (2007). CNVs were then called on the normalized data using the GADA algorithm (Pique-Regi et al., 2008).
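The GC correction in step 2 can be illustrated with a minimal sketch. This is not the Consortium's code: the function name `gc_correct`, its arguments and the use of NumPy's ordinary least-squares polynomial fit are assumptions made for illustration only. The sketch fits log2 ratio as a quadratic function of probe GC content within each subarray and returns the residuals, so the corrected values are centred on zero within each subarray (also an assumption; the original pipeline may have handled the intercept differently).

```python
import numpy as np

def gc_correct(log2_ratio, gc_content, subarray):
    """Remove linear and quadratic GC effects, separately per subarray.

    log2_ratio : per-probe log2 ratios (test vs. reference)
    gc_content : per-probe GC fraction, same length as log2_ratio
    subarray   : per-probe subarray label, same length as log2_ratio

    Hypothetical helper for illustration; names and details are assumptions.
    """
    log2_ratio = np.asarray(log2_ratio, dtype=float)
    gc_content = np.asarray(gc_content, dtype=float)
    subarray = np.asarray(subarray)

    corrected = np.empty_like(log2_ratio)
    for label in np.unique(subarray):
        mask = subarray == label
        # Least-squares fit of: log2 ratio ~ a + b*GC + c*GC^2
        coeffs = np.polyfit(gc_content[mask], log2_ratio[mask], deg=2)
        # Subtract the fitted GC trend; the residuals are the corrected ratios
        corrected[mask] = log2_ratio[mask] - np.polyval(coeffs, gc_content[mask])
    return corrected
```

In practice a robust fit may be preferable to plain least squares, so that probes inside true CNVs do not distort the estimated GC trend.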
A custom Agilent CGH array was then used to target the majority of the CNV events detected by the discovery arrays. A small number of regions from other studies were also included. This array was run across 450 HapMap samples (using pooled DNA as the reference), and CNV genotypes were called.
These data are being released freely to the scientific community and can be considered a community resource. However, the data generators reserve the right to be the first to publish on the bulk data, as indicated by the Fort Lauderdale meeting report (see the data release policy below). Our groups are performing various global analyses of this dataset, including:
- generating a genome-wide map of copy number variation
- mapping the genome-wide CNV map onto functional annotation of the genome
- associations to SNP and haplotype variation
- associations to gene expression variation
- quantifying population differentiation for copy number variation
- investigating mechanisms of CNV formation
Authors who use data from this project for presentation and/or publication should acknowledge the project. Below is a sample acknowledgement statement:
This study makes use of data generated by the Genome Structural Variation Consortium (PIs Nigel Carter, Matthew Hurles, Charles Lee and Stephen Scherer) whom we thank for pre-publication access to their CNV discovery [and/or] genotyping data, made available through the websites http://www.sanger.ac.uk/humgen/cnv/42mio/ and http://projects.tcag.ca/variation/ as a resource to the community. Funding for the project was provided by the Wellcome Trust [Grant No. 077006/Z/05/Z], Canada Foundation of Innovation and Ontario Innovation Trust, Canadian Institutes of Health Research, Genome Canada/Ontario Genomics Institute, the McLaughlin Centre for Molecular Medicine, Ontario Ministry of Research and Innovation, the Hospital for Sick Children Foundation, the Department of Pathology at Brigham and Women's Hospital and the National Institutes of Health grants HG004221 and GM081533.
Users should note that the Consortium bears no responsibility for the further analysis or interpretation of these data, over and above that published by the Consortium.
The file containing the CNVE calls can be downloaded here.
CNV genotyping data on a subset of the CNVs discovered can be downloaded here.
Normalized intensity data have previously been released for this project and can currently be downloaded in 5 Mb segments. A more detailed description of those data and the links to download the intensity data can be found here.
The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the "Fort Lauderdale meeting", sponsored by the Wellcome Trust, one of the Project funders. The report from that meeting can be viewed here:
The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of "community resource projects", with the aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for their work.
The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate in such projects and produce and quickly release valuable large-scale data sets. "Responsible use" was defined as allowing the data producers to have the opportunity to publish the initial global analyses of the data, as articulated at the outset of the project. Doing so also will ensure that the data generated are fully described.
Wellcome Trust Sanger Institute: Don Conrad, Richard Redon, Chris Tyler-Smith, Nigel Carter, Matthew Hurles
The Centre for Applied Genomics: Steve Scherer, Lars Feuk, Dalila Pinto
Harvard Medical School, Brigham and Women's Hospital: Charles Lee, Omer Gokcumen