

The Human Genome Project involved laboratories in the United States, France, Great Britain, Germany, and Japan. It was financed in the United States by the National Institutes of Health and by the Department of Energy and in Great Britain by the Wellcome Trust of London. A comparable project using new DNA (genetic material) sequencing machines was begun as a private industry venture in the United States in 1998, with a stated goal of completing the mapping of the genome in three years.
Early in 2001 scientists from both teams jointly announced the "completion" of the mapping of the human genome, indicating that they had identified an estimated 30,000 genes (instead of the expected 100,000), constituting just 1% of the total human DNA. Subsequent comparison of the two teams' data has indicated that, because of differences in the genes identified by the teams, there may in fact be as many as 40,000 human genes. A subsequent, more refined estimate (2004) based on additional work on the genome was that there are between 20,000 and 25,000 genes. Work continues on further refining the sequencing of the genes on the chromosomes, eliminating the remaining gaps in the genome map, and identifying the extent of variation in the human genome. In 2007 the first sequences of human individuals (James D. Watson and J. Craig Venter, who led the public and private human genome sequencing efforts, respectively) were released. The NIH's National Center for Biotechnology Information maintains GenBank, a database of publicly available genetic sequences from the genomes of plants and animals, including some extinct species.
See studies by J. Sulston and G. Ferry (2003) and J. Shreeve (2004).
Licensed from Columbia University Press
The Human Genome Project was a landmark genome project and some have argued that the era of genomics is one of the more fundamental advances in human history.
Genome assembly
Genome assembly refers to the process of taking a large number of short DNA sequences, all of which were generated by a shotgun sequencing project, and putting them back together to create a representation of the original chromosomes from which the DNA originated. In a shotgun sequencing project, all the DNA from a source (usually a single organism, anything from a bacterium to a mammal) is first fractured into millions of small pieces. These pieces are then "read" by automated sequencing machines, which can read up to 900 nucleotides or bases at a time. (The four bases are adenine, guanine, cytosine, and thymine, represented as AGCT.) A genome assembly algorithm works by taking all the pieces and aligning them to one another, and detecting all places where two of the short sequences, or reads, overlap. These overlapping reads can be merged together, and the process continues.
Genome assembly is a very difficult computational problem, made more difficult because genomes contain large numbers of identical sequences, known as repeats. These repeats can be thousands of nucleotides long, and some occur in thousands of different locations, especially in the large genomes of plants and animals.
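The overlap-and-merge idea described above can be illustrated with a minimal sketch. This is a toy greedy approach, not the method used by any particular assembler; the function names, the min_overlap parameter, and the example reads are all hypothetical. It assumes error-free reads from a single strand and makes no attempt to resolve repeats, which is exactly where real assembly becomes hard.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Toy illustration only: assumes error-free, same-strand reads.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge; remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    # Three made-up reads taken from one short made-up sequence.
    reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
    print(greedy_assemble(reads))
```

With a repeat longer than the reads, several merges would look equally good to this procedure, so it could join reads that come from different places in the genome. Handling that ambiguity is a large part of what production assemblers do.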
Assembly software
Originally, most large-scale DNA sequencing centers developed their own software for assembling the sequences that they produced. However, this has changed as the software has grown more complex and as the number of sequencing centers has increased. Some well-known assembly programs include:
- Phred/Phrap by Phil Green was one of the first successful assemblers, widely used in the 1990s and early 2000s, especially for smaller genomes.
- AMOS (A Modular, Open-Source assembler) is a well-known open source effort to bring together the efforts of leading genome assembly code developers. The home of AMOS is currently http://amos.sourceforge.net. AMOS was initiated at The Institute for Genomic Research by Steven Salzberg, Mihai Pop, and Art Delcher, who are now at the University of Maryland.
- The Celera Assembler was the assembler developed by Gene Myers, Granger Sutton, Art Delcher, and others at Celera Genomics from 1998 until approximately 2002. It was moved to SourceForge and continues to be developed by the original scientists and others, at http://sourceforge.net/projects/wgs-assembler.
- The Arachne assembler began in 2000 as the doctoral thesis of Serafim Batzoglou, now at Stanford University. Since that time, it has been developed by a team led by David B. Jaffe at the Broad Institute, formerly part of the Whitehead Institute. It is available for download at http://www.broad.mit.edu/wga/arachnewiki/.
Genome annotation
Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:
- identifying elements on the genome, a process called gene finding, and
- attaching biological information to these elements.
Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation (a.k.a. curation) which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.
The most basic level of annotation uses BLAST to find sequence similarities and then annotates genomes based on those matches. However, more and more additional information is now added to the annotation platform. The additional information allows manual annotators to resolve discrepancies between genes that are given the same annotation.
For example, the SEED database uses genome context information, similarity scores, experimental data, and integrations of other resources in its Subsystems approach, with the aim of producing accurate genome annotations. The Ensembl database relies on both curated data sources and a range of different software tools in its automated genome annotation pipeline.
Structural annotation consists in the identification of genomic elements.
- ORFs and their localisation
- gene structure
- coding regions
- location of regulatory motifs
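As a small illustration of the first item in the list above, locating ORFs, the following sketch scans the three forward reading frames of a DNA string for a start codon (ATG) followed in frame by a stop codon (TAA, TAG, or TGA, the stop codons of the standard genetic code). The function name, the min_codons parameter, and the example sequence are hypothetical. The sketch ignores the reverse strand and introns, and real gene finders rely on much richer statistical models.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(sequence, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    `start` is the index of the A in ATG and `end` is one past the stop
    codon, so sequence[start:end] is the ORF including its stop codon.
    `min_codons` counts codons from ATG up to, but not including, the stop.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == START:
                start = pos  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"  # made-up example sequence
    for start, end, frame in find_orfs(dna):
        print(frame, start, end, dna[start:end])
```

Each reported interval is a candidate only; deciding whether it is a real gene, and what it does, is the job of the later structural and functional annotation steps.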
Functional annotation consists in attaching biological information to genomic elements.
- biochemical function
- biological function
- involved regulation and interactions
- expression
These steps may involve both biological experiments and in silico analysis.
A variety of software tools have been developed to permit scientists to view and share genome annotations.
Genome annotation is the next major challenge for the Human Genome Project, now that the genome sequences of human and several model organisms are largely complete. Identifying the locations of genes and other genetic control elements is often described as defining the biological "parts list" for the assembly and normal operation of an organism. Scientists are still at an early stage in the process of delineating this parts list and in understanding how all the parts "fit together".
Genome annotation is an active area of investigation and involves a number of different organizations in the life science community which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means. Here is an alphabetical listing of on-going projects relevant to genome annotation:
- ENCyclopedia Of DNA Elements (ENCODE)
- Ensembl
- Gene Ontology Consortium
- RefSeq
- Uniprot
- Vertebrate and Genome Annotation Project (Vega)
When is a genome project finished?
When sequencing a genome, there are usually regions that are difficult to sequence (often regions with highly repetitive DNA). Thus, 'completed' genome sequences are rarely ever complete, and terms such as 'working draft' or 'essentially complete' have been used to more accurately describe the status of such genome projects. Even when every base pair of a genome sequence has been determined, there are still likely to be errors present because DNA sequencing is not a completely accurate process. It could also be argued that a complete genome project should include the sequences of mitochondria and (for plants) chloroplasts as these organelles have their own genomes.
It is often reported that the goal of sequencing a genome is to obtain information about the complete set of genes in that particular genome sequence. The proportion of a genome that codes for genes may be very small (particularly in eukaryotes such as humans, where coding DNA may only account for a few percent of the entire sequence). However, it is not always possible (or desirable) to sequence only the coding regions. Also, as scientists understand more about the role of this noncoding DNA (often referred to as junk DNA), it will become more important to have a complete genome sequence as a background to understanding the genetics and biology of any given organism.
In many ways genome projects do not confine themselves to only determining a DNA sequence of an organism. Such projects may also include gene prediction to find out where the genes are in a genome, and what those genes do. There may also be related projects to sequence ESTs or mRNAs to help find out where the genes actually are.
Historical and Technological Perspectives
Historically, when sequencing eukaryotic genomes (such as the worm Caenorhabditis elegans) it was common to first map the genome to provide a series of landmarks across the genome. Rather than sequence a chromosome in one go, it would be sequenced piece by piece (with the prior knowledge of approximately where that piece is located on the larger chromosome). Changes in technology, and in particular improvements to the processing power of computers, mean that genomes can now be 'shotgun sequenced' in one go (though there are caveats to this approach when compared to the traditional approach).
Improvements in DNA sequencing technology have meant that the cost of sequencing a new genome has steadily fallen (in terms of cost per base pair), and newer technology has also meant that genomes can be sequenced far more quickly. When research agencies decide what new genomes to sequence, the emphasis has been on species which have either a relevance to human health (e.g. pathogenic bacteria or vectors of disease such as mosquitos) or species which have commercial importance (e.g. livestock and crop plants). Secondary emphasis is placed on species whose genomes will help answer important questions in molecular evolution (e.g. the common chimpanzee).
In the future, it is likely that it will become even cheaper and quicker to sequence a genome. This will allow for complete genome sequences to be determined from many different individuals of the same species. For humans, this will allow us to better understand aspects of human genetic diversity.
Example genome projects
Many organisms have genome projects that have either been completed or will be completed shortly, including:
- Humans, Homo sapiens; see Human genome project
- Neanderthal, Homo neanderthalensis (partial)
- Haemophilus influenzae, a bacterium (the first free-living organism to have its genome fully sequenced)
- Common House Mouse, Mus musculus
- Brown Rat, Rattus norvegicus
- Common Chimpanzee, Pan troglodytes; see Chimpanzee Genome Project
- Rhesus Macaque, Macaca mulatta
- Domestic Chicken, Gallus gallus
- Tammar Wallaby, Macropus eugenii
- Domestic Cat, Felis silvestris
- Domestic Dog, Canis lupus familiaris
- Common fruit fly, Drosophila melanogaster
- Baker's yeast, Saccharomyces cerevisiae
- Red bread mold, Neurospora crassa
- Thale Cress, Arabidopsis thaliana
- Rice, Oryza sativa
- Common Wheat, Triticum aestivum
- Maize, Zea mays
- Poplar, Populus trichocarpa (The first tree to have its genome fully sequenced)
- Escherichia coli, a coliform bacterium
- SARS virus
- California purple sea urchin, Strongylocentrotus purpuratus
- Caenorhabditis elegans, a nematode worm
- Zebra Danio, Brachydanio rerio
- African clawed frog, Xenopus laevis
- Oryzias latipes, a medakafish
- Tiger blowfish, Takifugu rubripes
- Tomato, Solanum lycopersicum
- Potato, Solanum tuberosum
- Western Honey bee, Apis mellifera
- Grapevine, Vitis vinifera L.
- Spanish flu
- Platypus, Ornithorhynchus anatinus
See also
- Honey Bee Genome Sequencing Consortium
- Human microbiome project
- International Grape Genome Program
- International HapMap Project
- Joint Genome Institute
- List of sequenced archaeal genomes
- List of sequenced eukaryotic genomes
- List of sequenced prokaryotic genomes
- Model organism
- National Center for Biotechnology Information
External links
- GOLD:Genomes OnLine Database
- Genome Project Database
- BFAB functional annotation resources to benchmark and to develop new annotation methods
- SEED The SEED database, an open source database for genome annotation.
- SEED's RAST Rapid annotation using subsystems technology for the free automatic annotation of complete microbial genomes
- SEED's mg-RAST for metagenome annotation
- IMG The Integrated Microbial Genomes system, for genome analysis by the DOE-JGI.
- IMG/M The Integrated Microbial Genomes system, for metagenome analysis by the DOE-JGI.
- CAMERA Cyberinfrastructure for Metagenomics, data repository and bioinformatics tools for metagenomic research.
- PUMA2 Integrated Grid based system for the comparative analysis of genomes and metabolic reconstructions.
- SUPERFAMILY Database of protein superfamily and family annotations for all completely sequenced organisms
- SIMAP Similarity Matrix of Proteins distributed computing project
- GDB Human Genome Data Base
This article is licensed under the GNU Free Documentation License.
Last updated on Tuesday July 22, 2008 at 14:13:51 PDT (GMT -0700)
Copyright © 2008, Dictionary.com, LLC. All rights reserved.











