Clase 2 - Genoma Humano proyecto conicet.pdf

Dr. MARCELO A. MARTÍ
Director and Adjunct Professor, Dept. of Biological Chemistry, FCEN-UBA; Independent Researcher, IQUIBICEN-CONICET
Scientific Advisor, bitgenia (www.bitgenia.com)
Contact: marti.marcelo@gmail.com
The Human Genome Project

The project began in 1988, when the US Congress decided to co-fund the Department of Energy together with the National Institutes of Health (NIH) so that they could advance the conceptual development of the project. It was officially launched in 1990 with the goal of determining the sequence of 95% of the human genome within 15 years.
[Timeline figure. Labels: Sanger Center (UK); physical map; C. Venter at NIH; Celera Genomics appears]
Celera intended to sell its gene database to pharmaceutical companies. It also chose a less ordered strategy, called "whole genome shotgun".

The joint announcement did not end the controversy!
Why were there two projects?
[Figure label: joint announcement]

The difference in strategies => its bioinformatic counterpart
What advantages and disadvantages does each one have?

Assembly strategies:
HGP GigAssembler
Hierarchical sequencing and assembly

GigAssembler in numbers
Target: 2.7 billion bases, 2.7 Gb (22 + X + Y)
Input: 4.3 billion bases (1.5X)
400,000 initial sequence contigs,
1.1 billion bases of EST sequence,
1.8 billion bases of paired plasmid reads,
0.4 billion bases of BAC end sequences.
Result: 92% of the human genome was obtained

The Human Genome Project at Celera: Data Set
Data: WGS of 27 M reads of ca. 543 bp from Celera
5 donors (randomly pooled)
Mate pairs of 2, 10 and 50 kbp
Estimated genome size 2.9 Gb, coverage 5X
+ GenBank data (shredded into 16 M synthetic reads of 550 bp) + 100 thousand BAC mate pairs
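
As a quick check of the 5X figure above, coverage can be estimated as total sequenced bases divided by genome size. The following is a minimal sketch using only the numbers on this slide; the function name `coverage` is illustrative and not taken from the source.

```python
def coverage(num_reads: int, mean_read_len: float, genome_size: float) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * mean_read_len / genome_size


# Numbers from the slide: 27 M reads of ca. 543 bp, genome size 2.9 Gb.
celera_wgs = coverage(27_000_000, 543, 2.9e9)
print(f"Celera WGS coverage: {celera_wgs:.2f}X")  # prints 5.06X, i.e. about 5X
```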

Two assembly strategies were used
1) Combining all Celera reads with "generated reads" derived from the GenBank data (random WGS)
2) Clustering the reads into chromosomal regions using STS => hierarchical assembly
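
The "generated reads" idea in strategy 1 can be illustrated by shredding a finished sequence into fixed-length synthetic reads. This is only a sketch under assumptions: the 550 bp read length comes from the data set slide, while the `step` parameter (and therefore the tiling depth) and the function name `shred` are hypothetical choices made for illustration.

```python
import random


def shred(sequence: str, read_len: int = 550, step: int = 275) -> list[str]:
    """Cut a sequence into overlapping synthetic reads of read_len bases.

    step is an assumed parameter: step = read_len // 2 tiles each base
    roughly twice.
    """
    if len(sequence) <= read_len:
        return [sequence]
    starts = list(range(0, len(sequence) - read_len + 1, step))
    # Add one final read so the end of the sequence is always covered.
    if starts[-1] + read_len < len(sequence):
        starts.append(len(sequence) - read_len)
    return [sequence[s:s + read_len] for s in starts]


# Toy stand-in for a finished BAC sequence (random bases, fixed seed).
rng = random.Random(0)
toy_bac = "".join(rng.choices("ACGT", k=2000))
synthetic_reads = shred(toy_bac)
print(len(synthetic_reads), "synthetic reads of", len(synthetic_reads[0]), "bp")
```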

Celera Human Genome Assembly
1) Screener: finds and marks all microsatellite repeats with less than a 6-bp element, and screens out all known interspersed repeat elements, including Alu, LINE, and ribosomal DNA. Marked regions get searched for overlaps, whereas identified regions get masked.
2) Overlapper: compares every read against every other read in search of complete end-to-end overlaps of at least 40 bp and with no more than 6% differences in the match.
3) Unitigger: first finds all assemblies of reads that appear to be uncontested with respect to all other reads. The contigs formed from these subassemblies are called unitigs (for uniquely assembled contigs). Formally, these unitigs are the uncontested interval subgraphs of the graph of all overlaps.
Result: correctly assembled subcontigs covering an estimated 73.6% of the human genome.
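
The overlap criterion of step 2 (at least 40 bp, no more than 6% differences) can be sketched as a suffix-prefix comparison. This is a simplified illustration, not Celera's implementation: it counts only substitutions (no insertions or deletions), ignores reverse complements and contained reads, and compares all pairs by brute force. The names `end_overlap` and `all_overlaps` are hypothetical.

```python
import random


def end_overlap(a: str, b: str, min_len: int = 40, max_diff: float = 0.06) -> int:
    """Length of the longest suffix of a that matches a prefix of b with at
    most max_diff mismatches (substitutions only), or 0 if none reaches min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length
    return 0


def all_overlaps(reads: list[str]) -> dict[tuple[int, int], int]:
    """Compare every read against every other read (quadratic, for illustration)."""
    found = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = end_overlap(a, b)
                if n > 0:
                    found[(i, j)] = n
    return found


# Toy example: two reads sampled 70 bp apart from the same random sequence.
rng = random.Random(1)
genome = "".join(rng.choices("ACGT", k=200))
read_a, read_b = genome[0:120], genome[70:190]
print(all_overlaps([read_a, read_b]))
```

In this toy case the two reads share the 50 bases at positions 70 to 119 of the toy sequence, so the expected overlap is read 0 followed by read 1 with length 50.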

4) Scaffolder: uses mate-pair information to link the unitigs together into scaffolds. When two or more mate pairs imply that a given pair of U-unitigs are at a certain distance and orientation with respect to each other, the probability of this being wrong is again roughly 1 in 10^10.
5) Gap filling (rocks, stones) and finally gap “walking.”

The rock phase placed unitigs that were consistently positioned by at least two mate pairs,
The stone-phase placed unitigs that were positioned by a single mate pair and confirmable by
an overlap tiling across the gap containing
The pebble-phase attempted to find the best tiling across gaps using a quality-value based
measure of significance.
Consensus: reads were multiply aligned according to the consensus metric, and consensus base calls were derived.
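
The "two or more mate pairs" rule used by the scaffolder and by the rock phase can be sketched as counting how many mate pairs support each link between unitigs. This is a deliberately reduced illustration: it ignores the distance and orientation checks that the slides mention, and the function name, data layout and read and unitig identifiers are all hypothetical.

```python
from collections import defaultdict


def bundle_links(mate_pairs, read_to_unitig, min_pairs=2):
    """Count mate pairs joining each pair of distinct unitigs and keep the
    links supported by at least min_pairs of them."""
    support = defaultdict(int)
    for read_a, read_b in mate_pairs:
        u = read_to_unitig.get(read_a)
        v = read_to_unitig.get(read_b)
        # Skip unplaced reads and pairs that fall inside a single unitig.
        if u is None or v is None or u == v:
            continue
        support[tuple(sorted((u, v)))] += 1
    return {link: n for link, n in support.items() if n >= min_pairs}


# Toy data: two mate pairs join U1 and U2, only one joins U2 and U3.
read_to_unitig = {"r1": "U1", "r2": "U2", "r3": "U1",
                  "r4": "U2", "r5": "U2", "r6": "U3"}
mate_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
links = bundle_links(mate_pairs, read_to_unitig)
print(links)  # {('U1', 'U2'): 2}
```

With `min_pairs=1` the single-pair link between U2 and U3 would also be kept, which corresponds loosely to the weaker evidence accepted in the stone phase once an overlap tiling confirms it.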

Result:
More than 84% of the genome was covered by scaffolds >100 kbp long, and these totaled 2.297 Gbp of sequence.
There were a total of 93,857 gaps among the 1637 scaffolds >100 kbp.
The average scaffold size was 1.5 Mbp and the average gap size was 2.43 kbp, where the distribution of each was essentially exponential.

2nd Strategy
1) Separate the Celera data into i) reads matching BACs from the HGP and ii) non-matching reads (Celera-unique reads).
2) Assemble the BAC data and the Celera-unique reads (Contig-Scaffold-Rock) separately and independently.
Results: one or two scaffolds for every BAC region, constituting at least 95% of the relevant sequence, and a collection of disjoint Celera-unique scaffolds.

2nd Strategy
3) Tiler: determines the order and overlap tiling of these BAC and Celera-unique scaffolds across the genome, using 50-kbp mate-pair information, BAC-end pairs and sequence tagged site (STS) markers to provide long-range guidance and chromosome separation.
Results: scaffolds spanning 2.906 Gbp and containing 2.654 Gbp of sequence.
More than 90.0% of the genome was covered by scaffolds spanning more than 100 kbp.
There were a total of 105,264 gaps among 1940 scaffolds.
The average scaffold size was 1.4 Mbp, and the average gap size was 2.0 kbp, where each distribution of sizes was exponential.

What, then, is “the” human genome?
Start of chromosome 1:
TTAGGGC......6 billion (6 Gb) “LETTERS”.........GGGTTAGGG
End of chromosome 23
GENE

Results: Human Genome
[Figure: gaps per chromosome, HGP vs. Celera. Blue: gaps; red: gaps > 10 kbp; black numbers: number of gaps]

“Economic” impact of the Human Genome Project
● The federal government invested $3.8 billion in the HGP through its completion in 2003 ($5.6 billion in 2010 $).
● This investment was foundational in generating the economic output of $796 billion, and thus shows a return on investment (ROI) to the U.S. economy of 141 to 1.
● In 2010 alone, the genomics‐enabled industry generated over $3.7 billion in federal taxes and $2.3 billion in U.S. state and local taxes.
Source: Battelle Technology Partnership Practice, report on “Economic Impact of the HGP”, 2011
Fields of application

Dr. Marcelo Marti
marcelo.marti@bitgenia.com
www.bitgenia.com
