Challenges in Bioinformatics (Retos de la Bioinformática)

    Challenges in Bioinformatics (Retos de la Bioinformática) - Presentation Transcript

    1. Bioinformatics: biology by other means. Alberto Labarga, UGR, November 2008
    2. Computational Biology Bioinformatics [Biological Information]
    3. Toward a scientific theory of heredity. Timeline: 1859, 1866, 1870, 1900, 1902
    4. In 1859 Charles Darwin publishes 'The Origin of Species', which proposes that living beings are the result of natural selection and that all creatures have evolved over the generations through small changes.
    5. Mendel's laws, published in 1866 and rediscovered in 1900.
    6. In 1870 the Swiss scientist Friedrich Miescher isolates the components stored in the cell nucleus, composed mainly of proteins and nucleic acids. At the time, the molecule that stored hereditary information was believed to be protein, built from 20 amino acids, whereas nucleic acids had only 4 components.
    7. At the beginning of the 20th century, Phoebus Levene discovered that DNA is a chain of nucleotides, each made up of a sugar (deoxyribose), a phosphate group and a nitrogenous base, which could be one of four types: adenine, thymine, guanine and cytosine.
    8. Walter Sutton, a graduate student in E. B. Wilson’s lab at Columbia University, observed that in meiosis, the cell division that produces sperm and egg cells, each sperm or egg receives only one chromosome of each type. (In other parts of the body, cells have two chromosomes of each type, one inherited from each parent.) The segregation pattern of chromosomes during meiosis matched the segregation patterns of Mendel’s genes.
    9. The discovery of DNA. Timeline: 1928, 1944, 1949, 1952, 1953
    10. 1928, Frederick Griffith, the transforming principle: when he mixed R pneumococci with S pneumococci previously killed by heat, the mice died. Moreover, in the blood of these dead mice Griffith found encapsulated (S) pneumococci.
    11. In 1944 Oswald Avery and his collaborators, who were studying Pneumococcus, the bacterium that causes pneumonia, discovered that bacteria contain nucleic acids and that DNA is the molecule responsible for storing the genes. Further studies with viruses confirmed this theory, even though DNA was still believed to be too simple.
    12. Life can be viewed as a process of storing and transmitting biological information. Chromosomes are the carriers of this information. The information is stored in the form of a molecular code. To understand life we must identify these molecules and decipher the code.
    13. 1949: DNA is duplicated during cell division. Chargaff: A = T and G = C
    14. 1952 - Hershey-Chase Experiment
    15. M.H.F. Wilkins, A.R. Stokes, H.R. Wilson: Molecular Structure of Deoxypentose Nucleic Acids. Nature 171, 738 (1953). R.E. Franklin and R.G. Gosling: Molecular Configuration in Sodium Thymonucleate. Nature 171, 740 (1953).
    16. MOLECULAR STRUCTURE OF NUCLEIC ACIDS. “We wish to propose a structure for the salt of deoxyribose nucleic acid (DNA). This structure has novel features which are of considerable biological interest.” Nature, 25 April 1953
    17. “It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”
    18. The base pairs
    19. In 1955 Ochoa, together with the French-Russian biochemist Marianne Grunberg-Manago, published in the Journal of the American Chemical Society the isolation of an enzyme from the colon bacillus (E. coli) that catalyzes the synthesis of RNA, the intermediary between DNA and proteins. The discoverers named the enzyme "polynucleotide phosphorylase"; it was at first taken to be the cell's RNA-synthesizing enzyme, a role later attributed to RNA polymerase. The discovery of polynucleotide phosphorylase made it possible to prepare synthetic polynucleotides of different base composition, with which Severo Ochoa's group, in parallel with Marshall Nirenberg's group, deciphered the genetic code. Timeline: 1955, 1959, 1962, 1966
    21. When Perutz arrived in Cambridge, the largest molecular structure that had been solved was that of the natural pigment phycocyanin, with 58 atoms. A protein has thousands of atoms. Bernal, his supervisor, had taken some X-ray diffraction images of crystals of a protein, pepsin, but had not managed to interpret them. The subject Perutz chose for his thesis was another protein, hemoglobin, the oxygen carrier that gives our blood its red color. Hemoglobin has no fewer than 11,000 atoms. It took him 23 years.
    23. Over the course of several years, Marshall Nirenberg, Har Gobind Khorana and Severo Ochoa and their colleagues elucidated the genetic code, showing how nucleic acids with their 4-letter alphabet determine the order of the 20 kinds of amino acids in proteins. Messenger RNA is interpreted three letters at a time; a set of three nucleotides forms a "codon" that encodes an amino acid. A three-letter word made of four possible letters can take 64 (4 x 4 x 4) different forms, which is more than enough to encode the 20 amino acids in living beings.
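The 4 x 4 x 4 count on this slide can be checked by simply enumerating every ordered triplet of the four RNA bases. This is a minimal sketch; the variable names are illustrative and not taken from the presentation.

```python
from itertools import product

BASES = "ACGU"  # the four RNA nucleotides

# Every ordered triplet of bases (repetition allowed) is one codon.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(codons))   # 64
print(codons[:4])    # ['AAA', 'AAC', 'AAG', 'AAU']
```

Since 64 codons map onto only 20 amino acids plus stop signals, several codons must encode the same amino acid.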
    24. From DNA to protein
    25. Understanding the mechanisms, creating the tools. Timeline: 1970, 1971, 1975, 1977, 1980
    26. The Central Dogma
    27. Protein Data Bank (PDB): created in 1971 with seven structures
    28. Recombinant DNA is a DNA molecule formed by joining two heterologous molecules, that is, molecules of different origin. It is made using restriction enzymes, which are able to "cut" DNA at specific points. Put very simply, we "cut" a human gene and "paste" it into the DNA of a bacterium; if, for example, it is the gene that governs the production of insulin, putting it into a bacterium "forces" the bacterium to produce insulin.
    30. A precursor RNA may often be matured to mRNAs with alternative structures. An example where alternative splicing has a dramatic consequence is somatic sex determination in the fruit fly Drosophila melanogaster. In this system, the female-specific Sxl protein is a key regulator. It controls a cascade of alternative RNA splicing decisions that finally result in female flies.
    31. Understanding the mechanisms, creating the tools. Timeline: 1981, 1982, 1983, 1985, 1987, 1990
    32. Read out the letters from a DNA sequence: GTGAGGCGCTGC
    33. 1983: The polymerase chain reaction (PCR) is a molecular biology technique, described in 1986 by Kary Mullis, whose aim is to obtain a large number of copies of a particular DNA fragment starting from a minimal amount; in theory, a single copy of the original fragment, or template, is enough.
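Why a single template can be enough follows from the arithmetic of repeated doubling. The sketch below assumes an idealized reaction in which each cycle multiplies the number of copies by (1 + efficiency); the function name and the `efficiency` parameter are illustrative assumptions, not something stated on the slide.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized copy number after a number of PCR cycles.

    efficiency = 1.0 means perfect doubling each cycle (an assumption);
    real reactions fall short of this and eventually plateau.
    """
    return initial_copies * (1.0 + efficiency) ** cycles

# One template molecule, 30 cycles, perfect doubling.
print(pcr_copies(1, 30))  # 1073741824.0
```

Under these assumptions, thirty cycles turn one molecule into roughly a billion copies.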
    34. Total nucleotides (Nov 07): 188,490,792,445. Number of entries (Nov 07): 106,144,026.
    36. The Human Genome Project (HGP) consists of determining the relative positions of all the nucleotides (or base pairs) of the human genome and identifying the 100,000 genes then expected to be present in it. The project, funded with 3 billion dollars, was founded in 1990 by the United States Department of Energy and National Institutes of Health, with a 15-year completion schedule.
    37. “Imagine several copies of a book, each cut into 10 million small pieces in such a way that the pieces overlap. Suppose that 1 million pieces have been lost and that the other 9 million are stained with ink. Recover the original text.”
    38. HUGO: Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct the sequence of the genome.
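The last step described above, assembling overlapping fragments back into one sequence, can be illustrated with a toy greedy merger. This is only a sketch under strong assumptions (error-free reads, no repeats, no read contained inside another); it is not the assembler used by the genome project, and the function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the largest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

# Three overlapping reads taken from the example sequence GTGAGGCGCTGC.
print(greedy_assemble(["GTGAGGC", "AGGCGCT", "GCGCTGC"]))  # ['GTGAGGCGCTGC']
```

Real genomes contain repeats and sequencing errors, which is why a greedy merge like this is not sufficient in practice and why the hierarchical strategy first places clones on a physical map.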
    39. Deciphering the book of life. Timeline: 1990, 1995, 1996, 1997, 1998, 1999, 2001
    40. S.F. Altschul et al. (1990), "Basic Local Alignment Search Tool", J. Molec. Biol., 215(3): 403-10 (15,306 citations). S.F. Altschul et al. (1997), “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic Acids Res., vol. 25, no. 17, pp. 3389-402.
    41. • SSAHA (Ning et al., 2001) • http://www.sanger.ac.uk/Software/analysis/SSAHA/ • SSAHA is an algorithm for very fast matching and alignment of DNA sequences. It stands for Sequence Search and Alignment by Hashing Algorithm. It achieves its fast search speed by converting sequence information into a `hash table' data structure, which can then be searched very rapidly for matches. • BLAT (J. Kent, 2002) • http://genome.ucsc.edu/cgi-bin/hgBlat • BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 20 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more.
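The hash-table idea mentioned for SSAHA can be sketched with a plain k-mer index: every word of length k in the subject sequence is stored with its positions, and query words are then looked up directly. This is a simplified illustration of the general approach, not the actual SSAHA or BLAT implementation; `build_index` and `find_seeds` are hypothetical names.

```python
from collections import defaultdict

def build_index(subject: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the subject to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index

def find_seeds(query: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs for every exact k-mer match."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, spos))
    return hits

index = build_index("GTGAGGCGCTGC", k=4)
print(find_seeds("AGGCGC", index, k=4))  # [(0, 3), (1, 4), (2, 5)]
```

All three hits lie on the same diagonal (subject position minus query position equals 3), which is the kind of signal a seed-based aligner would extend into a full alignment.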
    42. J. Thompson, T. Gibson, D. Higgins (1994), CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment … Nuc. Acids. Res. 22, 4673 - 4680
    43. Flowchart of computation steps in Clustal W (Thompson et al., 1994): (1) pairwise alignment: calculation of distance matrix; (2) creation of unrooted neighbor-joining tree; (3) rooted NJ tree (guide tree) and calculation of sequence weights; (4) progressive alignment following the guide tree.
    44. Other methods: Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797. Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518. Lassmann, T., Sonnhammer, E. (2005) Kalign – an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics, 6, 298. Larkin, M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics, 23(21), 2947–2948.
    45. Tree of Life http://tolweb.org/tree/phylogeny.html http://itol.embl.de/
    46. 1995 • The first complete genome of an organism: Haemophilus influenzae.
    47. 1996 • The yeast genome is completed: approximately 6,000 genes and 14,000,000 base pairs
    49. 1997 • The genome of the bacterium E. coli is sequenced: 4,600 genes, 4.5 million nucleotides.
    50. 1998 • The genome of the worm Caenorhabditis elegans: 18,000 genes, about 100 million nucleotides
    51. 1999 • The complete sequence of chromosome 22 is obtained. The HGP is ahead of schedule. The small number of genes found (about 300) comes as a surprise.
    52. Fire A, Xu S, Montgomery M, Kostas S, Driver S, Mello C (1998). "Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans". Nature 391 (6669): 806–11. doi:10.1038/35888. PMID 9486653
    53. Hamilton A, Baulcombe D (1999). "A species of small antisense RNA in posttranscriptional gene silencing in plants". Science 286 (5441): 950–2. PMID 10542148
    54. Dr Alan Wolffe (1999) • Epigenetics is heritable changes in gene expression that occur without a change in DNA sequence • Such changes cannot be attributed to changes in DNA sequence (mutations) • They are as Irreversible as mutations (or difficult to reverse)
    56. Gene prediction Where are the genes? In humans: ~22,000 genes ~1.5% of human DNA
    57. The GENCODE pipeline: 1. mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the human genome 2. manual curation to resolve conflicting evidence 3. additional computational predictions 4. experimental verification 5. final annotation
    58. Genome annotation - building a pipeline: genome sequence, map repeats, map ESTs, map peptides, genefinding (nc-RNAs, protein-coding genes), functional annotation, release. (August 2008, Bioinformatics tools for Comparative Genomics of Vectors)
    59. Genefinding - ab initio predictions • Use compositional features of the DNA sequence to define coding segments (essentially exons): ORFs, coding bias, splice site consensus sequences, start and stop codons • Each feature is assigned a log likelihood score • Use dynamic programming to find the highest scoring path • Need to be trained using a known set of coding sequences • Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
    60. ab initio prediction (diagram): genome; coding potential; ATG & stop codons; splice sites
    62. ab initio prediction (diagram): genome; coding potential; ATG & stop codons; splice sites; find best prediction
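The simplest of the signals listed for ab initio genefinding, open reading frames bounded by start and stop codons, can be sketched as below. This is a deliberately minimal illustration that scans only the forward strand and ignores coding bias, splice sites and likelihood scoring; it is not how Genefinder, Augustus or the other named tools work internally, and `find_orfs` is a hypothetical helper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) slices of ATG...stop ORFs in the three forward frames.

    end is exclusive and includes the stop codon; ORFs with fewer than
    min_codons codons (stop included) are discarded.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF in this frame
    return orfs

seq = "CCATGAAATGATAGCC"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])  # 2 11 ATGAAATGA
```

A real gene finder would score such candidates together with the other features and then pick the highest-scoring combination, as the slides describe.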
    63. Genefinding - similarity • Use known coding sequence to define coding regions: EST sequences, peptide sequences • Needs to handle fuzzy alignment regions around splice sites • Needs to attempt to find start and stop codons • Examples: EST2Genome, exonerate, genewise • Use 2 or more genomic sequences to predict genes based on conservation of exon sequences • Examples: Twinscan and SLAM
    64. Similarity-based prediction (diagram): genome; align cDNA/peptide; create prediction
    65. Example of a simple HMM Top: model architecture and parameters. Bottom: sequence generation process. green: state transition probabilities, red: emission probabilities. Prob(sequence, path|model) = 6.8e-8. EPFL – Bioinformatics I – 05 Dec 2005
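The quantity Prob(sequence, path | model) is the product of the start probability, the emission probability at each position, and the transition probability between consecutive states. The slide's actual parameters are not in the transcript, so the two-state model below uses made-up numbers purely for illustration and does not reproduce the 6.8e-8 value.

```python
import math

# Hypothetical two-state model: "E" (exon-like) and "I" (intron-like).
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.2, "I": 0.8}}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}

def joint_log_prob(seq: str, path: str) -> float:
    """log P(sequence, path | model) for one given state path."""
    lp = math.log(START[path[0]]) + math.log(EMIT[path[0]][seq[0]])
    for prev, cur, symbol in zip(path, path[1:], seq[1:]):
        lp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][symbol])
    return lp

# 0.5 * 0.25 (start in E, emit G) * 0.1 * 0.4 (move to I, emit A) = 0.005
print(round(math.exp(joint_log_prob("GA", "EI")), 6))  # 0.005
```

Working in log space avoids numerical underflow on long sequences, where the product of many small probabilities quickly becomes tiny.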
    66. Automatic Annotation vs Manual. Automatic annotation: • Quick whole genome analysis (~6 weeks) • Consistent annotation • Use unfinished sequence/shotgun assembly • No polyA sites/signals, pseudogene • Predicts ~70% loci. Manual annotation: • Extremely slow (~3 months Chr) • Need finished seq • Flexible, can deal with inconsistencies in data • Most rules have exception • Consult publications as well as databases
    67. Analysis: EGASP predictions vs manual annotation. [Figure: bar charts of nucleotide, exon, transcript and gene sensitivity (Sn) and specificity (Sp), in percent, for predictions 9_101_1, 20_79_1, 36_46_1 and 41_77_1.]
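The Sn and Sp values compared in this figure can be computed at the nucleotide level from two sets of coding positions. The sketch assumes the definitions customary in gene-prediction evaluation, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP) (so "specificity" here is what other fields call precision); the function name is hypothetical.

```python
def nucleotide_sn_sp(annotated: set[int], predicted: set[int]) -> tuple[float, float]:
    """Nucleotide-level sensitivity and specificity of a prediction.

    annotated: coding positions in the reference (manual) annotation
    predicted: coding positions in the automatic prediction
    """
    tp = len(annotated & predicted)   # coding in both
    fn = len(annotated - predicted)   # annotated but missed
    fp = len(predicted - annotated)   # predicted but not annotated
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

annotated = set(range(100, 200))   # reference exon: positions 100..199
predicted = set(range(150, 250))   # predicted exon: positions 150..249
print(nucleotide_sn_sp(annotated, predicted))  # (0.5, 0.5)
```

Exon-, transcript- and gene-level scores follow the same pattern but count whole features that match exactly instead of individual positions.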
    68. And this is only the beginning. Timeline: 2002, 2004, 2005, 2007, 2010
    70. Genome projects over time (http://www.genomesonline.org):
        Date                          10/3/02  8/28/03   5/07  10/08
        Published complete genomes        104      156    500    874
        Ongoing prokaryotic genomes       316      386   1500   2124
        Ongoing eukaryotic genomes        218      246    700   1004
        Total, 10/08: about 4000
    71. [Figure: bases per run (millions) by date of introduction, 1994-2006. Applied Biosystems instruments (370/377, 3700, 3730, 3730XL): about 1 Mb/day. Roche/454 (GS20; Genome Sequencer FLX): 100 Mb/run. Illumina/Solexa Genetic Analyzer: 2000 Mb/run. Applied Biosystems SOLiD: 3000 Mb/run.]
    72. Although human beings share 99.9 percent of their genetic information, we carry small variations called single nucleotide polymorphisms, or SNPs (pronounced "snip"). An estimated 10 million SNPs exist in the human species, and these differences are thought to be related to greater resistance or susceptibility to diseases and drugs.
    73. VARIATION IN THE HUMAN DNA SEQUENCE. Mutation rate = 10^-8 /site/generation. Number of generations from the common ancestor to present-day humans: 10^4-10^5
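The two numbers on this slide can be combined into a rough expected per-site difference between two present-day sequences. The sketch assumes a neutral model in which both lineages accumulate mutations independently since the common ancestor, hence the factor of 2; that factor and the function name are assumptions added here, not statements from the slide.

```python
MUTATION_RATE = 1e-8  # mutations per site per generation (from the slide)

def expected_divergence(generations: float, lineages: int = 2) -> float:
    """Expected fraction of sites that differ, ignoring repeated hits at a site."""
    return lineages * MUTATION_RATE * generations

for generations in (1e4, 1e5):
    print(f"{generations:.0e} generations: {expected_divergence(generations):.1e} differences per site")
```

Under these assumptions the result ranges from about 2 in 10,000 sites to 2 in 1,000 sites, the same order of magnitude as the 0.1 percent difference implied by humans sharing 99.9 percent of their genetic information.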
    74. ENCyclopedia Of DNA Elements (ENCODE)
    76. Functional genomics
    77. Research areas: • Sequence (DNA/RNA) & phylogeny • Comparative genomics • Protein sequence analysis & evolution • Regulation of gene expression; transcription factors & micro RNAs • Protein structure & function: computational crystallography • Protein families, motifs and domains • Chemical biology • Protein interactions & complexes: modelling and prediction • Pathway analysis • Data integration & literature mining • Image analysis • Systems modelling
    78. Microarray experiment (Schena et al., Science 1995): • Copies of the DNA of the genes of interest are prepared and printed on the chip • The RNA samples of interest (control and sample) are prepared • Reverse transcription; a fluorescent label is added • The samples are hybridized on the microarray • The chip is excited with two different lasers (laser 1, laser 2): the control responds to one of them and the sample to the other • Comparing the two images shows which genes are expressed differently
    79. Microarray analysis: clinical prediction of leukemia type • 2 types – Acute lymphoid (ALL) – Acute myeloid (AML) • Different treatment & outcomes • Predict type before treatment? Golub et al. Science 286:531-537 (1999)
    80. Biomarker discovery. Stages: data management, statistical analysis, annotation, network analysis, selection. Funnel: 30,000 genes, then 1,500 genes, then 150 genes, then 50 elements, then 10 targets.
    81. RT-PCR standard processing procedure (TaqMan assays). Step 1: calculate Ct with SDS and export text file. Step 2: retrieve data and define experiment design. Step 3: biological replicates. Step 4: selection of optimal endogenous controls and calculation of ΔCt. Step 5: differential expression analysis (ΔΔCt). Checkpoints shown alongside the steps: overview of plates and samples; quality control of raw values; discard samples; quality control and ΔCt overview.
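The ΔCt and ΔΔCt steps named in this procedure can be written out as a short calculation. The sketch assumes the common 2^(-ΔΔCt) convention with perfect amplification efficiency; the function names and the example Ct values are invented for illustration.

```python
def delta_ct(ct_target: float, ct_endogenous: float) -> float:
    """Normalize the target gene's Ct against the endogenous control."""
    return ct_target - ct_endogenous

def fold_change(dct_sample: float, dct_calibrator: float) -> float:
    """Relative expression of sample vs calibrator, assuming perfect doubling per cycle."""
    ddct = dct_sample - dct_calibrator
    return 2.0 ** (-ddct)

# Hypothetical Ct values: (target gene, endogenous control)
dct_treated = delta_ct(22.0, 18.0)   # 4.0
dct_control = delta_ct(25.0, 19.0)   # 6.0
print(fold_change(dct_treated, dct_control))  # 4.0
```

A lower Ct means more starting template, so a ΔΔCt of -2 corresponds to a four-fold higher expression in the treated sample.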
    82. Example of Array CGH Technology. Chari et al, Cancer Informatics, 2006, 2, 48-58
    84. ChIP-on-chip. Source: http://www.chiponchip.org/
    85. ChIP (Chromatin ImmunoPrecipitation) • Chromatin immunoprecipitation, or ChIP, refers to a procedure used to determine whether a given protein binds to a specific DNA sequence in vivo • DNA-binding proteins are crosslinked to DNA with formaldehyde in vivo • Isolate the chromatin. Shear DNA along with bound proteins into small fragments • Bind antibodies specific to the DNA-binding protein to isolate the complex by precipitation • Reverse the cross-linking to release the DNA and digest the proteins • Use PCR (Polymerase Chain Reaction) to amplify specific DNA sequences to see if they were precipitated with the antibody
    86. Protein Microarray G. MacBeath and S.L. Schreiber, 2000, Science 289:1760 arrayIT TM Spotting platform and protein microarray
    87. Different Kinds of Protein Arrays: Antibody Array, Antigen Array, Ligand Array. Detection by: SELDI MS, fluorescence, SPR, electrochemical, radioactivity, microcantilever
    88. The Microarray Study Process
    89. Preprocessing
    90. Some Questions: • Which genes have expression levels that are correlated with some external variable? • For a given pathway, which of the genes in our collection are most likely to be involved? • For a diffuse disease, which genes are associated with different outcomes?
    91. Challenges for Data Analysis • Normalization (removing systematic measurement effects) • Variable Selection (Identification of relevant Variables) • Large sample Effects: Type I and Type II errors (False positives / False negatives) • Dimensionality Reduction • Identification of new disease classes • Classification of data into known disease classes
    92. Data Analysis Methods. Dimension Reduction • PCA (Principal Component Analysis) • ICA (Independent Component Analysis) • Multidimensional Scaling. Unsupervised Learning • K-Means / K-Medoid • Hierarchical Clustering Algorithms. Supervised Learning • Linear Discriminant Analysis • Maximum Likelihood Discrimination • Nearest Neighbor Methods • Decision Trees • Random Forests
    93. Matrix factorization
    94. Popular Classification Methods • Decision Trees/Rules – Find smallest gene sets, but not robust – poor performance • Neural Nets - work well for reduced number of genes • K-nearest neighbor – good results for small number of genes, but no model • Naïve Bayes – simple, robust, but ignores gene interactions • Support Vector Machines (SVM) – Good accuracy, does own gene selection, but hard to understand • Specialized methods, D/S/A (Dudoit), …
    95. Support Vector Machine (SVM) • Main idea: Select hyperplane that is more likely to generalize on a future datum
    96. Best Practices • Capture the complete process, from raw data to final results • Gene (feature) selection inside cross-validation • Randomization testing • Robust classification algorithms – Simple methods give good results – Advanced methods can be better • Wrapper approach for best gene subset selection • Use bagging to improve accuracy • Remove/relabel mislabeled or poorly differentiated samples
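The point about doing gene selection inside cross-validation can be made concrete: in each fold, genes are ranked using only the training samples, so the held-out sample never influences which genes are chosen. The sketch below uses leave-one-out cross-validation, a simple difference-of-means ranking and a nearest-centroid classifier; all three choices, the function names and the toy data are illustrative assumptions, not methods prescribed by the slide.

```python
from statistics import mean

def select_genes(X, y, n_genes):
    """Rank genes by absolute difference between the two class means; return top indices."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append((abs(mean(a) - mean(b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]

def nearest_centroid(X, y, genes, sample):
    """Predict the label whose class centroid (on the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [row for row, l in zip(X, y) if l == label]
        centroid = [mean(r[g] for r in rows) for g in genes]
        dist = sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(X, y, n_genes):
    """Leave-one-out CV with gene selection repeated inside every fold."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_train, y_train, n_genes)  # training fold only
        correct += nearest_centroid(X_train, y_train, genes, X[i]) == y[i]
    return correct / len(X)

# Toy expression matrix: rows are samples, columns are genes; gene 0 separates the classes.
X = [[1.0, 5.0], [1.2, 4.0], [0.9, 6.0], [3.0, 5.5], [3.1, 4.5], [2.9, 5.0]]
y = [0, 0, 0, 1, 1, 1]
print(loocv_accuracy(X, y, n_genes=1))  # 1.0
```

Selecting genes once on the full data set and only then cross-validating would leak information from the test samples and give an optimistically biased accuracy.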
    97. Enrichment Analysis • What are major enriched GO terms? • What are the highly active pathways? • What are the frequently interacting proteins? • What are the known disease associations? Alistair Chalk, 2008
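One common way to ask whether a GO term or pathway is enriched in a gene list is a one-sided hypergeometric test. This is a minimal sketch of that test using only the standard library; the slide does not say which statistic was used, and the function name and example numbers are assumptions.

```python
from math import comb

def enrichment_p_value(total_genes: int, genes_in_term: int,
                       selected: int, selected_in_term: int) -> float:
    """P(X >= selected_in_term) when `selected` genes are drawn at random
    from `total_genes`, of which `genes_in_term` carry the annotation."""
    upper = min(genes_in_term, selected)
    favourable = sum(
        comb(genes_in_term, k) * comb(total_genes - genes_in_term, selected - k)
        for k in range(selected_in_term, upper + 1)
    )
    return favourable / comb(total_genes, selected)

# 10 genes in total, 4 annotated with the term; 3 genes selected, 2 of them annotated.
print(round(enrichment_p_value(10, 4, 3, 2), 4))  # 0.3333
```

When many terms are tested at once, the resulting p-values need a multiple-testing correction before any term is called enriched.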
    98. Meta-analysis example: “Creation and implications of a phenome-genome network” Butte and Kohane. Nat Biotech. 2006
    99. Meta-analysis example: “Creation and implications of a phenome-genome network” Butte and Kohane. Nat Biotech. 2006 • Clustered experiments based on mapping concepts found in sample annotations to UMLS meta-thesaurus. • Relationships found between phenotype (e.g., aging), disease (e.g., leukemia), environmental (e.g., injury) and experimental (e.g., muscle cells) factors and genes with differential expression. • “the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.”
    100. Systems biology
    101. PPI ANNOTATION AND DATABASES
        Database  Reference                 URL
        MINT      (Zanoni et al., 2002)     http://mint.bio.uniroma2.it/mint
        IntAct    (Hermjakob et al., 2004)  http://www.ebi.ac.uk/intact
        DIP       (Xenarios et al., 2002)   http://dip.doe-mbi.ucla.edu/
        HPID      (Han et al., 2004)        http://www.hpid.org
        HPRD      (Peri et al., 2004)       http://www.hprd.org/
        • IMEx agreement to share curation efforts • Proteomics Standards Initiative (PSI) recommendation • Molecular Interaction (MI) Ontology • Large scale experiments • Literature curation
    102. Complex networks • Many systems can be represented as networks (graphs) – Nodes: individual component (proteins) – Edges: relationships (interactions) • They share common properties – Scale-free – Hierarchical – Clustering • Some properties may be intrinsic and can be understood better when putting into the context of evolution
    103. Detecting Hierarchical Organization
    104. Summary: Network Measures • Degree ki The number of edges involving node i • Degree distribution P(k) The probability (frequency) of nodes of degree k • Mean path length The avg. shortest path between all node pairs • Network Diameter – i.e. the longest shortest path • Clustering Coefficient – A high CC is found for modules
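The measures summarized on this slide can be computed directly on a small undirected graph stored as an adjacency dictionary. The sketch assumes a connected, unweighted network; the toy graph and function names are invented for illustration.

```python
from collections import deque
from itertools import combinations

def degree(graph, node):
    """Number of edges involving the node."""
    return len(graph[node])

def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours that are themselves connected."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2 * links / (k * (k - 1))

def shortest_path_lengths(graph, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def mean_path_length_and_diameter(graph):
    """Average and longest shortest path over all ordered node pairs."""
    lengths = [d for s in graph
               for t, d in shortest_path_lengths(graph, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)

# Toy interaction network: a triangle A-B-C with D attached to C.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(degree(graph, "C"))                        # 3
print(clustering_coefficient(graph, "A"))        # 1.0
print(mean_path_length_and_diameter(graph)[1])   # 2
```

The degree distribution P(k) follows by counting how many nodes have each degree and dividing by the number of nodes.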
    105. Mapping the phenotypic data to the network •Systematic phenotyping of 1615 gene knockout strains in yeast •Evaluation of growth of each strain in the presence of MMS (and other DNA damaging agents) •Screening against a network of 12,232 protein interactions Begley TJ, Rosenbach AS, Ideker T, Samson LD. Damage recovery pathways in Saccharomyces cerevisiae revealed by genomic phenotyping and interactome mapping. Mol Cancer Res. 2002 Dec;1(2):103-12.
    106. The Role of Proteomics • The existence of an ORF does not imply the existence of a functional gene. • Limitations of comparative genomics. • mRNA levels may not correlate with protein levels. • Protein modifications: post-transcriptional modifications, isoforms, post-translational modifications, mutants. • Issues of proteolysis, sequestration, etc. relevant only at the protein level. • Protein complex composition, protein-protein interactions, structures.
    107. Structural proteomics • Folding • Structure and function • Protein structure prediction • Secondary structure • Tertiary structure • Function • Post-translational modification • Prot.-Prot. Interaction -- Docking algorithm • Molecular dynamics/Monte Carlo
    108. What kinds of methods are around? 5 main levels of protein structure prediction: 1. Extensive Sequence Search 2. Threading and 1D-3D profiles 3. Ab initio prediction of protein structure 4. Comparative Modelling 5. Docking (domain interaction prediction)
    109. Prediction of Protein Structures • Examples – a few good examples [Figure: four pairs of actual and predicted structures]
    110. MODPIPE: Large-Scale Comparative Protein Structure Modeling. START • Get profile for sequence (NR) with PSI-BLAST • For each target sequence: scan sequence profile against representative PDB chains; scan PDB chain profiles against sequence • Select templates using permissive E-value cutoff • For each template structure: expand match to cover complete domains; align matched parts of sequence and structure; build model for target segment by satisfaction of spatial restraints (MODELLER); evaluate model • END. R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998. N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, A. Šali. 3/25/03
    111. Structural Proteomics: The Motivation. [Figure: number of known sequences (axis 0 to 2,000,000) and number of known structures (axis 0 to 200,000), 1980-2005.]
    112. The hierarchies of protein structure
    113. Docking Programs • Dock (UCSF) • Autodock (Scripps) • Glide (Schrödinger) • ICM (Molsoft) • FRED (OpenEye) • Gold, FlexX, etc.
    114. Cell cycle network from KEGG
    115. Graphical Notation: a necessity for the conceptual representation of biopathways. Qualitative; mechanistic; various degrees of detail; mixed level of presentation. Aladjem et al., Science STKE pe8 (2004); Thiery & Sleeman, Nat. Rev. Mol. Cell. Biol. 7:131 (2006)
    116. Strategies: simulate or analyse? (or rather what to do first) • convert diagram into a quantitative model • simulate model behavior numerically • obtain qualitative understanding through numerical results and model reduction • build and simulate a reduced model • identify “elementary modes” • qualitatively analyze network topology, stability, etc
    117. Space of modeling methods: continuous ↔ discrete (e.g. stochsim, Boolean networks)
    118. Continuum of modeling approaches Top-down Bottom-up
    119. Frazier et al. (2003) Science 11 April Vol 300:290-293
    120. Data integration
    121. Nucleic Acids Research article lists 1078 public databases Nucleic Acids Research, 2008, Vol. 36, Database issue http://nar.oxfordjournals.org/cgi/reprint/36/suppl_1/D2
    122. Growth in Available Bioinformatics Databases
    123. Too much unintegrated data • Data sources incompatible • No (or few) standard naming convention • No common interface (varying tools for browsing, querying and visualizing data)
    124. Large players: – Large experiments or large research groups/labs, possibly distributed – Large service provider institutes – Tightly coupled provider-consumer of resources – Commonly resource providers – Some or lots of access to sys admin. Small players: – Small, isolated, independent groups/individuals – Loosely coupled provider-consumer of resources – Commonly resource consumers – Boutique suppliers – Poor access to systems admins
    125. Challenges: Names and Identity. Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor. Other names: • WSL-1 protein • Apoptosis-mediating receptor DR3 • Apoptosis-mediating receptor TRAMP • Death domain receptor 3 • WSL protein • Apoptosis-inducing receptor AIR • Apo-3 • Lymphocyte-associated receptor of death • LARD • GENE: Name=TNFRSF25. Annotation history: Q92983 P78515 O00275 Q93036 O00276 Q93037 O00277 Q99722 O00278 Q99830 O00279 Q99831 O00280 Q9BY86 O14865 Q9UME0 O14866 Q9UME1 P78507 Q9UME5. GUIDs; Life Science Identifier? Normalisation. http://www.expasy.org/uniprot/Q93038
    126. Why must we support standards? • Unambiguous representation, description and communication – Final results and metadata • Interoperability – Data management and analysis • Integration of OMICS into systems biology
    127. What to standardize? • CONTENT: Minimal/Core Information to be reported • MIBBI (http://www.mibbi.org) • SEMANTIC: Terminology Used -> Ontologies • OBI (http://obi-ontology.org) • SYNTAX: Data Model, Data Exchange • FuGE (http://fuge.sourceforge.net/)
    128. MIBBI: Standard Content. Promoting Coherent Minimum Reporting Requirements for Biological and Biomedical Investigations: The MIBBI Project, Taylor et al., Nature Biotech.
    129. Link Integration: Integration Lite Application interface User interface Application Ontology Authority Identity Authority 143
    130. Warehouse Wrappers Wrappers Data Access and Query User interface Application Unified Wrappers model • Copy the data sets, clean and massage data into shape • Combine them into a (different) pre-determined model before query • ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART • Often called “Knowledge bases”  144
    131. View integration Wrappers Wrappers Data Access and Query User interface Application Unified Wrappers model • Data at Source; Virtual integrating database view • Global as View / Local as View mappings between models • Map from model to databases dynamically so always fresh • TAMBIS, Information Integrator, K4, ComparaGrid, UTOPIA, caCORE 145
    132. Specialist Integrating Application Wrappers Wrappers User interface Application Wrappers E.g. Ensembl, UTOPIA • Very popular. Known to be one application. 146
    133. Workflows Workflow Engine User interface Application Wrapper • Data flow protocol. Automated data chaining. • General technique for describing and enacting a process • Describes what you want to do, not how you want to do it • Various degrees of data type compliance anticipated 147
    134. Mash-Up [Diagram: user interface, mash-up application, data marshalling objects, protocols] • Content syndication and feeds • Emphasis on the user creating a specific integration by mapping • Just in time, just enough design • On-demand integration
    135. Composite applications
    136. Can the Semantic Web help? [Diagram: user interface, application, access and query, wrappers, semantic enrichment, model flattening, mapping, transparency] • Slight problem: we have no first-class metadata migration and management infrastructure, where metadata is outside the application and in the middleware, and where we can handle progressive curation
    137. Service Oriented Architecture [Diagram: advanced search, retrieve data, submit data, submission, curation, web services (ws), dataflow, workflow]
    138. Distributed Annotation System
    139. Distributed Annotation System
    140. An Integrative Analysis Example [Figure: linked panels for text mining, relational data mining, decision tree model, metabonomic profile, serial/spectrum data, cluster statistics, multidimensional data, chemical structure, sequence data and pathway visualization]
    141. From experiments to scientific publications • 1. Experiments: planning and carrying out experiments (lab work) • 2. Results: processing and interpretation of obtained results • 3. Scientific peer-reviewed articles: 'relevant' results are published in scientific journals
    142. PubMed/Medline database at NCBI • Developed at the National Center for Biotechnology Information (NCBI) • The core 'Textome' • Repository of citation entries of scientific articles • PubMed titles and abstracts are the primary data source for Bio-NLP • ~450,000 new abstracts per year • > 4,800 biomedical journals • ENTREZ search engine
    143. Data in scientific articles • Scientific journals: free text, tables, figures • Free text: title, abstract, keywords, text body, references • Journal-specific information: format, paper structure (sections), article type • Biomedical literature characteristics: – Heavy use of domain-specific terminology (12% biochemistry-related technical terms) – Polysemic words (word sense disambiguation) – Most words with low frequency (data sparseness) – New names and terms created – Typographical variants – Different writing styles (native languages)
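Data sparseness can be made concrete by counting how many distinct words occur only once in a text. The sketch below is a rough illustration under simple assumptions: the tokenizer is a single regular expression, the example sentence is invented, and `hapax_fraction` is a hypothetical name.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lower-case the text and split it into alphanumeric or hyphenated tokens."""
    return re.findall(r"[a-z0-9-]+", text.lower())


def hapax_fraction(text: str) -> float:
    """Fraction of distinct words that occur exactly once in the text."""
    counts = Counter(tokenize(text))
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Invented example sentence, not taken from any abstract.
SENTENCE = (
    "the receptor binds the ligand and the kinase "
    "phosphorylates the receptor"
)
fraction = hapax_fraction(SENTENCE)
```

In this sentence five of the seven distinct words appear only once. On real abstracts the tokenizer would also need to handle the typographical variants mentioned on the slide.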
    144. BioCreative
    145. BioCreative
    146. BioCreative results • TP: prediction evaluated as protein and GO terms correct • Precision: TP / total number of evaluated submissions • 1: Chiang et al. • 2: Couto et al. • 3: Ehrler et al. • 4: Ray et al. • 5: Rice et al. • 6: Verspoor et al.
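The precision measure defined on this slide is a single division, sketched below. The counts in the usage line are invented for illustration and are not BioCreative results; `precision` is a hypothetical helper name.

```python
def precision(true_positives: int, evaluated: int) -> float:
    """Precision = TP / total number of evaluated submissions."""
    if evaluated <= 0:
        raise ValueError("at least one evaluated submission is required")
    if not 0 <= true_positives <= evaluated:
        raise ValueError("TP must lie between 0 and the number evaluated")
    return true_positives / evaluated


# Invented counts, for illustration only.
example = precision(true_positives=30, evaluated=120)
```

Precision alone says nothing about how many correct annotations a system missed, so a system that submits very few predictions can score well on it.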
    147. • Data Integration: Standards, DBs • Knowledge Discovery: Algorithms, Informatics, Machine Learning • Integrate knowledge: Text mining, Ontologies • Modelling: Pathways, Circuits, Abstraction [Diagram labels: Infrastructure, Research Support]
    148. The challenges of biology in the next 50 years • List all the molecular components that make up an organism: – Genes, proteins, and other functional elements • Understand the function of each component • Understand how they interact • Study how function has evolved • Find genetic defects that cause disease • Design drugs and therapies rationally • Sequence the genome of every individual and use it in personalized medicine • Bioinformatics is an essential component in achieving all of these goals

    Talk given at the University of Granada