Bioinformatics and supercomputing
M. Gonzalo Claros Díaz
Dept. of Molecular Biology and Biochemistry
Plataforma Andaluza de Bioin...
http://www.scbi.uma.es
Let us begin with some words that are not mine
http://everydaylife.globalpost.com/medical-schools-bi...

Bioinformatics is not applied only to humans
http://mscbioinformatics.uab.cat/base/base3.asp...

Bioinformatics is ESSENTIAL nowadays
http://bioinformatics.biol.ntnu.edu.tw/sher/Teachi...

How did bioinformatics arise?
Margaret Oakley Dayhoff
Order had to be brought to…
65 protei...

After one database comes another
1975
12 structures!!!

Calling them databases is almost an insult to a computer scientist
HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2
C...

1977: the turning point
Proc. Natl. Acad. Sci. USA
Vol. 74, No. 12, pp. 5463-5467, December 19...

And one month "earlier", the first bioinformatics suite
Volume 4 Number 11 November 1977 Nucleic Acids R...

The Staden Package is in the public domain today
http://staden.sourceforge.net

And sequence databases emerge
1983
1980: 563 sequences
1988

They were also "text" databases

We begin to need comparison algorithms
J. Mol. Biol. (1981) 147, 195-197
Identification ...

More sequences accumulate, so more efficient comparisons are needed
Improving the algor...

The cost of sequencing falls, thanks to the engineers

Lower cost: more sequencing, more data and more databases

Databases are ESSENTIAL to bioinformaticians today

But Moore's law is unforgiving
Information accumulates faster than the increase in the speed...

The "info" cannot keep pace with the "bio"

If resources do not grow, more people will have to be assigned to analysing the data

Bioinformaticians are needed despite (thanks to?) the crisis
http://www.indeed.com/jobtrends?...

In short, there are jobs for bioinformaticians

New requests for bioinformaticians appear every day
30 Dec 2013

And in Spain and Europe too
http://www.eurosciencejobs.com/jobs/bioinformatics

If what you want is to earn money, that too
You can advertise here from 50 euros
Contact: 63360120...

They are well paid, at least abroad
Linux and OSX pay better than Windows
http://www.r-b...

Did you know that, after databases, R is the most widely used tool in bioinformatics?
The most widely used are the...

And that there are job offers for bioinformaticians who know R?
http://www.r-bloggers.com/r-jobs-march...

This world is within your reach at the UMA
http://www.uma.es/grado-en-ingenieria-de-la-salud

There are always specialisation courses

And books! Like Teruel, they exist too

A bioinformatician can work in many ways
• As an engineer
• Easing difficult tasks...

The competencies of the bioinformatician are being defined
Message from ISCB
Bioinformatics Curricul...

The engineer, the scientist and the user
http://www.ploscompbiol.org/article/info:doi%2F10.1371%...

The profile of an Australian bioinformatician
http://www.ebi.edu.au/news/braembl-community-survey-re...

Bioinformaticians have no mobility problems

When do bioinformaticians rest?
NCBI is the most heavily site in biomedicine. Why?
300,00...

There will always be things a computer scientist does better
10-04-13

We now know what is expected of a bioinformatician
Let us now look at some examples as real as life itself

Workflows that automate repetitive tasks
Data mining, Microarray
"Wet" side, "Dry" side...

Two examples "made in Málaga"
SeqTrim, FullLengtherNEXT, Raw sequences, Annotation with Maker, SeqTr...

Why were these tools needed?
Chart labels: OLC, De Bruijn, OLC+De Bruij...

There is bioinformatics for transcriptomics at the UMA
DATABASE Open Access
EuroPineDB: a high-covera...

First, the data are collected
homology was found, respectively, confirming that most assembled...

Next, the workflow is designed
Unmapped contigs, Full-LengtherNext v3, Non-coding #1, Short r...

Workflows are increasingly important
Genes 2012, 3, 545-575; doi:10.3390/genes3030545

Then it is run, parallelised as much as possible
Fewer transcripts for genes encoding enzymes o...

Now we design a database
With tables for the annotations and meta-information that we find...

… and we give it a web interface for the scientific community
gene library and pine species, and can...

Now we can discover biological information
A total of 5974 putative simple-sequence repeat (S...

Did I not just mention "parallelisation"?
Hindawi Publishing Corporation
Computational Biology Jo...

SCBI_MapReduce: for parallelising and distributing
Efficient
Robust
Improves Blast performance

So bioinformatics is not at odds with supercomputing
Red Española de Supercomputación
Picasso...

Picasso: a data centre for supercomputing and bioinformatics
Hard disks, FAT nodes, Computing nodes, THIN no...

Why data-centre infrastructures are good
• Providing solid infrastructure for software and ...

How is it accessed?
Web tools, Command line, Web interface, Web server, Virtual machines, Database, Home, ...

Bioinformatics is not limited to sequences and databases

Applications of bioinformatics and supercomputing

Discovering new drugs "used to be" extremely expensive
Each compound has to be synthesised and checked...

It earned the 2013 Nobel Prize in Chemistry
For the development of computational models for...

Systems biology reveals the keys
Cellular regulation grows more complicated as...

allow the formation of supramolecular activator or
inhibitory complexes, depending on their compone...

Breast cancer biomarker genes inferred through bioinformatic analysis

We do this at the UMA with breast cancer miRNAs
A microRNA Signature Associated with Early...

Bioinformatics explains some observations
Molecular Evidence for the Inverse Comor...

It can be seen clearly
(Figure 2, Figure S2, Table S3). The inverse relationship
between the levels ...

The most striking short-term application
• Some antidepressant drugs could be used as...

And it did not take long: 31-3-2014

The genome does not let us predict the organism
?

We are beginning to infer appearance from the genome
Modeling 3D Facial Shape from DNA
Peter Claes
...

Bioinformatics, COPD and publications
Chen and Wang Journal of Clinical Bioinformatics...

Bioinformaticians and publications
Microarrays, Databases, Microarrays, Data mining, ...

An example of collaboration and integration: olive allergy
The allergenic proteins are in the...

Building a pollen library
Research group of Juan de Dios ...

1. Sequencing in the laboratory
Picasso
Edificio de Bioinnovación
Se u...

2. Assembly: from sequence to transcriptome
The FAT NODES are used (m...

3. Annotation and biological enrichment
Already known allergens appear...

Much remains to be discovered

Our small interdisciplinary group
Think, design, Coding, Testing
Almudena, Darío, Juan, No...

Bioinformatics and supercomputing. Reasons to become a bioinformatician at the UMA

What does bioinformatics consist of? How can I specialise? Where? Supercomputing capacity at the UMA. Recent bioinformatics achievements related to medicine and to science in general, many of them by UMA teams.

Published in: Health & Medicine, Technology

  1. Bioinformatics and supercomputing. M. Gonzalo Claros Díaz, Dept. of Molecular Biology and Biochemistry, Plataforma Andaluza de Bioinformática, Centro de Bioinnovación. http://about.me/mgclaros/ @MGClaros
  2. Let us begin with some words that are not mine. http://everydaylife.globalpost.com/medical-schools-bioinformatics-37686.html Bioinformatics is a new and very attractive scientific field at the interface between computer science, biology and mathematics, aimed at discovering new information about diseases and the human body. Bioinformatics uses biology and computer science to discover how living beings and their diseases work.
  4. Bioinformatics is not applied only to humans. http://mscbioinformatics.uab.cat/base/base3.asp?sitio=msbioinformatics But I understand that, for a Health Engineer, interest in humans comes before everything else.
  5. Bioinformatics is ESSENTIAL nowadays. http://bioinformatics.biol.ntnu.edu.tw/sher/Teaching.html
  6. How did bioinformatics arise? Margaret Oakley Dayhoff. Order had to be brought to… 65 proteins!!!
  7. After one database comes another. 1975. 12 structures!!!
  8. Calling them databases is almost an insult to a computer scientist.
HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3
COMPND 2 ATF/CREB SITE DNA 1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5
AUTHOR T.J.RICHMOND 1DGC 6
REVDAT 1 22-JUN-94 1DGC 0 1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10
JRNL TITL 3 FLEXIBILITY 1DGC 11
JRNL REF J.MOL.BIOL. V. 233 139 1993 1DGC 12
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070 1DGC 13
REMARK 1 1DGC 14
REMARK 2 1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16
REMARK 3 1DGC 17
REMARK 3 REFINEMENT. 1DGC 18
REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386 1DGC 31
REMARK 4 1DGC 32
REMARK 4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO 1DGC 33
REMARK 4 ACID BIOSYNTHETIC ENZYMES. 1DGC 34
REMARK 5 1DGC 35
REMARK 5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE 1DGC 36
REMARK 5 281 AMINO ACIDS OF INTACT GCN4. 1DGC 37
REMARK 6 1DGC 38
REMARK 6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION. 1DGC 39
REMARK 7 1DGC 40
REMARK 7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 - 1DGC 41
REMARK 7 226 ARE NOT WELL ORDERED. 1DGC 42
REMARK 8 1DGC 43
REMARK 8 RESIDUE NUMBERING OF NUCLEOTIDES: 1DGC 44
REMARK 8 5' T G G A G A T G A C G T C A T C T C C 1DGC 45
REMARK 8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9 1DGC 46
REMARK 9 1DGC 47
REMARK 9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA 1DGC 48
REMARK 9 COMPLEX PER ASYMMETRIC UNIT. 1DGC 49
REMARK 10 1DGC 50
REMARK 10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF 1DGC 51
REMARK 10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD 1DGC 52
REMARK 10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY 1DGC 53
REMARK 10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND 1DGC 54
REMARK 10 TRANSLATION VECTOR TO THE COORDINATES X Y Z: 1DGC 55
REMARK 10 1DGC 56
REMARK 10 0 -1 0 X 117.32 X SYMM 1DGC 57
REMARK 10 -1 0 0 Y + 117.32 = Y SYMM 1DGC 58
REMARK 10 0 0 -1 Z 43.33 Z SYMM 1DGC 59
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65
SEQRES 2 B 19 A T C T C C 1DGC 66
HELIX 1 A ALA A 228 LYS A 276 1 1DGC 67
CRYST1 58.660 58.660 86.660 90.00 90.00 90.00 P 41 21 2 8 1DGC 68
ORIGX1 1.000000 0.000000 0.000000 0.00000 1DGC 69
ORIGX2 0.000000 1.000000 0.000000 0.00000 1DGC 70
ORIGX3 0.000000 0.000000 1.000000 0.00000 1DGC 71
SCALE1 0.017047 0.000000 0.000000 0.00000 1DGC 72
SCALE2 0.000000 0.017047 0.000000 0.00000 1DGC 73
SCALE3 0.000000 0.000000 0.011539 0.00000 1DGC 74
ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75
ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76
ATOM 842 C5 C B 9 57.692 100.286 22.744 1.00 29.82 1DGC 916
ATOM 843 C6 C B 9 58.128 100.193 21.465 1.00 30.63 1DGC 917
TER 844 C B 9 1DGC 918
MASTER 46 0 0 1 0 0 0 6 842 2 0 7 1DGC 919
END 1DGC 920
FORTRAN was king.
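To make the "text database" point concrete, here is a minimal Python sketch that reads one of the ATOM records shown on this slide. It assumes the whitespace-separated layout as it appears in this transcript; real PDB files use fixed column positions, so splitting on whitespace is a simplification. The `Atom` type and `parse_atom_line` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    """One ATOM record, reduced to the fields used below (illustrative type)."""
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse an ATOM record by splitting on whitespace.

    Assumption: fields are separated by spaces, as in the slide transcript.
    Real PDB files are fixed-column and should be sliced by position instead.
    """
    fields = line.split()
    if not fields or fields[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        residue=fields[3],
        chain=fields[4],
        residue_number=int(fields[5]),
        x=float(fields[6]),
        y=float(fields[7]),
        z=float(fields[8]),
    )


if __name__ == "__main__":
    record = "ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75"
    atom = parse_atom_line(record)
    print(atom.residue, atom.chain, atom.residue_number, atom.x, atom.y, atom.z)
```

Every query against such a file means re-reading and re-parsing text, which is presumably what the slide title is teasing.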
  9. 1977: the turning point. Proc. Natl. Acad. Sci. USA, Vol. 74, No. 12, pp. 5463-5467, December 1977, Biochemistry. DNA sequencing with chain-terminating inhibitors (DNA polymerase/nucleotide sequences/bacteriophage φX174). F. SANGER, S. NICKLEN, AND A. R. COULSON, Medical Research Council Laboratory of Molecular Biology, Cambridge CB2 2QH, England. Contributed by F. Sanger, October 3, 1977.
ABSTRACT A new method for determining nucleotide sequences in DNA is described. It is similar to the "plus and minus" method [Sanger, F. & Coulson, A. R. (1975) J. Mol. Biol. 94, 441-448] but makes use of the 2',3'-dideoxy and arabinonucleoside analogues of the normal deoxynucleoside triphosphates, which act as specific chain-terminating inhibitors of DNA polymerase. The technique has been applied to the DNA of bacteriophage φX174 and is more rapid and more accurate than either the plus or the minus method.
The "plus and minus" method (1) is a relatively rapid and simple technique that has made possible the determination of the sequence of the genome of bacteriophage φX174 (2). It depends on the use of DNA polymerase to transcribe specific regions of the DNA under controlled conditions. Although the method is considerably more rapid and simple than other available techniques, neither the "plus" nor the "minus" method is completely accurate, and in order to establish a sequence both must be used together, and sometimes confirmatory data are necessary. W. M. Barnes (J. Mol. Biol., in press) has recently developed a third method, involving ribo-substitution, which has certain advantages over the plus and minus method, but this has not yet been extensively exploited. Another rapid and simple method that depends on specific chemical degradation of the DNA has recently been described by Maxam and Gilbert (3), and this has also been used exten-
a stereoisomer of ribose in whic ented in trans position with res The arabinosyl (ara) nucleotide hibitors of Escherichia coli DN comparable to ddT (4), although 3' araC can be further extende polymerases (5). In order to obta from which an extensive sequen to have a ratio of terminating tr phate such that only partial in occurs. For the dideoxy derivati for the arabinosyl derivatives ab METH Preparation of the Triphosp ration of ddTTP has been descr now commercially available. McCarthy et al. (8). We essenti and used the methods of Tener to convert it to the triphosphate DEAE-Sephadex, using a 0.1-1. carbonate at pH 8.4. The prepa has not been described previou same method as that used for d
  10. And one month "earlier", the first bioinformatics suite. Volume 4 Number 11 November 1977 Nucleic Acids Research. Sequence data handling by computer. R. Staden, MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK. Received 10 October 1977. ABSTRACT The speed of the new DNA sequencing techniques has created a need for computer programs to handle the data produced. This paper describes simple programs designed specifically for use by people with little or no computer experience. The programs are for use on small computers and provide facilities for storage, editing and analysis of both DNA and amino acid sequences. A magnetic tape containing these programs is available on request. INTRODUCTION The development of rapid DNA sequencing techniques [1,2] now enables large amounts of sequence data to be accumulated in a short period of time. The complete sequence of bacteriophage φX174 has recently been published [3] and the sequences of other, similarly sized molecules are near to completion. During the sequencing of φX174 DNA it became necessary to develop computer programs to process the large amounts of data produced. Some of the programs are specific to DNA sequences but many are equally applicable to amino acid sequences. These programs are designed for small computers in common use, such as the PDP 11/45, and are simplified so that they can be used by people with little or no experience of computers. This paper describes some of the programs currently being used in this laboratory. They provide facilities for (1) storage and editing of a sequence, (2) producing copies of the sequence in various forms, e.g. in single or double stranded form, (3) translation into the amino acid sequence coded by the DNA
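Two of the facilities listed in the abstract, producing the other strand and translating DNA into protein, fit in a few lines today. The following Python sketch is only an illustration and is not Staden's code. It assumes the standard genetic code and an uppercase A/C/G/T alphabet; the names `reverse_complement` and `translate` are hypothetical.

```python
# Standard genetic code, with codons enumerated over the bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA into one-letter amino acids from the given frame (0, 1 or 2).

    Stop codons are written as '*'; codons with unexpected characters become 'X'.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    sequence = "ATGGCCTAA"
    print(translate(sequence))
    print(reverse_complement(sequence))
```

Combining `reverse_complement` with the three values of `frame` gives all six reading frames of a double-stranded sequence.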
  11. The Staden Package is in the public domain today. http://staden.sourceforge.net
  12. And sequence databases emerge. 1983. 1980: 563 sequences. 1988.
  13. They were also "text" databases.
  14. We begin to need comparison algorithms.
J. Mol. Biol. (1981) 147, 195-197. Identification of Common Molecular Subsequences. The identification of maximally homologous subsequences among sets of long sequences is an important problem in molecular sequence analysis. The problem is straightforward only if one restricts consideration to contiguous subsequences (segments) containing no internal deletions or insertions. The more general problem has its solution in an extension of sequence metrics (Sellers 1974; Waterman et al., 1976) developed to measure the minimum number of "events" required to convert one sequence into another. These developments in the modern sequence analysis began with the heuristic homology algorithm of Needleman & Wunsch (1970) which first introduced an iterative matrix method of calculation. Numerous other heuristic algorithms have been suggested including those of Fitch (1966) and Dayhoff (1969). More mathematically rigorous algorithms were suggested by Sankoff (1972), Reichert et al. (1973) and Beyer et al. (1979) but these were generally not biologically satisfying or interpretable. Success came with Sellers (1974) development of a true metric measure of the distance between sequences. This metric was later generalized by Waterman et al. (1976) to include deletions/insertions of arbitrary length. This metric represents the minimum number of "mutational events" required to convert one sequence into another. It is of interest to note that Smith et al. (1980) have recently shown that under some conditions the generalized Sellers metric is equivalent to the original homology algorithm of Needleman & Wunsch (1970). In this letter we extend the above ideas to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology). The similarity measure used here allows for arbitrary length deletions and insertions.
Algorithm. The two molecular sequences will be $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$. A similarity $s(a,b)$ is given between sequence elements $a$ and $b$. Deletions of length $k$ are given weight $W_k$. To find pairs of segments with high degrees of similarity, we set up a matrix $H$. First set
Proc. Natl. Acad. Sci. USA, Vol. 80, pp. 726-730, February 1983, Biochemistry. Rapid similarity searches of nucleic acid and protein data banks (global homology/optimal alignment). W. J. WILBUR AND DAVID J. LIPMAN, Mathematical Research Branch, National Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases, National Institutes of Health, Building 31 Room 4B-54, Bethesda, Maryland 20205. Communicated by Maxine Singer, November 8, 1982. ABSTRACT With the development of large data banks of protein and nucleic acid sequences, the need for efficient methods of searching such banks for sequences similar to a given sequence has become evident. We present an algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k. The method results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity. The algorithm has also been adapted, in a separate implementation, to produce rigorous sequence alignments. Currently, using the DEC KL-10 system, we can compare all sequences in the entire Protein Data Bank of the National Biomedical Research Foundation with a 350-residue query sequence in less than 3 min and carry out a similar analysis with a 500-base query sequence against large banks of sequences.
We shall describe here a global algorithm for comparing two nucleic acid or two amino acid sequences. This algorithm involves the construction of an optimal alignment that is useful in its own right. The algorithm also requires a computation time on the order of N × M, where N and M are the lengths of the sequences being compared, but, for given sequences, the computation is many times faster than the above-mentioned methods. Results obtained by the method and its limitations and advantages are discussed. METHODS Computational Methods and Data Sources. All computing
They are good, but slow. FASTA appears.
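The matrix method described in the first abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact formulation: it assumes a simple match/mismatch score in place of $s(a,b)$ and a linear penalty per gap position in place of the general weights $W_k$. The function name and the default scores are hypothetical choices.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = 1):
    """Local alignment with a linear gap penalty (a simplification of W_k).

    Returns (best score, aligned segment of a, aligned segment of b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] - gap,    # gap in b
                H[i][j - 1] - gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    seg_a, seg_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            seg_a.append(a[i - 1])
            seg_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            seg_a.append(a[i - 1])
            seg_b.append("-")
            i -= 1
        else:
            seg_a.append("-")
            seg_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(seg_a)), "".join(reversed(seg_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

Every cell of the matrix is filled, so the running time grows with the product of the two sequence lengths. That cost is why the slide calls these methods good but slow when a query must be compared with a whole data bank.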
  15. We begin to need comparison algorithms. Bioinformatics is a science that poses problems and seeks solutions to them. Bioinformatics is a science because it seeks to discover information.
  16. More sequences accumulate, so more efficient comparisons are needed. The algorithm is improved, not the computer: BLAST arrives.
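The speed-up comes from the k-tuple idea in the Wilbur and Lipman abstract quoted earlier: index short words once, then look only at places where the query shares words with the target. The Python sketch below illustrates that idea under simplifying assumptions (exact k-tuple matches, one target sequence, hits counted per diagonal). It is not the published FASTA or BLAST algorithm, and the names `kmer_index` and `best_diagonal` are hypothetical.

```python
from collections import defaultdict


def kmer_index(seq: str, k: int) -> dict:
    """Map every k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def best_diagonal(query: str, target: str, k: int = 3):
    """Find the diagonal (target offset minus query offset) with most shared k-tuples.

    Returns (offset, number of k-tuple hits), or (None, 0) if nothing is shared.
    """
    index = kmer_index(target, k)
    hits_per_diagonal = defaultdict(int)
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):
            hits_per_diagonal[t - q] += 1
    if not hits_per_diagonal:
        return None, 0
    offset = max(hits_per_diagonal, key=hits_per_diagonal.get)
    return offset, hits_per_diagonal[offset]


if __name__ == "__main__":
    print(best_diagonal("GATTACA", "CCGATTACACC", k=3))
```

Only diagonals with many hits would then be passed to a rigorous alignment, so most of the full matrix is never computed.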
  17. The cost of sequencing falls, thanks to the engineers.
  18. Lower cost: more sequencing, more data and more databases.
  19. Databases are ESSENTIAL to bioinformaticians today.
  20. But Moore's law is unforgiving. Information accumulates faster than processor speed increases. Chart labels: number of transistors in Intel processors; growth of data in the databases. Computer engineers: HELP!
  21. The "info" cannot keep pace with the "bio".
  22. If resources do not grow, more people will have to be assigned to analysing the data.
  23. Bioinformaticians are needed despite (thanks to?) the crisis. http://www.indeed.com/jobtrends?q=molecular+biology,+bioinformatics,+biomedical+engineering&l=&relative=1
  24. In short, there are jobs for bioinformaticians.
  27. New requests for bioinformaticians appear every day. 30 Dec 2013.
  30. And in Spain and Europe too. http://www.eurosciencejobs.com/jobs/bioinformatics
  31. If what you want is to earn money, that too. Magazine page shown on the slide: You can advertise here from 50 euros. Contact: 633601207, publicidad@lamarea.com. La Marea has a CODE OF ETHICS agreed with its members to regulate advertising. The magazine will never publish advertisements that contradict our principles. We do not accept advertising with sexist or racist content or that ... nuts and pulses. All certified as organic farming. Ctra. AV923, km 0.5, Mombeltrán, Ávila. Telephone: 920370297. Genoma4u: Knowing your genome and that of your children is the key to personalised medicine. www.genoma4u.com. El Cantero de Letur: High-quality organic dairy foods. It is logical. It is ecological. Telephone: 967426066, www.elcanterodeletur.com. Can Europe be changed through the vote? The EU Parliament gains power but lacks the authority to oversee bodies such as the troika. APRIL 2014. THE MONTHLY MAGAZINE OF THE COOPERATIVE MÁSPÚBLICO. MERCADONA: the king of supermarkets imposes its own working conditions. WATER: the Government is finalising the privatisation of springs and river flows. 22-M: the Dignity Marches, a symbol of unity and people's power. APRIL 2014 | No. 15 | 3 €
  32. They are well paid, at least abroad. Linux and OS X skills pay better than Windows skills http://www.r-bloggers.com/r-skills-attract-the-highest-salaries/ R is taught in the bioinformatics track of the Health Engineering degree
  33. Did you know that, after databases, R is the most widely used tool in bioinformatics? Databases are used the most, and then R
  34. And that there are job offers for bioinformaticians with R? http://www.r-bloggers.com/r-jobs-march-24th/
  35. This world is within your reach at the UMA http://www.uma.es/grado-en-ingenieria-de-la-salud
  36. There are always specialisation courses to fall back on
  37. And books! Which, like Teruel, also exist
  38. A bioinformatician can work in many ways
      • As an engineer
        • Making difficult or tedious tasks easier
        • Workflows and automation
      • As a computer scientist
        • Improving existing algorithms
        • Creating new algorithms
        • For example, sequence assembly
      • As a scientist
        • Discovering biological information with the computer
        • For example, relating apparently unconnected diseases
  39. The competencies of the bioinformatician are being defined
      Source: Lonnie Welch, Fran Lewitter, Russell Schwartz, Cath Brooksbank, Predrag Radivojac, Bruno Gaeta, Maria Victoria Schneider (2014) Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies. Message from ISCB. PLoS Comput Biol 10(3): e1003496. doi:10.1371/journal.pcbi.1003496. Published March 6, 2014. Open-access article distributed under the Creative Commons Attribution License.
      Excerpts from the article pages shown on the slide ("[…]" marks text cut off in the image):
      Introduction. Rapid advances in the life sciences and in related information technologies necessitate the ongoing refinement of bioinformatics educational programs in order to maintain their relevance. As the discipline of bioinformatics and computational biology expands and matures, it is important to characterize the elements that contribute to the success of professionals in this field. These individuals work in a wide variety of settings, including bioinformatics core facilities, biological and medical research laboratories, software development organizations, pharmaceutical and instrument development companies, and institutions that provide education, service, and […]
      The skill sets required for success in the field of bioinformatics are considered by several authors: Altman [2] defines five broad areas of competency and lists key technologies; Ranganathan [3] presents highlights from the Workshops on Education in Bioinformatics, discussing challenges and possible solutions; Yale's interdepartmental PhD program in computational biology and bioinformatics is described in [4], which lists the general areas of knowledge of bioinformatics; in a related article, a graduate of Yale's PhD program reflects on the skills needed by a bioinformatician [5]; Altman and Klein [6] describe the Stanford Biomedical Informatics (BMI) Training Program, presenting observed trends among BMI students; the American Medical Informatics Association defines competencies in the related field of biomedical informatics in [7]; and the approaches used in several German universities to implement bioinformatics education are described in [8].
      […] Pevzner and Shamir [11] propose that undergraduate biology curricula should contain an additional course, "Algorithmic, Mathematical, and Statistical Concepts in Biology." Wingren and Botstein [12] present a graduate course in quantitative biology that is based on original, pathbreaking papers in diverse areas of biology. Johnson and Friedman [13] evaluate the effectiveness of incorporating biological informatics into a clinical informatics program. The results reported are based on interviews of four students and informal assessments of bioinformatics faculty.
      The challenges and opportunities relevant to training and education in the context of bioinformatics core facilities are discussed by Lewitter et al. [14]. Relatedly, Lewitter and Rebhan [15] provide guidance regarding the role of a bioinformatics core facility in hiring biologists and in furthering their education in bioinformatics. Richter and Sexton [16] describe […]
      […] The previous report of the task force summarized a survey that was conducted to gather input regarding the skill set needed by bioinformaticians [1]. The current article details a subsequent effort, wherein the task force broadened its perspectives by examining bioinformatics career opportunities, surveying directors of bioinformatics core facilities, and reviewing bioinformatics education programs.
      The bioinformatics literature provides valuable perspectives on bioinformatics education by defining skill sets needed by bioinformaticians, presenting approaches for providing informatics training to biologists, and discussing the roles of bioinformatics core facilities in training and education.
      […] They define a requisite skill set by analyzing responses to questions about the knowledge, skills, and abilities that biologists should possess. The authors in [10] present examples of strategies and methods for incorporating bioinformatics content into undergraduate […]
      This manuscript expands the body of knowledge pertaining to bioinformatics curriculum guidelines by presenting the results from a broad set of surveys (of core facility directors, of career opportunities, and of existing curricula). Although there is some overlap in the findings of the […]
      […] database management languages (e.g., Oracle, PostgreSQL, and MySQL), and also desirable for a bioinformatician to have modeling experience or background in one […]
      Table 1. Summary of the skill sets of a bioinformatician, identified by surveying bioinformatics core facility directors and examining bioinformatics career opportunities.
      Skill category | Specific skills
      General | time management, project management, management of multiple projects, independence, curiosity, self-motivation, ability to synthesize information, ability to complete projects, leadership, critical thinking, dedication, ability to communicate scientific concepts, analytical reasoning, scientific creativity, collaborative ability
      Computational | programming, software engineering, system administration, algorithm design and analysis, machine learning, data mining, database design and management, scripting languages, ability to use scientific and statistical analysis software packages, open source software repositories, distributed and high-performance computing, networking, web authoring tools, web-based user interface implementation technologies, version control tools
      Biology | molecular biology, genomics, genetics, cell biology, biochemistry, evolutionary theory, regulatory genomics, systems biology, next generation sequencing, proteomics/mass spectrometry, specialized knowledge in one or more domains
      Statistics and Mathematics | application of statistics in the contexts of molecular biology and genomics, mastery of relevant statistical and mathematical modeling methods (including experimental design, descriptive and inferential statistics, probability theory, differential equations and parameter estimation, graph theory, epidemiological data analysis, analysis of next generation sequencing data using R and Bioconductor)
      Bioinformatics | analysis of biological data; working in a production environment managing scientific data; modeling and warehousing of biological data; using and building ontologies; retrieving and manipulating data from public repositories; ability to manage, interpret, and analyze large data sets; broad knowledge of bioinformatics analysis methodologies; familiarity with functional genetic and genomic data; expertise in common bioinformatics software packages, tools, and algorithms
      doi:10.1371/journal.pcbi.1003496.t001
      http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1003496#pcbi-1003496-g002
  40. The engineer, the scientist and the user http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1003496#pcbi-1003496-g002
  41. The profile of an Australian bioinformatician http://www.ebi.edu.au/news/braembl-community-survey-report-2013 [Figure annotations: Where do they work? Who is the bioinformatician? This is the bioinformatician. This is a bio-user. Another bio-user. And this one too.]
  42. The bioinformatician has no mobility problems
  43. When do bioinformaticians rest? NCBI is the most heavily used site in biomedicine. Why? [Figure: NCBI web traffic, 1997-2006, plotted from January 1998 to January 2006.] Figures shown: 722,000 unique IPs a day; 91 million web hits a day; 3,200 peak web hits a second; 1.5 terabytes of FTP a day; 1.8 million unique users a day
  44. There are always things that a computer scientist will do better (screenshot dated 10 Apr 2013)
  45. We now know what is expected of a bioinformatician. Let us look at some examples that are as real as life itself
  46. Workflows that automate repetitive tasks [Figure: example workflows spanning the «wet» side and the «dry» side; panels: Microarray, Assembling, Data mining]
  47. Two examples «made in Málaga»: SeqTrim and FullLengtherNEXT [Diagram labels: Raw sequences; SeqTrimNEXT (pre-processing); Assembly; Annotation with Maker; Mining with FullLengtherNEXT; branches labelled GENOMICS and TRANSCRIPTOMICS]
  48. Why were these tools needed? [Figure, FullLengtherNEXT panel: bar chart (scale 0 to 60,000) comparing OLC, de Bruijn and OLC + de Bruijn + CAP3 assemblies by number of unigenes, orthologs for unigenes, complete unigenes with orthologs and unique complete unigenes with orthologs. The more complete genes there are, the better the assembly.] [Figure, SeqTrimNEXT panel: number of contigs (scale 0 to 30) and N50 (scale 0 to 50,000) for BAC1, BAC2 and BAC3, comparing Newbler with SeqTrimNext + Newbler. Fewer contigs, higher N50.]
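The N50 quoted in the charts is a standard contiguity statistic: the length of the contig at which, going from the longest contig downwards, the running total first reaches half of the assembled bases. A minimal sketch of that definition follows. The two contig sets are invented for illustration and are not the BAC data from the slide.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical assemblies of the same 100 kb region (made-up numbers).
fragmented = [10_000] * 10              # many short contigs
contiguous = [60_000, 30_000, 10_000]   # fewer, longer contigs

for name, contigs in (("fragmented", fragmented), ("contiguous", contiguous)):
    print(name, "contigs:", len(contigs), "N50:", n50(contigs))
```

With the same total length, the assembly with fewer contigs has the higher N50, which is the pattern the slide describes as an improvement.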
  49. There is bioinformatics for transcriptomics at the UMA
      [Screenshot 1] DATABASE, Open Access. EuroPineDB: a high-coverage web database for maritime pine transcriptome. Noé Fernández-Pozo, Javier Canales, Darío Guerrero-Fernández, David P Villalobos, Sara M Díaz-Moreno, Rocío Bautista, Arantxa Flores-Monterroso, M Ángeles Guevara, Pedro Perdiguero, Carmen Collada, M Teresa Cervera, Álvaro Soto, Ricardo Ordás, Francisco R Cantón, Concepción Avila, Francisco M Cánovas and M Gonzalo Claros. Fernández-Pozo et al. BMC Genomics 2011, 12:366 http://www.biomedcentral.com/1471-2164/12/366
      Abstract
      Background: Pinus pinaster is an economically and ecologically important species that is becoming a woody gymnosperm model. Its enormous genome size makes whole-genome sequencing approaches are hard to apply. Therefore, the expressed portion of the genome has to be characterised and the results and annotations have to be stored in dedicated databases.
      Description: EuroPineDB is the largest sequence collection available for a single pine species, Pinus pinaster (maritime pine), since it comprises 951 641 raw sequence reads obtained from non-normalised cDNA libraries and high-throughput sequencing from adult (xylem, phloem, roots, stem, needles, cones, strobili) and embryonic (germinated embryos, buds, callus) maritime pine tissues. Using open-source tools, sequences were optimally pre-processed, assembled, and extensively annotated (GO, EC and KEGG terms, descriptions, SNPs, SSRs, ORFs and InterPro codes). As a result, a 10.5× P. pinaster genome was covered and assembled in 55 322 UniGenes. A total of 32 919 (59.5%) of P. pinaster UniGenes were annotated with at least one description, revealing at least 18 466 different genes. The complete database, which is designed to be scalable, maintainable, and expandable, is freely available at: http://www.scbi.uma.es/pindb/. It can be retrieved by gene libraries, pine species, annotations, UniGenes and microarrays (i.e., the sequences are distributed in two-colour microarrays; this is the only conifer database that provides this information) and will be periodically updated. Small assemblies can be viewed using a dedicated visualisation tool that connects them with SNPs. Any sequence or annotation set shown on-screen can be downloaded. Retrieval mechanisms for sequences and gene annotations are provided.
      Conclusions: The EuroPineDB with its integrated information can be used to reveal new knowledge, offers an easy-to-use collection of information to directly support experimental work (including microarray hybridisation), and provides deeper knowledge on the maritime pine transcriptome.
      1 Background. Conifers (Coniferales), the most important group of gymnosperms, represent 650 species, some of which are the largest, tallest, and oldest non-clonal terrestrial […] Given that trees are the great majority of conifers, they provide a different perspective on plant genome biology and evolution taking into account that conifers are separated from angiosperms by more than 300 million years […]
      [Screenshot 2] Research Article. De novo assembly of maritime pine transcriptome: implications for forest breeding and biotechnology. Javier Canales, Rocio Bautista, Philippe Label, Josefa Gomez-Maldonado, Isabelle Lesur, Noe Fernandez-Pozo, Marina Rueda-Lopez, Dario Guerrero-Fernandez, Vanessa Castro-Rodríguez, Hicham Benzekri, Rafael A. Cañas, María-Angeles Guevara, Andreia Rodrigues, Pedro Seoane, Caroline Teyssier, Alexandre Morel, François Ehrenmann, Gregoire Le Provost, Celine Lalanne, Celine Noirot, Christophe Klopp, Isabelle Reymond, Angel García-Gutierrez, Jean-François Trontin, Marie-Anne Lelu-Walter, Celia Miguel, María Teresa Cervera, Francisco R. Canton, Christophe Plomion, Luc Harvengt, Concepcion Avila, M. Gonzalo Claros and Francisco M. Canovas. Plant Biotechnology Journal (2013), pp. 1-14, doi: 10.1111/pbi.12136
      [Slide labels: Microarrays; Databases; Tools and algorithms…; Genomics, proteomics, metabolomics; Biotechnology]
  50. First, the data are collected
      […] homology was found, respectively, confirming that most assembled unigenes were pine transcripts. In fact, 4608 unigenes had a homologue EST in the Pine Gene Index 9.0 database (http://compbio.dfci.harvard.edu/cgi-bin/tgi/ […]
      Table 1. Description of samples used for DNA sequencing (from "The maritime pine transcriptome")
      Gene library | Sequencing platform | Sampled plant material | Experimental conditions | SRA code
      EuroPineDB | Sanger/454 | Bud, xylem, phloem, stem, needles, roots, stem, embryos, callus, cone, male and female strobili | ESTs and SSH libraries from different tissues and conditions as described by Fernandez-Pozo et al., 2011 | SRS479769
      Biogeco1 | 454 | Xylem, bud and needle | ESTs from differentiating xylem, swelling bud and young needles | SRX032960, SRX032961, SRX032962, SRX032963
      Biogeco2 | 454 | Bud | EST from quiescent buds harvested on 2-year-old maritime pine (low growing family) in well-watered or drought-stress conditions | SRX031546
      Biogeco3 | 454 | Bud | EST from quiescent buds harvested on 2-year-old maritime pine (fast growing family) in well-watered or drought-stress conditions | SRX031589
      UAGPF1 | 454 | Embryome | ESTs from developing, immature embryos (1-week maturation) | SRX022618
      INIA_PPIN | 454 | Bud | ESTs from buds | PRJNA221139
      U_root | 454 | Root | ESTs from roots (1-month-old seedlings) | SRS480239
      U_tip | 454 | Root tips | ESTs from root tips (1-month-old seedlings) | SRS480265
      U_H | 454 | Hypocotyl | ESTs from hypocotyl (1-month-old seedlings) | SRS480236
      U_N | 454 | Needle | ESTs from needles (1-month-old seedlings) | SRS480237
      U_Cot_Os | 454 | Cotyledon | ESTs from cotyledons grown under dark conditions | SRS479771
      U_H_Os | 454 | Hypocotyl | ESTs from hypocotyl grown under dark conditions | SRS480236
      U_R_6 | 454 | Roots | ESTs from roots (6-month-old seedlings) | SRS480238
      U_S_8 | 454 | Stem | ESTs from stem (8-month-old seedlings) | SRS480261
      UAGPF2 | Illumina | Somatic embryo | Paired-end ESTs from developing, immature embryos (1 week maturation) | SRR609713
      BIOGECO4 | Illumina | Bud | ESTs from young and aged buds | SRX031587
      BIOGECO5 | Illumina | Root | ESTs from drought-stressed and control roots in hydropony | SRX031592, SRX031590
      BIOGECO6 | Illumina | Bud | ESTs from young and aged buds | SRX031594
      IBET | Illumina | Zygotic embryo | Paired-end ESTs from embryos | SRS481044
  51. Next, the workflow is designed [Workflow diagram, steps numbered #1 to #12. Short-read branch: Short reads; SeqTrimNext (pre-processing); Oases (pre-assembling), k-mers 23 and 47, paired-end + single; CD-HIT 99%; miss-assembly rejection; rejected reads. Long-read branch: S. senegalensis long reads; SeqTrimNext (pre-processing); MIRA (pre-assembling); EULER-SR (pre-assembling); CAP3 (reconciliation); contigs; debris. Joint steps: BOWTIE 2 (mapping test); mapped contigs; unmapped contigs; Full-LengtherNext v3; coding unmapped contigs; non-coding; missassemblies. Output: UNIGENES S. senegalensis v4]
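The workflow lists a k-mer-based pre-assembly step (Oases, with k-mers of 23 and 47). Oases belongs to the de Bruijn graph family of assemblers, which start by cutting every read into overlapping words of length k. The sketch below only illustrates that first step on toy reads with a tiny k. It is not the Oases implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return every overlapping substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two toy reads that overlap by three bases (illustrative data only).
toy_reads = ["ACGT", "CGTA"]
for prefix, suffixes in sorted(de_bruijn_edges(toy_reads, 3).items()):
    print(prefix, "->", sorted(suffixes))
```

A larger k resolves repeats better but needs deeper coverage, which is one reason a workflow might combine assemblies made with more than one k value.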
  52. Workflows are becoming more and more important
      Source: Federica Torri, Ivo D. Dinov, Alen Zamanyan, Sam Hobel, Alex Genco, Petros Petrosyan, Andrew P. Clark, Zhizhong Liu, Paul Eggert, Jonathan Pierce, James A. Knowles, Joseph Ames, Carl Kesselman, Arthur W. Toga, Steven G. Potkin, Marquis P. Vawter and Fabio Macciardi. Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows. Genes 2012, 3, 545-575; doi:10.3390/genes3030545. ISSN 2073-4425, www.mdpi.com/journal/genes. Open access. Received: 6 July 2012; in revised form: 15 August 2012 / Accepted: 15 August 2012 / Published: 30 August 2012.
      Abstract: Whole-genome and exome sequencing have already proven to be essential and powerful methods to identify genes responsible for simple Mendelian inherited disorders. […]
      Table 1. Review of the most used software in next-generation sequencing (NGS) data analysis. Which includes two major computational macro-processes: (1) a primary step related to mapping and assembling, with alignment quality control, quality score re[…] regions of the genome; and (2) secondary, advanced steps focused on variant (single nucleotide polymorphisms (SNPs), insertions-deletions (Indels) and copy number variations (CNVs)) calling and annotation. These macro-processes are briefly reviewed to provide a background for the software algorithms embedded in DNA-Seq analysis.
      Columns in the original: Process, Software, Algorithms, Website.
      Preprocessing step
        homemade script (N/A)
      (1.1) Alignment
        MAQ http://maq.sourceforge.net
        BWA http://bio-bwa.sourceforge.net/bwa.shtml
        BWA-SW (SE only) http://bio-bwa.sourceforge.net/bwa.shtml
        PERM http://code.google.com/p/perm/
        BOWTIE http://bowtie-bio.sourceforge.net
        SOAPv2 http://soap.genomics.org.cn
        MOSAIK http://bioinformatics.bc.edu/marthlab/Mosaik
        NOVOALIGN http://www.novocraft.com/
      (1.2) De novo Assembly
        VELVET http://www.ebi.ac.uk/%7Ezerbino/velvet
        SOAPdenovo http://soap.genomics.org.cn
        ABYSS http://www.bcgsc.ca/platform/bioinfo/software/abyss
      (1.3) Basic QC
        SAMTOOLS http://sourceforge.net/projects/SAMtools/files/
        PICARD http://picard.sourceforge.net/command-line-overview.shtml
      (1.4) Advanced QC
        GATK http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
        PICARD http://picard.sourceforge.net/
        SAMTOOLS http://sourceforge.net/projects/SAMtools/files/
        IGVtools http://www.broadinstitute.org/igv/igvtools
      (2.1a) Variant Calling and annotation
        Sequence Variant Analyzer v1.0, for hg18 annotations:
          SVA http://www.svaproject.org/
          SAMTOOLS http://sourceforge.net/projects/SAMtools/files/
          ERDS http://www.duke.edu/~mz34/erds.htm
        SAMTOOLS and ANNOVAR for annotation:
          SAMTOOLS http://sourceforge.net/projects/SAMtools/files/
          ANNOVAR http://www.openbioinformatics.org/annovar/
        UnifiedGenotyper and ANNOVAR for annotation:
          GATK http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
          ANNOVAR http://www.openbioinformatics.org/annovar/
      (2.1b) CNVs
        CNVseq:
          CNVseq http://tiger.dbs.nus.edu.sg/cnv-seq/
          R http://www.r-project.org/
        SAMTOOLS/ERDS/Sequence variant analyzer v1.0:
          SVA http://www.svaproject.org/
          SAMTOOLS http://sourceforge.net/projects/SAMtools/files/
          ERDS http://www.duke.edu/~mz34/erds.htm
        CNVer:
          CNVer http://compbio.cs.toronto.edu/CNVer/
          BOWTIE http://bowtie-bio.sourceforge.net
          SAVANT http://compbio.cs.toronto.edu/savant/
      Simulated data generation tool
        dwgsim http://sourceforge.net/projects/dnaa/
  53. Then it is run, and parallelised as much as possible. Fig. 2 Flow chart showing preprocessing into useful reads, assembly into contigs and overlap-based reconciliation into final unigenes of sequenced data from 5 (591 174 069 short reads, Illumina) or 14 (6 381 011 long reads, 454) cDNA libraries in maritime pine. (From "The maritime pine transcriptome".) [Slide annotations: "Here «many» sequences were assembled" and "And here 10X more"]
  54. Now we design a database, with tables for the annotations and meta-information that we find
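As a rough illustration of "tables for annotations", the sketch below uses Python's built-in sqlite3 module with one table for unigenes and one for their annotations, linked by a foreign key. The table names, column names and sample values are all hypothetical placeholders. The actual EuroPineDB schema is not shown in the slides.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE unigene (
        id       INTEGER PRIMARY KEY,
        name     TEXT UNIQUE NOT NULL,
        sequence TEXT NOT NULL
    );
    CREATE TABLE annotation (
        id         INTEGER PRIMARY KEY,
        unigene_id INTEGER NOT NULL REFERENCES unigene(id),
        kind       TEXT NOT NULL,   -- annotation type, e.g. GO, EC, KEGG, InterPro
        value      TEXT NOT NULL
    );
    """
)

# Placeholder records (invented identifiers, not real database entries).
cursor = conn.execute(
    "INSERT INTO unigene (name, sequence) VALUES (?, ?)",
    ("unigene_demo_1", "ATGGCCTAA"),
)
conn.execute(
    "INSERT INTO annotation (unigene_id, kind, value) VALUES (?, ?, ?)",
    (cursor.lastrowid, "GO", "GO:0000000"),
)
conn.commit()


def annotations_for(connection, unigene_name):
    """Return (kind, value) pairs for one unigene, sorted for stable output."""
    rows = connection.execute(
        """
        SELECT a.kind, a.value
        FROM annotation AS a
        JOIN unigene AS u ON u.id = a.unigene_id
        WHERE u.name = ?
        ORDER BY a.kind, a.value
        """,
        (unigene_name,),
    )
    return rows.fetchall()


print(annotations_for(conn, "unigene_demo_1"))
```

Keeping annotations in their own table lets one unigene carry any number of annotation types without changing the schema.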
  55. … and we give it a web interface for the scientific community
      […] gene library and pine species, and can be accessed using the 'Assemblies' tab. Each assembly can be inspected in detail, showing a paged list of UniGenes and a summary description. The detailed view of every UniGene […] means of GO term filtering.
      3.1.2 Database retrieval. In addition to a guided browsing, EuroPineDB contents can be retrieved by means of text search or sequence […]
      [Figure labels. Views: Home; Gene libraries; 96-Well plates; 384-Well plates; Microarrays; Assemblies; BLAST; Search; Each library; All sequences; Each 96w_plate; Each clone/sequence; Each 384w_plate; Each microarray block; Each UniGene; External links; List of assemblies. Annotations: Descriptions, GO, EC, KEGG, InterPro, SNP, SSR, ORF]
      Figure 3 Navigating through EuroPineDB. Arrowheads indicate the direction of navigation. Green boxes correspond to available views from all pages (thus, no incoming arrowhead is specified). Violet text indicates the option of downloading sequences in FASTA format.
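The figure caption mentions downloading sequences in FASTA format. For readers who have not met it, FASTA is plain text: a header line starting with ">" followed by one or more lines of sequence. A minimal reader and writer could look like the sketch below. It assumes well-formed input and unique record names, and the record names are invented for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""   # keep only the first word
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def to_fasta(records, width=60):
    """Serialise a dict of name -> sequence, wrapping lines at `width`."""
    lines = []
    for name, sequence in records.items():
        lines.append(f">{name}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


demo = {"demo_unigene_1": "ACGT" * 20, "demo_unigene_2": "GG"}
print(to_fasta(demo), end="")
```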
  56. Now we can discover biological information
      A total of 5974 putative simple-sequence repeat (SSRs) were found, with trinucleotide repeats (3309) being the most common, and dinucleotide repeats (479) the less abundant. This is in agreement to previously published P. pinaster SSR abundance (Fernandez-Pozo et al., 2011).
      Discussion. Maritime pine transcriptome assembly. Long-read sequence data sets are required for transcriptome assembly in nonmodel species for which a reference genome is not available. In conifers, 454 sequencing has been recently used to generate well-defined transcriptomes in several species of ecological and economic interest, that is, Pinus contorta (Parchman et al., 2010), P. glauca (Rigault et al., 2011), P. pinaster (Fernandez-Pozo et al., 2011), Pinus taeda and 11 other conifers (Lorenz et al., 2012). In the present work, we used a combination of 454 and Illumina sequencing to define a minimal reference transcriptome for maritime pine (P. pinaster). A similar approach was recently used to characterize, for example, the globe artichoke transcriptome (Scaglione et al., 2012). The nonredundant transcriptome resulting from the assembly contains 26 020 unique transcripts with orthologue ID in public databases, a number very close to the 27 720 unique cDNA clusters reported for the P. glauca transcriptome (Rigault et al., 2011) and higher than the 17 000 unique coding genes obtained in the assembly of P. contorta transcriptome (Parchman et al., 2010). The number of unique transcripts in maritime pine is also close to the number of genes (28 354) resulting from the draft assembly of the 20-gigabase genome of P. abies (Nystedt et al., 2013). Considering all the available data, an elevated coverage of the maritime pine transcriptome is estimated.
      Fig. 4 Distribution of unique transcripts corresponding to TF gene families in Pinus pinaster and comparison to other plant transcriptomes. The number of different encoded transcripts with the conserved DNA-binding domain of each family is represented. The distribution of TF gene families in P. pinaster, Picea glauca, Picea abies, Populus trichocarpa and Arabidopsis thaliana is compared. († MYB family of TF. ‡ Dof family of TF. § NAC family of TF.)
      […] annotation, comparative analysis with other conifer species and also for functional analysis of relevant genes associated to maritime pine growth, development and response to environmental changes. Furthermore, this genomic resource will greatly facilitate protein identification as well as protein-protein interaction studies through proteomics approaches (Canovas et al., 2004). For all these reasons, it was of paramount importance to […]
      […] (Figure 5) present in maritime pine (this work) or spruce genomes (Birol et al., 2013; Nystedt et al., 2013; Rigault et al., 2011) were of similar or even lower size compared with angiosperm species (P. trichocarpa, A. thaliana and V. vinifera). Meanwhile, the existence of large gene families in conifers coding for enzymes of secondary metabolism has been reported (Martin et al., 2004), there are other families in primary metabolism that contain […]
      Fig. 5 Comparison of gene families for relevant enzymes in Pinus pinaster, Picea abies, Populus trichocarpa and Arabidopsis thaliana. The following databases were used in addition to SustainpineDB: P. abies v1.0, P. trichocarpa v3.0, A. thaliana TAIR 10.
      [Slide annotation: The pine genome is 10X the size of the human genome, but its gene families are smaller than in other plants]
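The excerpt counts di- and trinucleotide simple-sequence repeats (SSRs). One simple way to locate such repeats is a regular expression with a back-reference, as sketched below. The minimum of four tandem copies is an arbitrary assumption made for illustration. The paper's own SSR criteria and software are not given on the slide, and the test sequence is invented.

```python
import re


def find_ssrs(sequence, min_repeats=4):
    """Return (start, motif, copies) for tandem di- and trinucleotide repeats.

    `min_repeats` is an illustrative threshold, not the one used in the paper.
    """
    # ([ACGT]{2,3}?) tries a 2-base motif first, then a 3-base motif;
    # \1{n,} requires at least n further tandem copies of the same motif.
    pattern = re.compile(r"([ACGT]{2,3}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


toy = "TTT" + "AG" * 5 + "CCGT" + "ATG" * 4 + "C"
for start, motif, copies in find_ssrs(toy):
    print(f"position {start}: ({motif}) x {copies}")
```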
  57. Did I not just mention «parallelisation»?
      Source: Hindawi Publishing Corporation, Computational Biology Journal, Volume 2013, Article ID 707540, 12 pages, http://dx.doi.org/10.1155/2013/707540. Research Article. SCBI_MapReduce, a New Ruby Task-Farm Skeleton for Automated Parallelisation and Distribution in Chunks of Sequences: The Implementation of a Boosted Blast+. Darío Guerrero-Fernández (1), Juan Falgueras (2) and M. Gonzalo Claros (1,3). (1) Supercomputación y Bioinformática-Plataforma Andaluza de Bioinformática (SCBI-PAB), Universidad de Málaga, 29071 Málaga, Spain; (2) Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain; (3) Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, 29071 Málaga, Spain. Correspondence should be addressed to M. Gonzalo Claros; claros@uma.es. Received 21 June 2013; Revised 18 September 2013; Accepted 19 September 2013. Academic Editor: Ivan Merelli. Copyright © 2013 Darío Guerrero-Fernández et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
      Abstract: Current genomic analyses often require the managing and comparison of big data using desktop bioinformatic software that was not developed regarding multicore distribution. The task-farm SCBI MapReduce is intended to simplify the trivial parallelisation and distribution of new and legacy software and scripts for biologists who are interested in using computers but are not skilled programmers. In the case of legacy applications, there is no need of modification or rewriting the source code. It can be used from multicore workstations to heterogeneous grids. Tests have demonstrated that speed-up scales almost linearly and that distribution in small chunks increases it. It is also shown that SCBI MapReduce takes advantage of shared storage when necessary, is fault-tolerant, allows for resuming aborted jobs, does not need special hardware or virtual machine support, and provides the same results than a parallelised, legacy software. The same is true for interrupted and relaunched jobs. As proof-of-concept, distribution of a compiled version of Blast+ in the SCBI Distributed Blast gem is given, indicating that other blast binaries can be used while maintaining the same SCBI Distributed Blast code. Therefore, SCBI MapReduce suits most parallelisation and distribution needs in, for example, gene and genome studies.
      1. Introduction. The study of genomes is undergoing a revolution: the production of an ever-growing amount of sequences increases year by year at a rate that outpaces computing performance [1]. This huge amount of sequences needs to be processed with the well-proven algorithms that will not run faster in new computer chips since around 2003 chipmakers discovered that they were no longer able to sustain faster sequential execution except for generating the multicore chips [2, 3]. Therefore, the only current way to obtain results in a timely manner […]
      Sequence alignment and comparison are the most important topics in bioinformatic studies of genes and genomes. It is a complex process that tries to optimise sequence homology by means of sequence similarity using the algorithm of Needleman-Wunsch for global alignment, or the one of Smith-Waterman for local alignments. Blast and Fasta [4] are the most widespread tools that have implemented them. Paired sequence comparison is inherently a parallel process in which many sequence pairs can be analysed at the same time by means of functions or algorithms that are iteratively performed over sequences. This is impelling the par[…]
      [Slide annotations: Picasso; Programming fundamentals]
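The introduction names the Smith-Waterman algorithm for local alignment. Its core is a dynamic-programming table in which every cell holds the best score of a local alignment ending at that pair of positions, floored at zero so that an alignment can start anywhere. The sketch below computes only the best score, with a linear gap penalty. The scoring values are illustrative assumptions, not the parameters of Blast or Fasta.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]               # first column is always zero
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(
                0,                      # start a new local alignment here
                diagonal,               # align a[i-1] with b[j-1]
                previous[j] + gap,      # gap in b
                current[j - 1] + gap,   # gap in a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# The shared core "ACGT" scores 4 matches x 2; the flanks do not help.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Each pair of sequences is scored independently of every other pair, which is why many such comparisons can run at the same time.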
  58. SCBI_MapReduce: for parallelising and distributing
Efficient. Robust. Improves the performance of Blast.
  59. So bioinformatics is not at odds with supercomputing
Red Española de Supercomputación (Spanish Supercomputing Network)
Picasso: 2310 cores, 700 TB disk
• 7 FAT nodes of shared memory: 80 cores, 2 TB RAM, 25 GB/core
• Computing nodes: 984 cores, 4 TB RAM, 4 GB/core
• "Thin" nodes: 768 cores, 3 TB RAM, 8 GB/core
• GPU nodes: 32 GPU, 1 TB RAM, 8 GB/core
  60. Picasso: a data centre (CPD) for supercomputing and bioinformatics
Photo labels: Hard disks; FAT nodes; Computing nodes; THIN nodes; More disks; GPU nodes
  61. Why data-centre (CPD) infrastructures are a good thing
• Providing solid infrastructure for software and hardware
• More cost-efficient for large-scale projects
• Cost-effective (licenses, computers...)
• Including expensive software and multi-user licenses
• Specialization
• Collaboration with other research groups outside UMA
Editorial: The Need for Centralization of Computational Biology Resources
Fran Lewitter1*, Michael Rebhan2*, Brent Richter3*, David Sexton4*
1 Bioinformatics and Research Computing, Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, United States of America
2 Novartis Institutes for BioMedical Research, Basel, Switzerland
3 Enterprise Research IS and Informatics, Brigham and Women's Hospital, Massachusetts General Hospital, and Partners Healthcare, Boston, Massachusetts, United States of America
4 Center for Human Genetics Research, Computational Genomics Core, Vanderbilt University, Nashville, Tennessee, United States of America
Biomedical research is benefiting from the wealth of new data generated in the laboratory through new instrumentation, greater computational resources, and massive repositories of public domain data. Using these data to make scientific discoveries is sometimes straightforward, but can be complicated by the number and breadth of public sources available to the researcher as well as by the plethora of tools from which to choose. Complex searches, analyses, or even storage needs require more computational expertise than that available within an individual laboratory. As biomedical researchers develop more computational skills, this may change over time. Having a centralized group of experts in [...]
Why Centralize?
Different institutions will have different names for these centralized resources ("core facility", "platform", etc.) and different responsibilities for the group based on size and organization. For the purposes of this Editorial and the accompanying Perspectives (doi:10.1371/journal.pcbi.1000368 and doi:10.1371/journal.pcbi.1000369), we use the term "Bioinformatics Core Facility" to refer to these centralized resources. No matter what name is used, the primary focus of the centralized resource will be to support the investigators with their computational needs. Below, we highlight some of the most important reasons we see for centralizing these resources.
Providing Infrastructure
On the software side, it can be economical to purchase multi-user, concurrent, or site licenses rather than individual licenses. This also helps with support of the software as purchasers of the larger licenses will likely be better prepared to field questions and offer training opportunities about installation and use of the software. In addition, the Bioinformatics Core Facility may be in a position to purchase expensive software that is used only occasionally by researchers, thus being able to provide more options for individuals to address important research needs. Many researchers in an institution may have the same needs for custom software. A person working in a centralized facility can [...]
* E-mail: lewitter@wi.mit.edu (FL); michael.rebhan@novartis.com (MR); brichter@partners.org (BR); sexton@chgr.mc.vanderbilt.edu (DS)
The order of authors is alphabetic; each author has contributed equally to the development and writing of this Editorial.
PLoS Computational Biology | www.ploscompbiol.org | June 2009 | Volume 5 | Issue 6 | e1000372
  62. How is it accessed?
Diagram labels: Web tools; Command line; Web interface; Web server; Virtual machines; Database; Home; Files; Virtual machine; File transfer
  63. Bioinformatics is not limited to sequences and databases
  64. Applications of bioinformatics and supercomputing
  65. Discovering new drugs "used to be" extremely expensive
Classical method: every compound has to be synthesised and tested in animals.
Bioinformatic method: only the candidates are synthesised, saving on synthesis, time and animals.
Diagram label: Ligand database
  67. It earned the 2013 Nobel Prize in Chemistry
Awarded for the development of computational models to understand and predict chemical processes.
Laureate labels: Theoretical chemist; Biophysicist; Biochemist
http://blogs.plos.org/biologue/2013/10/18/the-significance-of-the-2013-nobel-prize-in-chemistry-and-the-challenges-ahead/
"This Nobel Prize is the first given to work in computational biology, indicating that the field has matured and is on a par with experimental biology" (the blog of PLOS Computational Biology)
  68. Systems biology reveals the keys
Cellular regulation becomes more complicated as the complexity of the organism increases.
  69. It tells us which proteins are best left untouched
Topology, tinkering and evolution of the human transcription factor network
Carlos Rodriguez-Caso1,2, Miguel A. Medina2 and Ricard V. Solé1,3
1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain
2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Málaga, Spain
3 Santa Fe Institute, Santa Fe, New Mexico, USA
FEBS Journal 272 (2005) 6423–6434. © 2005 The Authors. Journal compilation © 2005 FEBS. doi:10.1111/j.1742-4658.2005.05041.x
(Received 5 August 2005, revised 25 October 2005, accepted 31 October 2005)
Keywords: human; molecular evolution; protein interaction; tinkering; transcription factor network
Correspondence: Ricard V. Solé, ICREA - Complex System Laboratory, Universitat Pompeu Fabra, Dr Aiguader 80, 08003 Barcelona, Spain. Fax: +34 93 221 3237. Tel: +34 93 542 2821. E-mail: ricard.sole@upf.edu
Abstract: Patterns of protein interactions are organized around complex heterogeneous networks. Their architecture has been suggested to be of relevance in understanding the interactome and its functional organization, which pervades cellular robustness. Transcription factors are particularly relevant in this context, given their central role in gene regulation. Here we present the first topological study of the human protein–protein interacting transcription factor network built using the TRANSFAC database. We show that the network exhibits scale-free and small-world properties with a hierarchical and modular structure, which is built around a small number of key proteins. Most of these proteins are associated with proliferative diseases and are typically not linked to each other, thus reducing the propagation of failures through compartmentalization. Network modularity is consistent with common structural and functional features and the features are generated by two distinct evolutionary strategies: amplification and shuffling of interacting domains through tinkering and acquisition of specific interacting regions. The function of the regulatory complexes may have played an active role in choosing one of them.
Abbreviations: ER, Erdős-Rényi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor.
Living cells are composed of a large number of different molecules interacting with each other to yield complex spatial and temporal patterns. Unfortunately, this reality is seldom captured by traditional and molecular biology approaches. A shift from molecular to modular biology seems unavoidable [1] as biological systems are [...]
Early topological studies of cellular networks revealed that genomic, proteomic and metabolic maps share characteristic features with other real-world networks [8–12]. Protein networks, also called interactomes, were studied thanks to a massive two-hybrid system screening in unicellular Saccharomyces cerevisiae [...]
[...] allow the formation of supramolecular activator or inhibitory complexes, depending on their components and possible combinations. Transcription factors (TFs) are an essential subset of interacting proteins responsible for the control of gene expression. They interact with DNA regions and tend to form transcriptional regulatory complexes. Thus, the final effect of one of these complexes is determined by its TF composition. The number of TFs varies among organisms, although it appears to be linked to the organism's complexity. Around 200–300 TFs are predicted for Escherichia coli [18] and Saccharomyces [19,20]. By contrast, comparative analysis in multicellular organisms shows that the predicted number of TFs reaches 600–820 in C. elegans and D. melanogaster [20,21], and 1500–1800 in Arabidopsis (1200 cloned sequences) [20–22]. For humans, around 1500 TFs have been documented [21] and it is estimated that there are 2000–3000 [21,23]. Such an increase in the number of TFs is associated with higher control of gene regulation [24]. Interestingly, such an increase is based on the use of the same structural types of proteins. Human transcription factors are predominantly Zn fingers, followed by homeobox and basic helix–loop–helix [21]. Phylogenetic studies have shown that the amplification and shuffling of protein domains determine the [...]
Fig. 1. Human transcription factor network built from data extracted from the TRANSFAC 8.2 database. Numbered black filled nodes are the highest connected transcription factors. 1, TATA-binding protein (TBP); 2, p53; 3, p300; 4, retinoid X receptor α (RXRα); 5, retinoblastoma protein (pRB); 6, nuclear factor NF-κB p65 subunit (RelA); 7, c-jun; 8, c-myc; 9, c-fos.
[...] or via control of TF expression, less connected factors may also be relevant to cell survival.
Functional and structural patterns from topology
In order to reveal the mechanisms that shape the structure of HTFN, we studied its topological modularity in relation to the function and structure of TFs from available information. From a structural point of view, the overabundance of self-interactions is associated [...] a complex, by varying their function and affinity to DNA. This is the case of the bHLH–bZip proto-oncogen c-myc [44], or the Zn finger retinoid X receptor RXR [45]. From a topological viewpoint, connections by self-interacting domains would imply high clustering and modularity, because all these proteins share the same rules and they have the potential to give a highly interconnected subgraph (i.e. a module). According to this, the high clustering of HTFN (see Fig. 1) could be [...]
Table 2. Description and functionality of transcription factor hubs. Transcription factor (TF), degree (k), betweenness centrality (b).
TF | Description | Associated disease | k | b × 10^3
TBP | Basal transcription machinery initiator | Spinocerebellar ataxia [40] | 27 | 17.3
p53 | Tumor suppressor protein | Proliferative disease [68] | 23 | 18.5
p300 | Coactivator. Histone acetyltransferase | May play a role in epithelial cancer [69] | 18 | 20.2
RXR-α | Retinoid X-α receptor | Hepatocellular carcinoma [70] | 18 | 8
pRB | Retinoblastoma suppressor protein. Tumour suppressor protein | Proliferative disease. Bladder cancer. Osteosarcoma [71] | 15 | 27.1
RelA | NF-κB pathway | Hepatocyte apoptosis and foetal death [72] | 14 | 6.6
c-jun | AP-1 complex (activator). Proto-oncogen | Proliferative disease [73] | 14 | 4.1
c-myc | Activator. Proto-oncogen | Proliferative disease [74] | 13 | 10.5
c-fos | AP-1 complex (activator). Proto-oncogen | Proliferative disease [75] | 12 | 2
Slide caption: There are at least 9 transcription factors that cause cancer if they are mutated, no matter what.
Slide label: Systems biology
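Table 2 ranks the hubs by degree (k) and betweenness centrality (b). As a rough illustration of what those two numbers measure, here is a minimal Python sketch on a toy network. The node names and edges are invented, not TRANSFAC data, and the betweenness is the raw count of shortest paths through a node (Brandes' algorithm for unweighted graphs); the paper's b values appear to be normalised, so the scales are not comparable.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from a list of (a, b) interaction pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree(graph):
    """Number of interaction partners of each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness(graph):
    """Unnormalised betweenness; each unordered pair of endpoints counts once."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search counting shortest paths (sigma) and predecessors
        sigma = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        preds = {node: [] for node in graph}
        sigma[source], dist[source] = 1, 0
        order, queue = [], deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every path was seen from both of its endpoints, so halve the totals
    return {node: value / 2 for node, value in score.items()}


if __name__ == "__main__":
    # Hypothetical star-shaped module: one hub and four partners
    toy = build_graph([("hub", "tf1"), ("hub", "tf2"), ("hub", "tf3"), ("hub", "tf4")])
    print(degree(toy)["hub"], betweenness(toy)["hub"])
```

In the star example every shortest path between two partners passes through the hub, which is why removing or mutating such a node would disconnect the module.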
  70. Breast cancer biomarker genes deduced through bioinformatic analyses
  71. We do this at UMA with breast cancer miRNAs
A microRNA Signature Associated with Early Recurrence in Breast Cancer
Luis G. Pérez-Rivas1, José M. Jerez2, Rosario Carmona3, Vanessa de Luque1, Luis Vicioso4, M. Gonzalo Claros3,5, Enrique Viguera6, Bella Pajares1, Alfonso Sánchez1, Nuria Ribelles1, Emilio Alba1, José Lozano1,5*
1 Laboratorio de Oncología Molecular, Servicio de Oncología Médica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain
2 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, Málaga, Spain
3 Plataforma Andaluza de Bioinformática, Universidad de Málaga, Málaga, Spain
4 Servicio de Anatomía Patológica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain
5 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, Málaga, Spain
6 Departamento de Biología Celular, Genética y Fisiología Animal, Universidad de Málaga, Málaga, Spain
Abstract: Recurrent breast cancer occurring after the initial treatment is associated with poor outcome. A bimodal relapse pattern after surgery for primary tumor has been described with peaks of early and late recurrence occurring at about 2 and 5 years, respectively. Although several clinical and pathological features have been used to discriminate between low- and high-risk patients, the identification of molecular biomarkers with prognostic value remains an unmet need in the current management of breast cancer. Using microarray-based technology, we have performed a microRNA expression analysis in 71 primary breast tumors from patients that either remained disease-free at 5 years post-surgery (group A) or developed early (group B) or late (group C) recurrence. Unsupervised hierarchical clustering of microRNA expression data segregated tumors in two groups, mainly corresponding to patients with early recurrence and those with no recurrence. Microarray data analysis and RT-qPCR validation led to the identification of a set of 5 microRNAs (the 5-miRNA signature) differentially expressed between these two groups: miR-149, miR-10a, miR-20b, miR-30a-3p and miR-342-5p. All five microRNAs were down-regulated in tumors from patients with early recurrence. We show here that the 5-miRNA signature defines a high-risk group of patients with shorter relapse-free survival and has predictive value to discriminate non-relapsing versus early-relapsing patients (AUC = 0.993, p-value < 0.05). Network analysis based on miRNA-target interactions curated by public databases suggests that down-regulation of the 5-miRNA signature in the subset of early-relapsing tumors would result in an overall increased proliferative and angiogenic capacity. In summary, we have identified a set of recurrence-related microRNAs with potential prognostic value to identify patients who will likely develop metastasis early after primary breast surgery.
Citation: Pérez-Rivas LG, Jerez JM, Carmona R, de Luque V, Vicioso L, et al. (2014) A microRNA Signature Associated with Early Recurrence in Breast Cancer. PLoS ONE 9(3): e91884. doi:10.1371/journal.pone.0091884
Editor: Sonia Rocha, University of Dundee, United Kingdom
Received November 11, 2013; Accepted February 14, 2014; Published March 14, 2014
Copyright: © 2014 Pérez-Rivas et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by a grant from the Spanish Society of Medical Oncology (SEOM, to NR) and by grants from the Spanish Ministerio de Economía (SAF2010-20203 to J.L and TIN2010-16556 to J.J) and from the Junta de Andalucía (TIN-4026, to JJ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: jlozano@uma.es
Luis G. Pérez-Rivas and José M. Jerez contributed equally to this work.
Introduction
Breast cancer comprises a group of heterogeneous diseases that can be classified based on both clinical and molecular features [1–5]. Improvements in the early detection of primary tumors and the development of novel targeted therapies, together with the systematic use of adjuvant chemotherapy, has drastically reduced mortality rates and increased disease-free survival (DFS) in breast cancer. Still, about one third of patients undergoing breast tumor excision will develop metastases, the major life-threatening event which is strongly associated with poor outcome [6,7]. The risk of relapse after tumor resection is not constant over time. A detailed examination of large series of long-term follow-up studies over the last two decades reveals a bimodal hazard function with two peaks of early and late recurrence occurring at 1.5 and 5 years, respectively, followed by a nearly flat plateau in which the risk of relapse tends to zero [8–10]. A causal link between tumor surgery and the bimodal pattern of recurrence has been proposed by some investigators (i.e. an iatrogenic effect) [11]. According to that model, surgical removal of the primary breast tumor would accelerate the growth of dormant metastatic foci by altering the balance between circulating pro- and anti-angiogenic factors [9,11–14]. Such hypothesis is supported by the fact that the two peaks of relapse are observed regardless other factors than surgery, such as the axillary nodal status, the type of surgery or the administration of adjuvant therapy. Although estrogen receptor (ER)-negative tumors are commonly associated with a higher risk of early relapse [15], the bimodal distribution pattern is observed with independence of the hormone receptor status [16]. Other studies also suggest that the dynamics of tumor relapse may be a [...]
PLOS ONE | www.plosone.org | March 2014 | Volume 9 | Issue 3 | e91884
Slide labels: Molecular biology; Microarrays; Databases; Tools and algorithms…; Microarrays; Data mining
The 5-miRNA signature
MiR-149 was the most significant miRNA downregulated in group B, as determined by microarray hybridization and by RT-qPCR. This miRNA has been described as a TS-miR that regulates the expression of genes associated with cell cycle, invasion or migration and its downregulation has been observed in several tumor diseases, including gastric cancer and breast cancer [70,77–81]. Down-regulation of miR-149 can occur epigenetical- [...]
Figure caption (cropped): [...] early recurrence in breast cancer. Hierarchical clustering of the 71 tumor samples based [...] expression levels of the 5-miRNA signature defines a distinct cluster 2b which mainly includes [...] most patients with good prognosis (group A) had tumors with normal or higher-than [...] cluster 1b ("low risk").
Figure caption (cropped): [...] patients with different RFS. A) Kaplan-Meier graph for the whole patient cohort included in [...] overall down-regulation of the 5-miRNA signature (i.e. those from cluster 2b in Fig. 2) were [...] RFS was calculated (red line). RFS was also calculated for the remaining patients in the cohort [...] the 5-miRNA signature specifically discriminates tumors with an overall higher risk of early [...]
[...] post-recurrence survival [100], likely because it targets AKT1 mRNA [101]. In sum, the available bibliographic data suggests that down-regulation of miR-149, miR-30a-3p, miR-20b, miR-10a and miR342-5p in primary breast tumors could confer them enhanced proliferative, angiogenic and invasive potentials.
Prognostic value of the 5-miRNA signature. The relationship between expression of the 5-miRNA signature and RFS was examined by a survival analysis. Figure 3A shows a Kaplan-Meier graph for the whole series of patients included in the study. Due to the intrinsic characteristics of the cohort, decreases in the RFS are only observed in the intervals 0–24 and 50–60 months (corresponding to groups B and C, respectively). We next grouped the tumors according to their 5-miRNA signature status in two different groups. One group included those tumors with all five miRNAs simultaneously downregulated (FC > 2 and p < 0.05) and a second group included those tumors not having all five miRNAs downregulated. A survival analysis was performed using clinical data from the corresponding patients. As shown in Figure 3B, the Kaplan-Meier graphs for the two groups demonstrate that the 5-miRNA signature defines a "high risk" group of patients with a [...]
Figure 4. Receiver operating characteristic curve (ROC) for early breast cancer recurrence by the 5-miRNA signature status. ROC curves generated using the prognosis information and expression levels of the 5-miRNA signature can discriminate between [...]
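The abstract quotes an area under the ROC curve (AUC) of 0.993 for the 5-miRNA signature. As a reminder of what that number means, the sketch below computes AUC as the probability that a randomly chosen early-relapsing tumour gets a higher risk score than a randomly chosen non-relapsing one. It is a generic illustration, not the authors' analysis; the function name and the scores are invented.

```python
def roc_auc(scores_positive, scores_negative):
    """Probability that a random positive outscores a random negative (ties count half)."""
    if not scores_positive or not scores_negative:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


if __name__ == "__main__":
    # Hypothetical risk scores (higher = signature more strongly down-regulated)
    early_relapse = [0.9, 0.8, 0.7]
    no_relapse = [0.75, 0.2, 0.1]
    print(roc_auc(early_relapse, no_relapse))
```

An AUC of 0.5 corresponds to a score that separates the groups no better than chance, and 1.0 to perfect separation, so a value close to 1 indicates that the two patient groups barely overlap on the score.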
  72. Bioinformatics explains some observations
Molecular Evidence for the Inverse Comorbidity between Central Nervous System Disorders and Cancers Detected by Transcriptomic Meta-analyses
Kristina Ibáñez1, César Boullosa1, Rafael Tabarés-Seisdedos2, Anaïs Baudot3*, Alfonso Valencia1*
1 Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
2 Department of Medicine, University of Valencia, CIBERSAM, INCLIVA, Valencia, Spain
3 Aix-Marseille Université, CNRS, I2M, UMR 7373, Marseille, France
Abstract: There is epidemiological evidence that patients with certain Central Nervous System (CNS) disorders have a lower than expected probability of developing some types of Cancer. We tested here the hypothesis that this inverse comorbidity is driven by molecular processes common to CNS disorders and Cancers, and that are deregulated in opposite directions. We conducted transcriptomic meta-analyses of three CNS disorders (Alzheimer's disease, Parkinson's disease and Schizophrenia) and three Cancer types (Lung, Prostate, Colorectal) previously described with inverse comorbidities. A significant overlap was observed between the genes upregulated in CNS disorders and downregulated in Cancers, as well as between the genes downregulated in CNS disorders and upregulated in Cancers. We also observed expression deregulations in opposite directions at the level of pathways. Our analysis points to specific genes and pathways, the upregulation of which could increase the incidence of CNS disorders and simultaneously lower the risk of developing Cancer, while the downregulation of another set of genes and pathways could contribute to a decrease in the incidence of CNS disorders while increasing the Cancer risk. These results reinforce the previously proposed involvement of the PIN1 gene, Wnt and P53 pathways, and reveal potential new candidates, in particular related with protein degradation processes.
Citation: Ibáñez K, Boullosa C, Tabarés-Seisdedos R, Baudot A, Valencia A (2014) Molecular Evidence for the Inverse Comorbidity between Central Nervous System Disorders and Cancers Detected by Transcriptomic Meta-analyses. PLoS Genet 10(2): e1004173. doi:10.1371/journal.pgen.1004173
Editor: Marshall S. Horwitz, University of Washington, United States of America
Received September 16, 2013; Accepted December 30, 2013; Published February 20, 2014
Copyright: © 2014 Ibáñez et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by a Fellowship from Obra Social la Caixa grant to KI (http://obrasocial.lacaixa.es/laCaixaFoundation/home_en.html), FPI grant BES-2008-006332 to CB and grant BIO2012 to AV Group. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: anais.baudot@univ-amu.fr (AB); avalencia@cnio.es (AV)
Kristina Ibáñez and César Boullosa contributed equally to this work.
Introduction
Epidemiological evidences point to a lower-than-expected probability of developing some types of Cancer in certain CNS [...]
Results and Discussion
For each CNS disorder and Cancer type independently, we undertook meta-analyses from a large collection of microarray [...]
Cropped column: [...]rnal factors (for review, see [3–7]). In [...] deregulation in opposite directions of a [...] and pathways as an underlying cause of [...] plausibility of this hypothesis, [...] establish the existence of inverse gene [...] (i.e., down- versus up-regulations) in CNS [...] Towards this objective, we have performed [...] of collections of gene expression data, [...] AD, PD and SCZ, and Lung (LC), [...] Prostate (PC) Cancers. Clinical and [...] previously reported inverse comorbidities for [...] according to population studies assessing [...] patients with CNS disorders [8–17].
[...] significant overlaps (Fisher's exact test, corrected p-value (q-value) < 0.05, see Methods) between the DEGs upregulated in CNS disorders and those downregulated in Cancers. Similarly, DEGs downregulated in CNS disorders overlapped significantly with DEGs upregulated in Cancers (Figure 1A). Significant overlaps between DEGs deregulated in opposite directions in CNS disorders and Cancers are still observed while setting more stringent cutoffs for the detection of DEGs (q-values lower than 0.005, 0.0005, 0.00005 and 0.000005, Figure S1). A significant overlap between DEGs deregulated in the same direction was only identified in the case of CRC and PD upregulated genes (Figure 1A). A molecular interpretation of the inverse comorbidity between CNS disorders and Cancers could be that the downregulation of certain [...]
PLOS Genetics | www.plosgenetics.org | February 2014 | Volume 10 | Issue 2 | e1004173
Slide labels: Comparison of differentially expressed genes; Workflow
Slide caption: It was already known that Alzheimer's patients suffer less cancer than the rest of the population.
  73. It is plain to see
[...] (Figure 2, Figure S2, Table S3). The inverse relationship between the levels of expression deregulations of these pathways possibly suggests opposite roles in CNS disorders and Cancers.
[...] Figure 3). Hence, global regulations of cellular activity may account for a protective effect between inversely comorbid diseases.
Figure 2. KEGG pathways significantly deregulated in Central Nervous System (CNS) disorders and Cancer types. KEGG pathways [24] significantly up- and downregulated in each disease were identified using the GSEA method [34] (q-value < 0.05). The significant pathways were compared between the 6 diseases and combined in a network representation. Node pie charts are coloured according to the pathway status as Cancer upregulated (yellow), Cancer downregulated (blue), CNS disorder upregulated (green) and CNS disorder downregulated (red). The green/blue and yellow/red associations thus correspond to pathways deregulated in opposite directions in CNS disorders and Cancers. Pathway labels are coloured according to their classifications provided by KEGG [24], as: Metabolism (green), Genetic Information Processing (yellow), Cellular Process (pink), Environmental Information Processing (red) and Organismal Systems (dark red). All networks are available at bioinfo.cnio.es/people/cboullosa/validation/cytoscape/Ibanezetal.zip, in cytoscape format (http://www.cytoscape.org/). doi:10.1371/journal.pgen.1004173.g002
PLOS Genetics | www.plosgenetics.org | February 2014 | Volume 10 | Issue 2 | e1004173
Slide caption: Cancer (prostate, colorectal, lung) shares 93 genes with diseases of the central nervous system (Parkinson's, Alzheimer's, schizophrenia):
• ↑↑ in cancer, ↓↓ in the diseased CNS: 74 genes
• ↓↓ in cancer, ↑↑ in the diseased CNS: 19 genes
Diagram labels: Genes exclusive to cancer; Genes exclusive to the diseased CNS
  74. The most striking short-term application
• Some antidepressant drugs could be used as anticancer medicines
• Some antineoplastic drugs can be used against CNS diseases
• Bexarotene (used against skin cancer) is effective for treating Alzheimer's disease in mice
http://esmateria.com/2014/02/20/iluminado-el-blindaje-contra-el-cancer-de-personas-con-otras-enfermedades-en-el-cerebro/
  75. And it did not take long: 31 March 2014
  76. The genome does not let us predict the organism
?
  77. We are beginning to work out appearance from the genome
Modeling 3D Facial Shape from DNA
Peter Claes1, Denise K. Liberton2, Katleen Daniels1, Kerri Matthes Rosana2, Ellen E. Quillen2, Laurel N. Pearson2, Brian McEvoy3, Marc Bauchet2, Arslan A. Zaidi2, Wei Yao2, Hua Tang4, Gregory S. Barsh4,5, Devin M. Absher5, David A. Puts2, Jorge Rocha6,7, Sandra Beleza4,8, Rinaldo W. Pereira9, Gareth Baynam10,11,12, Paul Suetens1, Dirk Vandermeulen1, Jennifer K. Wagner13, James S. Boster14, Mark D. Shriver2*
1 Medical Image Computing, ESAT/PSI, Department of Electrical Engineering, KU Leuven, Medical Imaging Research Center, KU Leuven UZ Leuven, iMinds-KU Leuven Future Health Department, Leuven, Belgium
2 Department of Anthropology, Penn State University, University Park, Pennsylvania, United States of America
3 Smurfit Institute of Genetics, Dublin, Ireland
4 Department of Genetics, Stanford University, Palo Alto, California, United States of America
5 HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, United States of America
6 CIBIO: Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Porto, Portugal
7 Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
8 IPATIMUP: Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto, Portugal
9 Programa de Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasilia, Brasil
10 School of Paediatrics and Child Health, University of Western Australia, Perth, Australia
11 Institute for Immunology and Infectious Diseases, Murdoch University, Perth, Australia
12 Genetic Services of Western Australia, King Edward Memorial Hospital, Perth, Australia
13 Center for the Integration of Genetic Healthcare Technologies, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
14 Department of Anthropology, University of Connecticut, Storrs, Connecticut, United States of America
Abstract: Human facial diversity is substantial, complex, and largely scientifically unexplained. We used spatially dense quasi-landmarks to measure face shape in population samples with mixed West African and European ancestry from three locations (United States, Brazil, and Cape Verde). Using bootstrapped response-based imputation modeling (BRIM), we uncover the relationships between facial variation and the effects of sex, genomic ancestry, and a subset of craniofacial candidate genes. The facial effects of these variables are summarized as response-based imputed predictor (RIP) variables, which are validated using self-reported sex, genomic ancestry, and observer-based facial ratings (femininity and proportional ancestry) and judgments (sex and population group). By jointly modeling sex, genomic ancestry, and genotype, the independent effects of particular alleles on facial features can be uncovered. Results on a set of 20 genes showing significant effects on facial features provide support for this approach as a novel means to identify genes affecting normal-range facial features and for approximating the appearance of a face from genetic markers.
Citation: Claes P, Liberton DK, Daniels K, Rosana KM, Quillen EE, et al. (2014) Modeling 3D Facial Shape from DNA. PLoS Genet 10(3): e1004224. doi:10.1371/journal.pgen.1004224
Editor: Daniela Luquetti, Seattle Children's Research Institute, United States of America
Received September 12, 2013; Accepted January 22, 2014; Published March 20, 2014
Copyright: © 2014 Claes et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This investigation was supported by grants to MDS from Science Foundation of Ireland Walton Fellowship (04.W4/B643); to MDS and DAP from the National Institute Justice (2008-DN-BX-K125); to JKW from the NIH/National Human Genome Research Institute (K99HG006446); to DKL from the National Science Foundation (BCS-0851815) and from the Wenner Gren Foundation (Fieldwork Grant 7967). PC is partly supported by the Flemish Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT Vlaanderen), the Research Program of the Fund for Scientific Research - Flanders (Belgium) (FWO), the Research Fund KU Leuven and SB was supported by the Portuguese Institution "Fundação para a Ciência e a Tecnologia" [FCT; PTDC/BIABDE/64044/2006 (project) and SFRH/BPD/21887/2005 (post-doc grant)] and by a Dean's Postdoctoral Fellowship at Stanford University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: mds17@psu.edu
Introduction
The craniofacial complex is initially modulated by precisely-timed embryonic gene expression and molecular interactions mediated through complex pathways [1]. As humans grow, hormones and biomechanical factors also affect many parts of the face [2,3]. The inability to systematically summarize facial variation has impeded the discovery of the determinants and correlates of face shape. In contrast to genomic technologies, systematic and comprehensive phenotyping has lagged. This is especially so in the context of multipartite traits such as the human face. In typical genome-wide association studies (GWAS) today phenotypes are summarized as univariate variables, which is inherently limiting for multivariate traits, which, by definition cannot be expressed with single variables.
Current state-of-the-art genetic association studies for facial traits are limited in their description of facial morphology [4–7]. These analyses start from a sparse set of anatomical landmarks (these being defined as ‘‘a point of correspondence on an object that matches between and within populations’’), which overlooks salient features of facial shape. Subsequently, either a set of conventional morphometric mea- surements such as distances and angles are extracted, which drastically oversimplify facial shape, or a set of principal components (PCs) are extracted using principal components analysis (PCA) on the shape-space obtained with superimposition techniques, where each PC is assumed to represent a distinct morphological trait. Here we describe a novel method that facilitates the compounding of all PCs into a single scalar variable customized to relevant independent variables including, sex, genomic ancestry, and genes. Our approach combines placing PLOS Genetics | www.plosgenetics.org 1 March 2014 | Volume 10 | Issue 3 | e1004224 Figure 4. Relationships between the ancestry and sex RIP variables and their initial predictor variables. (A) RIP-A with genomic ancestry; genomic ancestry is calculated using the core panel of 68 AIMs and RIP-A is calculated using this ancestry estimate on the set of three populations combined (N = 592). Populations are indicated as shown in the legend with United States participants shown with black circles, Brazilians with red circles, and Cape Verdeans with blue circles. (B) Histograms of RIP-S by self-reported sex. doi:10.1371/journal.pgen.1004224.g004
  78. Bioinformatics, COPD, and publications. Chen and Wang, Journal of Clinical Bioinformatics 2011, 1:35, doi:10.1186/2043-9113-1-35. Bioinformatics is needed to discover the candidates. Pure, hard-core bioinformatics. With bioinformatics we discover: Neither the clinical informatician nor the biomedical engineer will publish here.
  80. Bioinformaticians and publications. Microarrays; Databases; Microarrays; Data mining; Machine learning. Collaboration takes you further.
  81. An example of collaboration and integration: olive allergy. The allergenic proteins are in the pollen.
  82. Construction of a pollen gene library. Research group of Juan de Dios Alché, Estación Experimental «El Zaidín» (Granada).
  83. Step 1: Sequencing in the laboratory. Picasso. Edificio de Bioinnovación. The virtual machines of picasso are used.
  84. Step 2: Assembly, from sequence to transcriptome. The FAT NODES (shared-memory machines) of picasso are used.
  85. Step 3: Annotation and biological enrichment. Already known allergens appear (Ole1-10). New, previously unknown allergens are being identified. The supercomputing COMPUTING NODES are used.
  86. There is still much to discover.
  89. Our small interdisciplinary group. Think design; Coding; Testing. Almudena C, Darío C, Juan C, Noé B, Rocío B, Gonzalo B, Isabel B, Hicham B, Rosario B, Pedro B. Biologists and the like; Computer engineer; B; C; IS; Bioinformaticians. I need bioinformaticians! IS. Rocío, Gonzalo, Noé, Rafa, Hicham, Almudena, Antonio Banderas.
