Big Data in Biomedicine
http://www.nature.com/news/gene-data-to-hit-milestone-1.11019
We live in a Big Data world
Course outline
1. Analyzing genomic data
   1. Traditional bioinformatics tools
   2. Microarrays/gene lists without any code
   3. Microarrays/gene lists with code
   4. NGS and mRNA-seq
2. Beyond genomics
   1. Protein-protein interaction network
3. General data handling tools
   1. Storing your data
   2. Data are dirty
4. Statistics made easy
5. Graphics rules!
6. Demystifying “the work”! (the code)
7. Conclusion + Q&A
Traditional bioinformatics tools
Bioinformatics software to solve everyday problems.
The EMBOSS tool suite: http://emboss.sourceforge.net/
One web portal is: http://mobyle.pasteur.fr/cgi-bin/portal.py
- DNA / AA pairwise global and local alignment
- Sequence feature analysis (CpG island, gene scan, restriction enzyme site, 2D/3D structure...)
- Protein structure and domains
- Similarity search (Blast, phi-blast, psi-blast, delta-blast...)
- Phylogenetics (trees from multiple alignments)
- ...
UPGMA joining method
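To make the UPGMA joining method concrete, here is a minimal sketch in base R. It assumes you already have a matrix of pairwise distances between sequences; the four sequence names and all distance values below are made up for illustration. In base R, `hclust()` with `method = "average"` is the UPGMA procedure.

```r
# Toy pairwise distances between four hypothetical sequences (made-up values).
seq_names <- c("seqA", "seqB", "seqC", "seqD")
d <- matrix(c( 0,  2,  6, 10,
               2,  0,  6, 10,
               6,  6,  0, 10,
              10, 10, 10,  0),
            nrow = 4, byrow = TRUE,
            dimnames = list(seq_names, seq_names))

# Average linkage on a distance object is UPGMA.
tree <- hclust(as.dist(d), method = "average")

# Heights at which clusters are joined, in merge order.
print(tree$height)

# Draw the tree only in an interactive session.
if (interactive()) plot(tree, main = "UPGMA tree (toy data)")
```

The closest pair is joined first, then each remaining sequence is attached at the average distance to the members of the existing cluster.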
Traditional bioinformatics tools
Bioinformatics software to solve everyday problems.
Some tools are provided through database interfaces such as NCBI Entrez.
- The UCSC genome browser
  - The ENCODE project results
  - For example: visualize GC content and restriction enzyme sites in your gene of interest.
Stating the obvious
Just because you have a GUI does not mean the analysis is brain-dead simple.
Analyzing genomic data
Analyzing microarray gene expression data without any code.
GenePattern: http://genepattern.broadinstitute.org/gp/
Analyzing genomic data
Upload your expression data as a text file. GenePattern takes RES and GCT files.
Conversion tools are provided to transform CEL files to GCT or RES.
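If you want to peek inside a GCT file before uploading it, the following base R sketch may help. It assumes the GCT 1.2 layout (a `#1.2` version line, a line with the number of rows and samples, then a tab-delimited table whose first two columns are `Name` and `Description`). The gene names, sample names, and values are made up.

```r
# Write a tiny GCT file to a temporary location (all values are made up).
gct_path <- tempfile(fileext = ".gct")
writeLines(c(
  "#1.2",
  "2\t3",
  "Name\tDescription\tsample1\tsample2\tsample3",
  "geneA\thypothetical gene A\t10.5\t11.2\t9.8",
  "geneB\thypothetical gene B\t3.1\t2.9\t3.4"
), gct_path)

# The second line declares the table dimensions: rows, then samples.
header <- readLines(gct_path, n = 2)
dims <- as.integer(strsplit(header[2], "\t")[[1]])

# Skip the two header lines and read the expression table itself.
expr <- read.delim(gct_path, skip = 2, check.names = FALSE,
                   stringsAsFactors = FALSE)

# Sanity check: the table must match the declared dimensions.
stopifnot(nrow(expr) == dims[1], ncol(expr) - 2 == dims[2])

# Numeric matrix of expression values, one row per gene.
mat <- as.matrix(expr[, -(1:2)])
rownames(mat) <- expr$Name
print(mat)
```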
Analyzing genomic data
StarBiogene http://web.mit.edu/star/biogene/index.html (Java web app)
- Part of GenePattern but provides pipeline-style processing online
SeqExpress http://www.seqexpress.com/ (Windows only)
- Alternative independent application (less activity than GenePattern)
Expander http://acgt.cs.tau.ac.il/expander/
- Alternative independent application (less activity than GenePattern)
RMAExpress http://rmaexpress.bmbolstad.com/
- Useful for performing quality control of your microarrays.
Cluster http://bonsai.hgc.jp/~mdehoon/software/cluster/
- This is the original program to analyze microarray results. No pre-processing functionality. You need to pre-process separately (using RMAExpress for example).
SAM http://www-stat.stanford.edu/~tibs/SAM/ (Significance Analysis of Microarrays)
- To extract the DE genes. This is an Excel plugin. Again, you need to pre-process separately.
Analyzing genomic data
Commercial solution
GeneSpring GX (first 20 days are free)
Access through subscription @ Stanford with CMGM: http://cmgm3.stanford.edu
Interpreting your results
Interpreting a gene list relies on external knowledge.
Several resources / tools are available to help.
KEGG: http://www.genome.jp/kegg/ pathway database
REACTOME: http://www.reactome.org pathway 2.0 database
Gene Ontology: http://www.geneontology.org/ the ultimate resource for gene function, processes, localization
BioMart: http://www.biomart.org/ portal providing access to multiple databases
GSEA: http://www.broadinstitute.org/gsea/index.jsp part of GenePattern but also available in R
DAVID: http://david.abcc.ncifcrf.gov/ to perform an over-representation analysis
BiNGO: http://www.psb.ugent.be/cbd/papers/BiNGO/Home.html over-representation analysis, but produces graphical results (Cytoscape)
BioGPS: http://biogps.org/ to find out where your gene is expressed in the body or in which cell line
Interpreting your results
Reactome
• Made to be used programmatically
• Cytoscape (a network tool) has a plugin for Reactome. Just give a gene list, or a list of genes + the number of samples where each gene is mutated (for Cox survival analysis)
- Retrieve a network from a gene list
- Do network analysis
- Perform Gene Ontology analysis
- Survival analysis
http://www.reactome.org/userguide/Usersguide.html#FI_Network_Tool
Interpreting your results
DAVID database
Perform fast over-representation analysis against different databases
- KEGG; Reactome; OMIM (diseases); GeneRIF (literature); protein domains etc...
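An over-representation analysis boils down to asking whether your gene list overlaps a pathway more than expected by chance. The sketch below uses a plain hypergeometric test in base R as a stand-in; it is not DAVID's exact scoring, and every number in it is made up.

```r
# Hypothetical numbers, for illustration only.
universe_size <- 20000  # genes measured on the platform
pathway_size  <- 100    # genes annotated to the pathway
list_size     <- 200    # genes in your list
overlap       <- 8      # genes in both the list and the pathway

# Overlap expected by chance alone.
expected <- list_size * pathway_size / universe_size

# P(overlap >= observed) under the hypergeometric distribution.
p_value <- phyper(overlap - 1,
                  pathway_size,
                  universe_size - pathway_size,
                  list_size,
                  lower.tail = FALSE)

cat("expected overlap:", expected, "\n")
cat("p-value:", signif(p_value, 3), "\n")
```

When you test many pathways at once, remember to correct the p-values for multiple testing, for example with `p.adjust()`.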
Interpreting your results
BioGPS: exploring expression across tissues and cell lines
Look at other libraries of tissues
Interpreting your results
RMAExpress and quality control of microarrays
Several tests exist to check whether the microarray performed correctly.
Hall of fame of failed microarrays: http://plmimagegallery.bmbolstad.com/
Analyzing public gene expression data
Analyzing public microarray data with code (kind of...)
Analyzing public gene expression data
Then click on the “TOP 250” button
Analyzing public gene expression data
Top 250 genes
R code
Next Generation Sequencing
The main NGS platforms are:
• Roche/454 (Genome Sequencer; GS)
• Illumina/Solexa (Genome Analyzer)
• SOLiD (Applied Biosystems)
Upcoming challengers:
• Ion Torrent (Life Technologies)
• Oxford Nanopore
What you should request
Done by the core facility
Analyzing mRNA-seq
Analyzing mRNA-seq data: 4 steps.
1- Alignment and trimming of reads:
   [no GUI]
   Tophat (assembly and splice junction mapper)
   Cufflinks (assembly and RPKM estimates)
   [with GUI and commercial]
   Genome Studio from Illumina
   Genomequest [looks pretty awesome.]
   GALAXY provides access to Tophat, Cufflinks.
2- Calling variants and indels:
   GATK (http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page)
   VarScan (http://varscan.sourceforge.net/)
   SHRIMP2; VARiD; Atlas-SNP2; SomaticSniper...
   Interpretation of variants: SIFT (galaxy)
3- Finding differentially expressed genes:
   Cuffdiff (galaxy)
   DEXseq (R)
4- Visualization:
   SAVANT (http://genomesavant.com/savant/)
   IGV (http://www.broadinstitute.org/software/igv)
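The RPKM estimates mentioned in step 1 are read counts normalized by gene length and sequencing depth: reads per kilobase of transcript per million mapped reads. Here is a minimal base R sketch of that formula. The function name `rpkm`, the gene names, and all counts are made up, and this is the textbook formula, not a reproduction of Cufflinks' internal estimation.

```r
# RPKM = 1e9 * reads_on_gene / (total_mapped_reads * gene_length_in_bp)
rpkm <- function(counts, gene_length_bp, total_mapped_reads) {
  1e9 * counts / (total_mapped_reads * gene_length_bp)
}

# Hypothetical example: two genes, one library of one million mapped reads.
counts  <- c(geneA = 100, geneB = 200)
lengths <- c(geneA = 1000, geneB = 2000)  # base pairs
total   <- 1e6

values <- rpkm(counts, lengths, total)
print(values)
```

geneB has twice the reads of geneA but is also twice as long, so both end up with the same RPKM.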
How to use Galaxy?
Analyzing mRNA-seq data: introducing GALAXY
http://galaxy.psu.edu/
Working in the cloud
Dudley JT, and Butte AJ. 2010. In silico research in the era of cloud computing. Nat Biotechnol 28: 1181–1185.
Summary mRNA-seq
GALAXY
This is a compendium of software. You even have UNIX tools and EMBOSS in it.
Take home message:
FASTQ files > Tophat > Cuffdiff > IGV (for differential expression)
FASTQ files > Tophat > GATK > IGV (for variant detection)
Where to find help: http://seqanswers.com
Analyzing RNA-seq using R
DEXSeq is an R / Bioconductor package.
R is statistical programming software widely used in bioinformatics.
Summary mRNA-seq
Additional tools for genomics
-- GenomeSpace: http://www.genomespace.org
   Collection of tools: GenePattern, Galaxy, Cytoscape, Genomica etc... (free, apparently). Data are stored in the cloud on Amazon VM.
If you do not want to do it yourself:
-- Science Exchange: https://www.scienceexchange.com/
   Science jobs for hire! This is where top core facilities compete to provide the best service.
-- Assay Depot: https://www.assaydepot.com/
   Like Home Depot but for science
-- TaskRabbit: http://www.taskrabbit.com/
   If science takes too much of your time!
Beyond genomics: results interpretation
Interpreting your gene list with protein-protein interaction networks.
iHOP: http://www.ihop-net.org/UniPub/iHOP/
Ingenuity Pathway Analysis (commercial): access through CMGM @ Stanford
Beyond genomics: results interpretation
Looking into PPI databases:
IntAct: http://www.ebi.ac.uk/intact/
BioGrid: http://thebiogrid.org/ (soon multigene search)
HPRD: http://www.hprd.org/index_html
What about open-source solutions for searching the interactions between the genes in your gene list?
• Cytoscape http://cytoscape.org
  • BioNetBuilder http://chianti.ucsd.edu/cyto_web/plugins/
  • ...
• R for programmatic access to databases
  • http://brainchronicle.blogspot.com
The advantage of using R is that results are reproducible and you can share your method more easily than with a point-and-click interface.
Data management and manipulation
REDCap: http://project-redcap.org/
Web app for building and managing online surveys and databases
To find participants: https://www.researchmatch.org
MySQL for a professional relational database. Requires some programming skills in SQL and database design.
Applications to query and build databases (goodbye command line):
[OS X]: Sequel Pro
[Windows]: SQLyog; Toad for MySQL...
Data are dirty...
How to clean your data more efficiently than doing everything by hand?

12:10:00   9999999   POCT Comment                  GLUCOSE BY METER
21:24:00   51        O2 Saturation, ISTAT (Ven)    ISTAT EG7, VENOUS
5:39:00    91        Glu                           GLUCOSE BY METER
10:58:00   9999999   Comments                      BLOOD CULTURE (2 AEROBIC BOTTLES)
9:36:00    9999999   Report Status                 BLOOD CULTURE (2 AEROBIC BOTTLES)
16:25:00   25        CO2, Ser/Plas                 METABOLIC PANEL, COMPREHENSIVE
8:12:00    132       Glucose, Ser/Plas             METABOLIC PANEL, BASIC
8:06:00    5.7       MONO, %                       CBC WITH DIFF
8:01:00    9.6       Glucose                       METABOLIC PANEL, BASIC
13:22:00   16.2      CO2 (a)                       BLOOD GASES, ARTERIAL
4:45:00    2.7       MONO                          CBC WITH DIFF

DataWrangler @ Stanford: http://vimeo.com/19185801
Google-refine @ down the road. A bit less intuitive than Wrangler.
For more complex data transformation: reshape2 package in R
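As a small taste of cleaning such a table with code rather than by hand, here is a base R sketch using four of the rows shown above. The column names (`time`, `value`, `component`, `panel`) are made up for illustration, and treating 9999999 as a "no numeric result" placeholder is an assumption about this data set.

```r
# A few rows from the lab table above; column names are hypothetical.
labs <- data.frame(
  time      = c("12:10:00", "5:39:00", "8:12:00", "8:01:00"),
  value     = c(9999999, 91, 132, 9.6),
  component = c("POCT Comment", "Glu", "Glucose, Ser/Plas", "Glucose"),
  panel     = c("GLUCOSE BY METER", "GLUCOSE BY METER",
                "METABOLIC PANEL, BASIC", "METABOLIC PANEL, BASIC"),
  stringsAsFactors = FALSE
)

# Assumption: 9999999 is a placeholder for "no numeric result".
labs$value[labs$value == 9999999] <- NA

# The same analyte appears under several spellings ("Glu", "Glucose, ...").
labs$is_glucose <- grepl("^glu", labs$component, ignore.case = TRUE)

# Keep only glucose measurements that carry a real value.
glucose <- labs[labs$is_glucose & !is.na(labs$value), ]
print(glucose[, c("time", "value", "component")])
```

Because every step is written down, the same cleaning can be rerun when a new export of the data arrives.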
Statistics made easy...
Excel... Obviously. But what else when you want something more powerful?
• Switch to statistical software like R.
  • R graphical interface: Deducer (http://www.deducer.org/)
  • http://www.youtube.com/watch?v=T6kOvlMaFCA
The case for starting to use R
1. Powerful statistics procedures
   • R has become the lingua franca for statistical programming
2. Packages for everything from
   • Flow cytometry
   • DNA microarrays
   • RNA-seq
   • Google graph API
   • ... See http://goo.gl/RwER7
3. Graphics, graphics, graphics...
   • R graphical manual: http://goo.gl/qSHMQ
Data Science Visualization: Circos
CIRCOS: http://circos.ca/
To visualize genome-scale interaction and functional information.
CIRCOS is a Perl program. Some light programming is needed. But it is worth it!
Data Science Visualization
Tableau: http://www.tableausoftware.com/
Great for geo-localized data
Data Science Visualization
Google Visualization: https://developers.google.com/chart/interactive/docs/gallery
Requires data in JSON format. Fortunately a bridge with R is possible.
Earthquake in Japan
Data Science Visualization
Google Visualization: https://developers.google.com/chart/interactive/docs/gallery
Motion chart: http://www.youtube.com/watch?v=rnF-7TCIe08
R commands (Fruits is an example data set shipped with the googleVis package):
> library(googleVis)
> M1 <- gvisMotionChart(Fruits, idvar="Fruit", timevar="Year")
> plot(M1)
Demystifying the work
It's all about “reproducible research”.
Sharing your analytical process (aka what you did) is as important as the final manuscript.
How do you share what you did with a graphical interface?
The solution is to use a programming language, like R if suitable, and share your code.
Several tools can make your life easier.
RStudio or Deducer
Come to the workshop in 2 weeks!
The kitchen
TextMate and Notepad++ for coding
Use version control, hosted on services like GitHub or Bitbucket
To make research reproducible when data are not available:
DataThief: http://www.datathief.org/
To follow the latest buzz in science: Twitter @druau
Some R books. Most of those books are available online for free through the Stanford Library.
Q&A
This class was sponsored by the Office of Postdoctoral Affairs and the Lane Library.
Offline questions to email@example.com
Thanks!