Human Genome and Big Data Challenges

1,078 views

Published on

Presented to the Big Data in Symptoms Research Boot Camp, National Institute of Nursing Research, July 24, 2014

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,078
On SlideShare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
58
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • 1 hr
  • Within biomedical research, many data types
    Victims of our own success
    Data production outstrips data handling and analysis
    Major long-term changes are needed
  • Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124
    http://www.reuters.com/article/2012/03/28/us-science-cancer-idUSBRE82R12P20120328
  • Federal Information Security Management Act of 2002
    The Health Insurance Portability and Accountability Act of 1996
  • Human Genome and Big Data Challenges

    1. 1. Human Genome and Big Data Challenges Philip E. Bourne Ph.D. Associate Director for Data Science National Institutes of Health http://www.slideshare.net/pebourne
    2. 2. The History of Analytics in Biomedical Research 1980s 1990s 2000s 2010s 2020 Discipline: Unknown Expt. Driven Emergent Over-sold A Service A Partner A Driver The Raw Material: Non-existent Limited /Poor More/Ontologies Big Data/Siloed Open/Integrated The People: No name Technicians Industry recognition data scientists Academics Searls (ed) The Roots in Bioinformatics Series PLOS Comp Biol
    3. 3. Data Science Timeline 6/12 • U54 Centers of Excellence - under review • U54 BD2K-LINCS– under review • U24 Data Discovery Index– under review • R01, R41, R42, R43, R44, U01 software and analysis methods grants – on-going • T32, T15, K01, R25 and R26 training awards – under review 2/14 3/14
    4. 4. “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair …”
    5. 5. A Tale of Two Numbers Source Michael Bell http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
    6. 6. Myriad Data Types Other ‘Omic Imaging Phenotypic Clinical Genomic Exposure
    7. 7. Growth May Just be Beginning  Evidence: – Google car – 3D printers – Waze – Robotics From: The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies by Erik Brynjolfsson & Andrew McAfee
    8. 8. There are other drivers of change out there besides economics and an increasing emphasis on data and analytics
    9. 9. Politicians Demand It: G8 open data charter 9http://opensource.com/government/13/7/open-data-charter-g8
    10. 10. [from Harlem Krumholtz]
    11. 11. The Story of Meredith http://fora.tv/2012/04/20/Congress_Unplugged_ Phil_Bourne Stephen Friend
    12. 12. We have Entered An Era of Deinstitutionalize & Democratization of Science Daniel Hulshizer/Associated Press
    13. 13. We have Entered An Era of Deinstitutionalize & Democratization of Science – NIH Should Support This Daniel Hulshizer/Associated Press
    14. 14. 47/53 “landmark” publications could not be replicated [Begley, Ellis Nature, 483, 2012] [Carole Goble]
    15. 15. Scholarship is broken  I have a paper with 16,000 citations that no one has ever read  I have papers in PLOS ONE that have more citations than ones in PNAS  I have data sets I am proud of few places to put them  I edited a journal but it did not count for much
    16. 16. It was the age when software developers are in the greatest demand for science.. It was the age when the rewards outside academia are greater than the rewards inside
    17. 17. It was a time when patient data are becoming more available It is a time when the ability to maintain the anonymity of a patient gets harder and harder
    18. 18. To Summarize Thus Far … A time of great (unprecedented?) scientific development but limited funding A time of upheaval in the way we do science
    19. 19. From a funders perspective… A time to squeeze every cent/penny to maximize the amount of research that can be done A time when top down approaches meet bottom up approaches
    20. 20. Top Down vs Bottom Up  Top Down – Regulations e.g. US: Common Rule, FISMA, HIPPA – Data sharing policies • OSTP • GWAS • Genome data • Clinical trials – Digital enablement – Moves towards reproducibility  Bottom Up – Communities emerge and crowd source • Collaboration • Data shared • Open source software • Common principles • Standards
    21. 21. Okay so what are we doing about it?
    22. 22. To start with we are thinking about the complete research lifecycle
    23. 23. The Research Life Cycle IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION
    24. 24. Tools and Resources Will Continue To Be Developed IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION Authoring Tools Lab Notebooks Data Capture Software Analysis Tools Visualization Scholarly Communication
    25. 25. Those Elements of the Research Life Cycle Need to Become More Interconnected Around a Common Framework IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION Authoring Tools Lab Notebooks Data Capture Software Analysis Tools Visualization Scholarly Communication
    26. 26. Those Elements of the Research Life Cycle Need to Become More Interconnected Around a Common Framework IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION Authoring Tools Lab Notebooks Data Capture Software Analysis Tools Visualization Scholarly Communication Commercial & Public Tools Git-like Resources By Discipline Data Journals Discipline- Based Metadata Standards Community Portals Institutional Repositories New Reward Systems Commercial Repositories Training
    27. 27. Those Elements of the Research Life Cycle Need to Become More Interconnected Around a Common Framework IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION Authoring Tools Lab Notebooks Data Capture Software Analysis Tools Visualization Scholarly Communication Commercial & Public Tools Git-like Resources By Discipline Data Journals Discipline- Based Metadata Standards Community Portals Institutional Repositories New Reward Systems Commercial Repositories Training
    28. 28. Associate Director for Data Science Commons Training Center BD2K Modified Review Sustainability* Education* Innovation* Process • Cloud – Data & Compute • Search • Security • Reproducibility Standards • App Store • Coordinate • Hands-on • Syllabus • MOOCs • Community • Centers • Training Grants • Catalogs • Standards • Analysis • Data Resource Support • Metrics • Best Practices • Evaluation • Portfolio Analysis The Biomedical Research Digital Enterprise Communication Collaboration rogrammatic Theme Deliverable Example Features • IC’s • Researchers • Federal Agencies • International Partners • Computer Scientists Scientific Data Council External Advisory Board * Hires made
    29. 29. I. Facilitating Broad Use of Biomedical Big Data II. Developing and Disseminating Analysis Methods and Software for Biomedical Big Data III. Enhancing Training for Biomedical Big Data IV. Establishing Centers of Excellence for Biomedical Big Data BD2K: Four Programmatic Areas
    30. 30. What are we proposing as that common framework?
    31. 31. The Commons Is …  A public/private partnership  An agile development starting with the evaluation of a few pilots  An example: porting DbGAP to the cloud  An experiment with new funding strategies
    32. 32. What The Commons Is and Is Not  Is Not: – A database – Confined to one physical location – A new large infrastructure – Owned by any one group  Is: – A conceptual framework – Analogous to the Internet – A collaboratory – A few shared rules • All research objects have unique identifiers • All research objects have limited provenance
    33. 33. Sustainability and Sharing: The Commons Data The Long Tail Core Facilities/HS Centers Clinical /Patient The Why: Data Sharing Plans The Commons Government The How: Data Discovery Index Sustainable Storage Quality Scientific Discovery Usability Security/ Privacy Commons == Extramural NCBI == Research Object Sandbox == Collaborative Environment The End Game: KnowledgeNIH Awardees Private Sector Metrics/ Standards Rest of Academia Software Standards Index BD2K Centers Cloud, Research Objects,
    34. 34. What Does the Commons Enable?  Dropbox like storage  The opportunity to apply quality metrics  Bring compute to the data  A place to collaborate  A place to discover http://100plus.com/wp-content/uploads/Data-Commons-3- 1024x825.png
    35. 35. [Adapted from George Komatsoulis] One Possible Commons Business Model HPC, Institution …
    36. 36. Commons Pilots  Define a set of use cases emphasizing: – Openness of the system – Support for basic statistical analysis – Embedding of existing applications – API support into existing resources  Evaluate against the use cases  Review results & business model with NIH leadership  Design a pilot phase with various groups  Conduct pilot for 6-12 months  Evaluate outcomes and determine whether a wider deployment makes sense  Report to NIH leadership summer 2015
    37. 37. One Possible End Product 1. User clicks on thumbnail 2. Metadata and a webservices call provide a renderable image that can be annotated 3. Selecting a features provides a database/literature mashup 4. That leads to new papers 1. A link brings up figures from the paper 0. Full text of PLoS papers stored in a database 2. Clicking the paper figure retrieves data from the PDB which is analyzed 3. A composite view of journal and database content results 4. The composite view has links to pertinent blocks of literature text and back to the PDB 1. 2. 3. 4. PLoS Comp. Biol. 2005 1(3) e34
    38. 38. Mission Statement To foster an ecosystem that enables biomedical research to be conducted as a digital enterprise that enhances health, lengthens life and reduces illness and disability
    39. 39. Some Acknowledgements  Eric Green & Mark Guyer (NHGRI)  Jennie Larkin (NHLBI)  Leigh Finnegan (NHGRI)  Vivien Bonazzi (NHGRI)  Michelle Dunn (NCI)  Mike Huerta (NLM)  David Lipman (NLM)  Jim Ostell (NLM)  Andrea Norris (CIT)  Peter Lyster (NIGMS)  All the over 100 folks on the BD2K team

    ×