Big Data


Barend Mons on Big Data at the SURFnet Relatiedagen 2012

Published in: Technology
  • Messages: The data in the life sciences are not only immense but also highly complex. First, data are captured from the different levels of organisation in living organisms: DNA, RNA, proteins, metabolites, cells, tissues, organs and whole organisms. Next, even ecological, social-behavioural and epidemiological data play a key role. These data are captured with a variety of instruments and techniques and come in many different formats (not necessarily compatible). Such data are generated in studies on many different (model) organisms, from viruses and bacteria to humans. Many data need interpretation across species. Many data have to be captured in time or space series and are therefore also multidimensional. DISC will not only provide the necessary tools and compute infrastructure but, critically, also the experts to integrate and connect the data towards biological interpretation. In some cases this will involve only two pieces of the puzzle, but in many cases more. The final goal is biological understanding and societal application, not just major publications, in the Green, the Red and the White sectors of biology.
  • Messages: The Big Data problem is now pervading mainstream non-science literature as well, and the deluge is everywhere; however, the complexity and multidisciplinary nature of LS data make them a particular challenge. No single institution, or even Big Pharma or DSM/UNILEVER, can have all the technology and expertise in-house (see IMI, ESFRI). Even if economically and technically feasible, repeating the deep analysis and preprocessing of massive (frequently publicly available) datasets behind the firewalls of institutions or companies is now considered a waste of precious resources, as much of it is precompetitive. The real added value is in the biological interpretation of the data and its application in red, green and white innovations. Modern science is really about ‘projecting’ one’s own limited data on a massive body of ‘known’ and prior biological knowledge, way beyond ‘reading’. DISC will support all supra-institutional needs for data integration, stewardship and interpretation at the request of the users. DISC will be closely associated with the top research institutions participating in it, and distributed over multiple concentrations of expertise and infrastructure to ensure a continued ‘cutting edge’ offering in all four infrastructural aspects (computing, tooling, expertise and training). Several key technologies can be applied beyond the Life Sciences. If the Netherlands misses out on massive data expertise, other centers will develop and crucial expertise will ‘leave’ our country. Now, NL has a leading role and can benefit (example: BGI China).
  • The ecosystem approach with interoperable data makes knowledge management and knowledge discovery possible across ALL data.
  • A private party by definition cannot take this on, because it is not a trusted party (community building, certification, ONS management, etc.). Hence the 4 columns and all the work that has already been done, including 'adaptation' by a great many relevant Associations and Academic Institutions (CWA, W3C, ..................). This calls for a PPP approach in which Elsevier plays its own role, strategically positioned in the value chain of the ecosystem. The trusted-party activities, the infrastructure and the community are handled by others.
  • Chess game metaphor??
  • Big Data

    1. A curse of interdisciplinarity: ‘A challenge in the other discipline always seems “easy” because we are not hindered by knowledge.’ Barend Mons (DTL-DISC/ELIXIR), NBIC, LUMC.
    2. PPP (10/09/12)
    3. ELIXIR: Safeguarding the results of life science research in Europe. European Life Sciences Infrastructure for Biological Information.
    4. DISC*: the connected data departments of DTL research hotels. [Diagram labels: DISC*, technology facilities, technology research, education & training, DTL.] *) DISC = DTL Data Integration & Stewardship Centre
    5. What is bioinformatics? • The science of storing, retrieving and analysing large amounts of biological information • An interdisciplinary science involving biologists, biochemists, computer scientists and mathematicians • At the heart of modern biology
    6. Bioinformatics underpins life-science research: (1) Genomes contain genes. (2) Genes are transcribed. (3) Transcripts translate to protein sequences. (4) Proteins form three-dimensional structures. (5) Proteins interact with each other and with small molecules to form pathways. (6) Pathways combine to build systems.
    7. Life Science data: multi-omics, multi-technology, multi-organism, multi-dimensional
    8. From molecules to medicine. [Figure with the headings Molecular components, Integration and Translation; labels: Genomes, Nucleotides, Transcripts, Proteins, Domains, Structures, Small molecules, Complexes, Pathways, Cells, Tissues and organs, Human individuals, Human populations, Biobanks, Therapies, Disease prevention, Early diagnosis.]
    9. What is ELIXIR? • An ESFRI research infrastructure of global significance • Unites Europe’s leading life science organisations in managing and safeguarding the vast amounts of data being generated every day by publicly funded research. • A large-scale initiative that will provide the facilities necessary for Europe’s life-science researchers to make the most of our rapidly growing store of information about living systems, which is the foundation on which our understanding of life is built.
    10. Why ELIXIR? • Creating a robust infrastructure for biological information is a bigger task than EMBL-EBI, or any individual organisation or nation, can take on alone. • Biology has by far the largest research community: ~3 million life science researchers in Europe and >6 million web hits a day at EMBL-EBI alone. • We need to involve other European partners.
    11. The challenge • Computer speed and storage capacity are doubling every 18 months, and this rate is steady • DNA sequence data has been doubling every 6-8 months over the last 3 years and looks set to continue doing so for this decade (Guy Cochrane, ENA, EMBL-EBI)
    12. Europe has already paid for the science. [Chart comparing the annual cost of generating new protein structure data in labs around the world with the annual cost of maintaining the data in a central database.]
    13. ELIXIR’s mission: To build a sustainable European infrastructure for biological information, supporting life science research and its translation to: medicine, environment, bioindustries, society
    14. A distributed pan-European infrastructure
    15. Benefits: ELIXIR will contribute to European innovation by: • Optimising access and exploitation of life-science data • Ensuring longevity of the data, thereby protecting investments already made in research • Enhancing the quality of European research by supporting national efforts to increase the competence and number of bioinformatics users through training • Strengthening the global position and influence of Europe in life-science research in both academia and industry
    16. The scientific reason for ELIXIR • Data is an essential commodity for life-science research. • Ten years ago, finding the connection between a gene and a characteristic (e.g. drought tolerance, risk of heart disease) could take years; now it takes minutes. • Data analysis is now the bottleneck in life-science research. • ELIXIR is our only realistic hope of easing that bottleneck. (Image courtesy of Genome Research Ltd.)
    17. One societal reason for ELIXIR • The era of personal genome sequencing is upon us. • Sequence data will not cross national boundaries. • Every national health system will need expertise to interpret it and treat patients accordingly. • Individuals need to be sure that their personal biological data are in safe hands.
    18. The financial reason for ELIXIR • Europe has already spent the money to generate the data. • It will waste all this investment in research if the future of the data is not secured. • Industry, from SMEs to big multinationals, needs access to public data to analyse its proprietary data.
    19. Maintaining open access • Open access to life science is essential for advances in many areas of research. • Open access to bioinformatics resources provides a valuable path to discovery, one that in many other areas of research is limited by commercial confidentiality (Mark Forster, Syngenta, member of the EMBL-EBI Industry Programme). • Charging for that data, or seeking to restrict access through exercising Intellectual Property (IP) rights, would impede progress. • ELIXIR will guarantee that open access to biological data is maintained. Speaking with a single voice will strengthen Europe’s influence in such global discussions.
    20. 13 ELIXIR Countries
    21. Part two >>>> eScience in LS • The way we discover knowledge has changed fundamentally over just a decade. BIGNORANCE
    22. The general challenge: Data has far outgrown institutional handling capacity. The Issue: The Data Deluge is everywhere, but Life Sciences is particularly challenged and complex. More and more we write ‘about datasets’ that are too large to publish in narrative. ‘…The amount of digital data is exploding, with a staggering 1.8 zettabytes in 2011’
    23. Nanopublications & Cardinal Assertions. A Nanopublication is the smallest unit of publishable information, containing: 1. Assertion: a statement of concepts in terms of one or more ‘subject -> predicate -> object’ (triple) relationships. 2. Provenance: a) Attribution: who made this assertion, when and where? b) Supporting information: any other information which is relevant to the assertion (e.g. this assertion is only valid in humans under 18). A Cardinal Assertion aggregates all ‘n’ Nanopublications making the same assertion. It therefore has 1 assertion and ‘n’ provenances, eliminating redundancy. [Diagram: ‘n’ identical assertions with different provenances map to 1 Cardinal Assertion.]
    24. Under the hood…
    25. Managing volume & complexity: Combining Cardinal Assertions with Concept Profiles reduces the amount of data by ≈99.999996%. Individual Nanopublications: >10^14. Individual Cardinal Assertions: >10^11. Individual Concept Profiles: ≈4x10^6.
    26. The LS concept web: 2x2x10^6 concepts (profiles)
    27. A dynamic Concept Web versus a static Ontology
    28. [Figure legend: known reference pairs; non-co-occurrence pairs. Panel labels: More mutual information; No increase in concept overlap; Including manual curation; More concepts in common; Removal of low info paths.]
    29. eScience… in silico reasoning and in cerebro validation: Expert Skype calls, Reading up
    30. Organisation of the ecosystem. [Diagram labels, as extracted: Global Authority; Nanopublishers; App & Service Providers; Users; Endorse; CA Space (OCS & ICS); Application development; Knowledge Management; Providers; Reasoning services; Best Practices; ONS/INSs; Academic & Commercial Users; technical and process consultancy; project delivery capacity; Knowledge Discovery; Original Data Owners; Assist & Certify.]
    31. [Slide without text]
    32. IN ANY CASE: regardless of how ‘sensitive’ your data is, it is malpractice to: - Generate data without a solid stewardship plan - Build impenetrable SILOS - Fail to record provenance - Store data in a non-interoperable format - Think that data = information - EVEN if your only goal is the Nobel Prize (or, for the Dutch, a Spinoza Prize)
    33. Acceptance of Semantic Web Approach: Over the last decade, academic research organisations developed new methodologies and tools to address the Big Data problem. Global agreement by leading scientists on a unique Nanopublication solution. 100’s of millions already invested in the basic technology. Applicable as a technology across (STM) domains and industries. Pharmaceutical companies are early adopters (Innovative Medicines Initiative).
    34. Acknowledging… The ‘Dutch Team’: Herman van Haagen, MSc. (LUMC); Dr. Peter Bram ‘t Hoen (LUMC); Dr. Marco Roos (LUMC); Dr. Erik Schultes (LUMC); Prof. Johan den Dunnen (LUMC); Prof. Gertjan van Ommen (LUMC); Dr. Erik van Mulligen (EMC); Dr. Jan Kors (EMC); Dr. Martijn Schuemie (EMC); Prof. Johan van der Lei (EMC); Dr. Rob Hooft (NBIC); Dr. Christine Chichester (NBIC); Dr. Leon Mei (NBIC); Kees Burger (NBIC); Bharat Singh (NBIC/EMC); Dr. Marc van Driel (NBIC); Dr. Ruben Kok (NBIC); Prof. Marcel Reinders (NBIC); Prof. Jaap Heringa (NBIC); Prof. Gert Vriend (NBIC); Dr. Morris Schwertz (BBMRI, CWA); Dr. Andra Waagmeester (NBIC); Dr. Kristina Hettne (LUMC); Dr. Rene van Schaik (eScience Centre); Drs. Albert Mons (PHORTOS consultants); Mr. Drs. Arie Baak (PHORTOS consultants). CWA - Open PHACTS: Prof. Amos Bairoch (SIB, Switzerland, CWA); Prof. Carole Goble (Manchester, CWA, OPS); Prof. Katy Borner (Indiana University, CWA); Prof. Mark Musen (NCBO, Stanford, CWA, OPS); Dr. Pascale Gaudet (UniProt, ISB, CWA); Dr. Mike Colon (VIVO, UF, CWA); Prof. Maryann Martone (Force 11, USC, CWA); Dr. Nigam Shah (NCBO, Stanford, CWA, OPS); Dr. Mark Wilkinson (Canada, CWA); Abel Packer (Brazil, Scielo, CWA, OPS); Jan Velterop (ACKnowledge, CWA, OPS); Albert Mons (CWA, NBIC); Prof. Frank van Harmelen (FUA/LARKC, CWA, OPS); Dr. Chris Evelo (Maastricht, CWA, OPS); Dr. Antony Williams (RSC/ChemSpider, CWA, OPS); Dr. Richard Kidd (RSC, OPS); Dr. Paul Groth (FUA, CWA, OPS); Dr. Michel Dumontier (Canada, CWA, OPS); Dr. Andrew Gibson (UA, CWA, OPS); Dr. Bryn Williams-Jones (Pfizer, OPS); Dr. Ian Dix (Astra Zeneca, OPS); Dr. Niklas Blomberg (Astra Zeneca, OPS); Dr. Mike Barnes (GSK, OPS); Prof. Jan-Erik Litton (CWA, BBMRI).
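
The doubling rates quoted on slide 11 can be compared with a short calculation. This is only an illustrative sketch: it assumes idealised, steady exponential growth and takes 7 months as the midpoint of the quoted 6-8 month range. The function name `growth_factor` is invented for this example and does not come from the slides.

```python
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Factor by which a quantity grows under steady doubling."""
    return 2.0 ** (months_elapsed / doubling_months)


horizon = 36  # three years, expressed in months

# Compute speed and storage: doubling every 18 months (slide 11).
compute = growth_factor(horizon, 18)

# DNA sequence data: doubling every 6-8 months (slide 11);
# 7 months is an assumed midpoint, not a figure from the slides.
sequence = growth_factor(horizon, 7)

print(f"compute capacity grows by a factor of {compute:.0f}")
print(f"sequence data grows by a factor of {sequence:.0f}")
print(f"data outpaces compute by a factor of {sequence / compute:.1f}")
```

Under these assumptions compute capacity quadruples in three years while sequence data grows by well over an order of magnitude, which is the widening gap the slide points to.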
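
The aggregation described on slide 23, and the reduction figure on slide 25, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class names, field names and example triples are all hypothetical, and the counts passed to `reduction_percent` are the approximate orders of magnitude shown on slide 25 (>10^14 nanopublications, ≈4x10^6 concept profiles).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class Nanopublication:
    assertion: Triple
    attribution: str           # who made the assertion, when and where
    supporting_info: str = ""  # e.g. conditions under which it holds


@dataclass
class CardinalAssertion:
    assertion: Triple
    provenances: List[Tuple[str, str]]  # one (attribution, supporting_info) per nanopublication


def aggregate(nanopubs: List[Nanopublication]) -> List[CardinalAssertion]:
    """Group nanopublications that make the same assertion."""
    grouped: Dict[Triple, List[Tuple[str, str]]] = defaultdict(list)
    for nanopub in nanopubs:
        grouped[nanopub.assertion].append((nanopub.attribution, nanopub.supporting_info))
    return [CardinalAssertion(triple, provs) for triple, provs in grouped.items()]


def reduction_percent(before: float, after: float) -> float:
    """Percentage by which the number of items shrinks."""
    return 100.0 * (1.0 - after / before)


# Hypothetical example data: two sources assert the same triple.
nanopubs = [
    Nanopublication(("gene_A", "is_associated_with", "disease_X"), "lab 1, 2010"),
    Nanopublication(("gene_A", "is_associated_with", "disease_X"), "lab 2, 2011",
                    "only valid in humans under 18"),
    Nanopublication(("gene_B", "encodes", "protein_Y"), "lab 3, 2012"),
]

cardinal = aggregate(nanopubs)
print(len(nanopubs), "nanopublications ->", len(cardinal), "cardinal assertions")

# Orders of magnitude from slide 25: nanopublications down to concept profiles.
slide_reduction = reduction_percent(1e14, 4e6)
print(f"reduction: {slide_reduction:.6f}%")
```

Each cardinal assertion keeps one triple and all `n` provenances, so no attribution is lost while the redundant copies of the assertion disappear. With the slide's approximate counts, the reduction from nanopublications to concept profiles works out to the ≈99.999996% quoted on slide 25.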