Ari Berman - Intel Big Data Seminar 9/6/2012

Intel hosted a seminar on Managing Big Data in Life Sciences and Healthcare at the Broad Institute in Cambridge, MA on 9/6/2012. I was invited to give a talk on how BioTeam, Inc. sees and is approaching the Big Data management problem facing the life sciences in the wake of modern genomics and next-generation sequencing.

Published in: Technology, Business

    1. BIOTEAM: Enabling Science. Storage Infrastructure and Data Management in Life Sciences. Ari E. Berman, Ph.D., Senior Scientific Consultant, BioTeam, Inc. ©BioTeam, Inc. 2012 - http://www.bioteam.net
    2. A little about me • Ph.D. in Molecular Biology/Neuroscience • Trained in laboratory and bioinformatics • 13 years’ experience as an IT infrastructure/HPC geek/Perl monger • Odd mix of skills led me to BioTeam • Joined BioTeam in May
    3. Who is BioTeam? • Independent consulting practice • Made up of scientists with vast experience in software, HPC, and IT • Unique cross-section of skill sets • 10+ years of bridging the gap between technology and science • Functions as much as a think tank as a consulting practice
    4. Why am I here talking to you? • We work on a broad range of projects: Pharma, Biotech, EDU, .gov, .mil, etc. • We are in a unique position: we can see how people are approaching current problems • We work from a tech-agnostic perspective: we provide what’s best for the customer • Our niche: 1000 ft. overview of tech problems in life sciences
    5. Why are we all here? Big data in life sciences: just when you thought it was safe to go back into the datacenter...
    6. Big data: the tired story • Next-generation sequencing, mass spec, imaging, etc. • High-throughput experimentation • Clinical research/standard healthcare - personalized medicine • Unnatural expansion of technology (sequencing) • Now: we can get the data fast, what do we do with it?
    7. Big data: the tired story • At this point, this is an old problem • Most sequencers generating 0.5TB/day • Final genomes around 300GB • High-volume quantitative methods quickly produce 100s of TBs of data • The kicker: tight research budgets
    8. Storing Big Data • Problem is less about storing the data. We’ve solved storage. • We can now put in thousands of spindles in a semi-affordable manner • Lots of high-density boxes • The petabyte challenge has been met • Now, it needs to work well • And still be affordable
    9. Today’s problem: Accessing Big Data • In practice - get to 1.5PB, 500M files: metadata falls off a cliff • Directory listings take minutes • Sorting takes forever • Forget about filesystem profiling/optimization
    10. Today’s problem: Accessing Big Data • What’s being done? • SSDs thought to be our savior • Blazing fast, SLC, many in parallel • Parallel filesystems could cache metadata on SSDs • Reduce search time by orders of magnitude
    11. Today’s problem: Accessing Big Data • Of course, it’s not that simple • Now, distribution and access points of SSDs matter • How they are addressed matters • How many small files are on the filesystem matters • How the files are to be used matters
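The point that the number of small files matters can be made concrete with a simple profiling pass. The following is a minimal sketch, not a tool from the talk: the function names and the 64 KiB "small file" cutoff are assumptions chosen for illustration. It walks a tree with `os.scandir` and counts directories, files, small files, and total bytes.

```python
import os
import tempfile


def profile_tree(root, small_file_bytes=64 * 1024):
    """Walk a directory tree and count dirs, files, small files, and bytes.

    small_file_bytes is an arbitrary illustrative cutoff, not a figure
    taken from the slides.
    """
    stats = {"dirs": 0, "files": 0, "small_files": 0, "bytes": 0}
    stack = [root]
    while stack:
        current = stack.pop()
        stats["dirs"] += 1
        # os.scandir yields entries lazily, which helps on huge directories.
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    size = entry.stat(follow_symlinks=False).st_size
                    stats["files"] += 1
                    stats["bytes"] += size
                    if size < small_file_bytes:
                        stats["small_files"] += 1
    return stats


def make_demo_tree():
    """Create a tiny throwaway tree so the sketch can run anywhere."""
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "run1"))
    with open(os.path.join(root, "run1", "reads.fastq"), "wb") as fh:
        fh.write(b"A" * 100_000)
    with open(os.path.join(root, "run1", "notes.txt"), "wb") as fh:
        fh.write(b"x" * 10)
    return root


if __name__ == "__main__":
    print(profile_tree(make_demo_tree()))
```

On a filesystem with hundreds of millions of files, a walk like this is exactly the kind of metadata-heavy operation the slides describe as painful, so in practice it would be run against a sample of the tree or against a metadata cache.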
    12. But wait: there’s more • A consistent array of disks is no longer enough beyond 1.5PB • Entire solutions of high-speed disks are not cost-effective • Distribution of file access needs: some fast, some archive • Tiering of storage infrastructure
    13. Tiering • Keep archival data on slower, cheaper disks • No SSDs • Keep fast-access files on smaller, high-speed disks with many (possibly all) SSDs (HPC, high-throughput needs) • Mid-level tiers for administrative needs (documents, etc.) • Can even add a tape tier for more permanent storage
    14. Managing Tiers • Administratively difficult • Can manage by different mount points, quotas, user education • Better: policy engines • Use with parallel file systems (GPFS, OneFS, etc.) • Policy-based automated movement of files through tiers, even to tape
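As a rough illustration of what policy-based movement through tiers means, here is a minimal sketch of an age-based placement rule. It is not the rule syntax of GPFS, OneFS, or any other product; the tier names, the 30-day and 365-day thresholds, and the helper names are all hypothetical.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds


@dataclass
class FileRecord:
    path: str
    last_access: float  # seconds since the epoch


# Hypothetical policy: (maximum idle days, tier name), checked in order.
POLICY = [(30, "fast"), (365, "capacity")]
FALLBACK_TIER = "tape"  # anything idle longer than every rule above


def choose_tier(record, now, policy=POLICY, fallback=FALLBACK_TIER):
    """Pick a tier from how long the file has gone without being accessed."""
    idle_days = (now - record.last_access) / DAY
    for max_idle_days, tier in policy:
        if idle_days <= max_idle_days:
            return tier
    return fallback


def plan_moves(records, current_tiers, now):
    """Return (path, from_tier, to_tier) for files whose tier should change."""
    moves = []
    for record in records:
        target = choose_tier(record, now)
        current = current_tiers.get(record.path)
        if current != target:
            moves.append((record.path, current, target))
    return moves
```

A real policy engine would evaluate rules like these inside the filesystem and perform the data movement itself; the sketch only shows the decision step.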
    15. By golly, we’ve done it! • If done correctly, single-namespace infrastructure can work well for all needs • Can handle HPC to archive • Can be done in a semi-affordable manner
    16. Now what? • Now, we’re faced with more problems • For NIH, HIPAA laws, and general sanity, need DR • Need twice the space you’ll use • No other way to do it right now • Use inexpensive, slow disk solutions to save money on DR
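The "twice the space" rule combines with the 0.5TB/day sequencer figure quoted earlier into a quick back-of-the-envelope estimate. The sketch below is only arithmetic under stated assumptions: the count of four sequencers and the one-year horizon are made-up inputs, and the estimate ignores compression, deletion of intermediate files, and filesystem overhead.

```python
def required_capacity_tb(sequencers, tb_per_day_each, days, dr_copies=1):
    """Raw capacity for primary data plus dr_copies full DR copies, in TB."""
    primary_tb = sequencers * tb_per_day_each * days
    return primary_tb * (1 + dr_copies)


if __name__ == "__main__":
    # Hypothetical lab: 4 sequencers at 0.5 TB/day each, kept for one year.
    primary = required_capacity_tb(4, 0.5, 365, dr_copies=0)
    with_dr = required_capacity_tb(4, 0.5, 365, dr_copies=1)
    print(f"primary: {primary:.0f} TB, with DR: {with_dr:.0f} TB")
```

Under these assumptions a single year of raw output already passes a petabyte once the DR copy is counted, which is why cheaper, slower disk for the DR side matters.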
    17. Now what? • Also: how to keep track of data? • At 1PB and 0.5 billion files, creative directory structures lose out • Complexity too much for anyone to handle
    18. Data Management • One solution: databases • Keep higher depth of metadata (tagging, descriptions) • Cumbersome for the general user: adds a complexity layer to the user experience
    19. Data Management • Databases can work, though • iRODS is a good example • Put the metadata database layer in between the filesystem and the user
    20. Data Management • Others working on this model as well • Cambridge Computer: “As you approach billions of files, file exploring is no longer feasible.” • Need a new interface • Rich metadata to keep track of the files
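The idea of a metadata database sitting between the filesystem and the user can be sketched with a small tag catalog. This is not the iRODS API or data model; it is a minimal illustration using Python's built-in `sqlite3`, and the table, column, and function names are assumptions.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Create (or open) a tiny file catalog with free-form tags."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        "file_id INTEGER NOT NULL REFERENCES files(id), "
        "tag_key TEXT NOT NULL, tag_value TEXT NOT NULL)"
    )
    # Index so lookups by tag do not scan every row.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS tags_kv ON tags (tag_key, tag_value)"
    )
    return conn


def register(conn, path, **tags):
    """Record a file path and attach descriptive tags to it."""
    conn.execute("INSERT OR IGNORE INTO files (path) VALUES (?)", (path,))
    file_id = conn.execute(
        "SELECT id FROM files WHERE path = ?", (path,)
    ).fetchone()[0]
    conn.executemany(
        "INSERT INTO tags (file_id, tag_key, tag_value) VALUES (?, ?, ?)",
        [(file_id, key, str(value)) for key, value in tags.items()],
    )
    conn.commit()


def find(conn, key, value):
    """Return paths carrying the given tag, without touching the filesystem."""
    rows = conn.execute(
        "SELECT f.path FROM files f JOIN tags t ON t.file_id = f.id "
        "WHERE t.tag_key = ? AND t.tag_value = ? ORDER BY f.path",
        (key, str(value)),
    )
    return [row[0] for row in rows]
```

A user then asks the catalog for everything tagged with a given project instead of browsing directories, for example `find(conn, "project", "p1")`.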
    21. Wait, more metadata? • More metadata? Wasn’t this the original problem on large filesystems? • Wouldn’t this make matters worse? • Depends on how it is done. • Current models have metadata completely separate from files
    22. Wait, more metadata? • And... • Who’s going to go back and type all of that metadata in? • No one - we kind of need to start over • ...or, need a way of inferring metadata and filling in the blanks from existing data • Still need legacy support for systems
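One simple reading of "inferring metadata from existing data" is to mine what file paths and names already say. The sketch below is an assumption-laden illustration, not a method described in the talk: the extension-to-type mapping, the date pattern, and the function name are all hypothetical, and real naming conventions vary widely between labs.

```python
import os
import re

# Hypothetical mapping from file extension to a coarse data type.
DATA_TYPES = {
    ".fastq": "sequence reads",
    ".fq": "sequence reads",
    ".bam": "alignments",
    ".vcf": "variant calls",
}

# Illustrative pattern for dates such as 20120906 or 2012-09-06 in a name.
DATE_IN_NAME = re.compile(r"(20\d{2})[-_]?(\d{2})[-_]?(\d{2})")


def infer_metadata(path):
    """Guess basic metadata from a path alone, without opening the file."""
    name = os.path.basename(path)
    stem, ext = os.path.splitext(name)
    compressed = ext == ".gz"
    if compressed:
        # Look through the .gz suffix at the underlying extension.
        stem, ext = os.path.splitext(stem)
    meta = {
        "data_type": DATA_TYPES.get(ext.lower(), "unknown"),
        "compressed": compressed,
    }
    match = DATE_IN_NAME.search(stem)
    if match:
        meta["date"] = "-".join(match.groups())
    parent = os.path.basename(os.path.dirname(path))
    if parent:
        meta["parent_dir"] = parent
    return meta
```

Guesses like these are only a starting point for filling in the blanks; anything inferred this way would still need review before being treated as authoritative.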
    23. Or: Middleware • Use an interactive software product between filesystem and user • Can manage link between filesystem and extended metadata • Can enhance the scientific process: manage data, analysis, results, and facilitate collaboration
    24. What are scientists doing? Lab Scientist w/ Excel: • Accessible for most scientists • Flexible • Data maintenance burden on lab scientists • Quickly overwhelmed in size and complexity • Data publication by email
    25. What are scientists doing? Lab Scientist w/ Excel: • Accessible for most scientists • Flexible • Data maintenance burden on lab scientists • Quickly overwhelmed in size and complexity • Data publication by email. Lab Bioinformatician: • Quick development of web-based system • Rapid turn around for scientist needs • Single point of failure • Limited breadth of experience • Poor documentation, poor transition
    26. What are scientists doing? Lab Scientist w/ Excel: • Accessible for most scientists • Flexible • Data maintenance burden on lab scientists • Quickly overwhelmed in size and complexity • Data publication by email. Lab Bioinformatician: • Quick development of web-based system • Rapid turn around for scientist needs • Single point of failure • Limited breadth of experience • Poor documentation, poor transition. Outsource custom software: • Stable, professional software • Well documented, easier transition • Communication barrier with scientists • Lack of domain knowledge leaves large functionality gaps • Inflexible design leaves software obsolete in a matter
    27. What are scientists doing? Lab Scientist w/ Excel: • Accessible for most scientists • Flexible • Data maintenance burden on lab scientists • Quickly overwhelmed in size and complexity • Data publication by email. Lab Bioinformatician: • Quick development of web-based system • Rapid turn around for scientist needs • Single point of failure • Limited breadth of experience • Poor documentation, poor transition. Outsource custom software: • Stable, professional software • Well documented, easier transition • Communication barrier with scientists • Lack of domain knowledge leaves large functionality gaps • Inflexible design leaves software obsolete in a matter. “Shrink-wrapped” software: • Lab data management solutions leverage many customers, years of experience • Year-to-year enhancement of product • High purchase price due to limited market • Mismatch to local lab expertise and workflow • Unused complexity
    28. LIMS • Laboratory Information Management System • Many out there now: standard and custom • Many focus markets • BaseSpace: Illumina (NGS) • Quartzy: general lab monkey • MiniLIMS
    29. Disclaimer • This will feel like a sales pitch • Just want to illustrate how we’re tackling the information management problem
    30. The BioTeam Solution: MiniLIMS • An affordable software product that leverages real-world experience • Decades of combined software and informatics expertise • Years of LIMS customization • $4995 license for academic labs • Flexible architecture that adapts to new processes and technologies • Schema-less design allows real-time changes to data model • Plugin architecture allows mix-and-match functionality
    31. The BioTeam Solution: MiniLIMS • Customization options that match lab resources • End-user customizable system and Excel import/export that empowers lab scientists • Accessible source code and APIs for in-house developers • BioTeam consulting for labs without development resources, or development teams that are stretched thin
    32. The BioTeam Solution: MiniLIMS (architecture diagram). Labels: End user configurable Form and Page Display; GC Mass Spec; Invoicing; PHP API; Data and Configuration Objects; Analysis tools; Data Broker; NGS; Schema-less MySQL persistence. Legend: MiniLIMS Core, MiniLIMS Plugins
    33. MiniLIMS: Linking lab to datacenter (workflow diagram). Labels: Central auth; Customer workflow; Reagent inventory; Uptime reporting; User acct setup, login; Lab workflow; Sample receiving; Sample registration; Sample / library prep, QC; Run / slide / flowcell setup; Sample status; Instrument console; Run monitoring; Results delivery, billing; Analysis launch, monitoring, results. Legend: MiniLIMS Core, MiniLIMS Plugin, MiniLIMS Custom
    34. Simple/Flexible Concept: Type, Name, Property, Value
    35. Simple to query
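The Type / Name / Property / Value concept reads like an entity-attribute-value layout, where new properties can be added without altering any table. The sketch below illustrates that general pattern only; it is not MiniLIMS's actual schema or API. It uses Python's built-in `sqlite3` as a stand-in for the MySQL persistence named on the slides, and every table, column, and function name is an assumption.

```python
import sqlite3


def open_store():
    """Create an in-memory store: objects have a type and a name,
    and any number of free-form property/value rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects ("
        "id INTEGER PRIMARY KEY, "
        "object_type TEXT NOT NULL, object_name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE properties ("
        "object_id INTEGER NOT NULL REFERENCES objects(id), "
        "prop_name TEXT NOT NULL, prop_value TEXT)"
    )
    return conn


def add_object(conn, object_type, object_name, **props):
    """Insert an object and its properties; unseen property names need
    no schema change because each property is just another row."""
    cur = conn.execute(
        "INSERT INTO objects (object_type, object_name) VALUES (?, ?)",
        (object_type, object_name),
    )
    object_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO properties (object_id, prop_name, prop_value) "
        "VALUES (?, ?, ?)",
        [(object_id, name, str(value)) for name, value in props.items()],
    )
    return object_id


def query(conn, object_type, prop_name, prop_value):
    """Return names of objects of a type having a given property value."""
    rows = conn.execute(
        "SELECT o.object_name FROM objects o "
        "JOIN properties p ON p.object_id = o.id "
        "WHERE o.object_type = ? AND p.prop_name = ? AND p.prop_value = ? "
        "ORDER BY o.object_name",
        (object_type, prop_name, str(prop_value)),
    )
    return [row[0] for row in rows]
```

The trade-off of this layout is that every value is stored as text and multi-property queries need one join per property, which is the usual price of changing the data model at run time.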
    36. Customizations: plugins
    37. Customizations: plugins. 1. Select Bowtie 2. Select the Fastq File & Name the experiment 3. Click to run the protocol
    38. Customizations: plugins
    39. Workflows
    40. Moving forward: Appliance • Turnkey solution • MiniLIMS + Local Analysis Engine • Plan is to link to cloud resources: automatic backup & link to hosted MiniLIMS • 16 cores, 96GB RAM, 18TB redundant storage, SSD for OS • Solution for any lab needing LIMS
    41. How to enable science • Solidify storage infrastructure • Add tiered storage with policy engine to move data • Supply DR • Enable metadata acceleration: SSDs + cache • Implement middleware for rich metadata tracking • Make it easy for the scientists
    42. Thank you!