
Starting the Hadoop Journey at a Global Leader in Cancer Research


  1. HDP @ MD Anderson: Starting the Hadoop Journey at a Global Leader in Cancer Research. Vamshi Punugoti & Bryan Lari, MD Anderson Cancer Center, June 2016
  2. Agenda • About MD Anderson • Big Data Program • Our Hadoop Implementation • Lessons Learned • Next Steps
  3. About MD Anderson • Who we are – One of the world's largest centers devoted exclusively to cancer care – Created by the Texas legislature in 1941 – Named one of the nation's top two hospitals for cancer care every year since the survey began in 1990 • Mission – MD Anderson's mission is to eliminate cancer in Texas, the nation and the world through exceptional programs that integrate patient care, research and prevention.
  4. About MD Anderson cont. • Patient Care • Education • Research
  5. Moon Shots Program • Launched in 2012 to make a giant leap for patients • Accelerating the pace of converting scientific discoveries into clinical advances that reduce cancer deaths • Transdisciplinary team-science approach • Transformative professional platforms • 12 total Moon Shots: B-cell Lymphoma, Breast Cancer, Colorectal Cancer, Glioblastoma, HPV-Related Cancers, Leukemia (CLL, MDS, AML), Lung Cancer, Melanoma, Multiple Myeloma, Ovarian Cancer, Pancreatic Cancer, Prostate Cancer • http://www.cancermoonshots.org
  6. The four V's of big data: Volume, Variety, Velocity, Veracity
  7. Gulf of Mexico Analogy
  8. Goals of Big Data Program • Data-driven organization • All "types" of data • "Access" for all customers: clinicians, researchers, administrative / operational • Enable discovery of "insights": improve patient care, increase research discoveries, improve operations • Govern data like an asset • Provide a platform / environment to enable all of these things
  9. Goal: to provide the right information to the right people at the right time with the right tools (data → insight)
  10. Insights
  11. Make big data additive and build upon the foundation
  12. What are we doing today? • FIRE Enterprise Data Warehouse • Natural Language Processing (NLP) • Data Governance • Hadoop / NoSQL • Cognitive Computing • Data Visualization • Evolving our platform / architecture • Identifying big data use cases • Training & skills
  13. FIRE Program • Federated Institutional Reporting Environment • Centralized data repository supporting analytics, decision making, and business intelligence • Central repository for historical and operational data • Breaks down data silos • Diagram: source systems (Genomic, Radiology, Labs, Epic / Clarity, legacy systems) feed the enterprise repository, which drives analytics & reporting (dashboards, KPIs, analytic reports) and outcomes (discoveries, improved patient care, quality / performance improvements)
  14. NLP Pipeline – Overview • Vast amounts of unstructured data are stored on MDACC servers • Conventional ETL tools are not designed to mine unstructured data • A suite of tools makes up the NLP pipeline: unstructured data sources → IBM ECM NLP engine → post-NLP database → HDWF (FIRE) • Dictionaries were created to help the Epic go-live (Provider Friendly Terminology) • Other examples: diagnosis from pathology reports, comorbidities, family cancer history, cytogenetics, obituary text, ICD-10 coding, structured results feeding Moonshot TRA and OEA, etc.
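The dictionary-lookup step described above can be sketched in a few lines. This is a minimal illustration only: the terms, codes, and function name below are hypothetical stand-ins, not MD Anderson's actual dictionaries or pipeline code.

```python
import re

# Hypothetical dictionary mapping free-text terms to structured codes;
# the entries are illustrative examples, not a real terminology set.
DIAGNOSIS_DICTIONARY = {
    "adenocarcinoma": "C25.9",
    "glioblastoma": "C71.9",
    "multiple myeloma": "C90.00",
}

def extract_diagnoses(report_text):
    """Return (term, code) pairs for dictionary terms found in a report."""
    text = report_text.lower()
    hits = []
    for term, code in DIAGNOSIS_DICTIONARY.items():
        # Whole-word match so "myeloma" alone does not fire on a longer term.
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            hits.append((term, code))
    return hits

sample = "Pathology: findings consistent with glioblastoma, WHO grade IV."
print(extract_diagnoses(sample))  # [('glioblastoma', 'C71.9')]
```

A production NLP engine does far more (negation, context, section detection), but the core idea of dictionary-driven structuring is the same.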
  15. Big Data for Analytics & Cognitive Computing • Systems of Record (enterprise, business, clinical): PeopleSoft, Kronos, Point of Sale, Volunteer Services, Rotary House, MyHR, UTPD, Facilities, Clinic Station, Epic, Lab, GE IDX, Cerner, CARE, Parking Garages, Pharmacy, Medical Devices, Medical Equipment, Building Controls, Campus Video, Real-time Location Service, Wayfinding • Systems of Reporting: EPM, Hyperion, Oracle Business Intelligence, Smart View, Web Analytics, FIRE, EIW, Business Objects, Crystal, Hyperion Interactive Reporting • External sources: Facebook, Twitter, LinkedIn, YouTube, UPS, Centers for Disease Control, The Weather Channel, oracle.com, Yelp!, Reuters, Google, U.S. Census • Research: LCDR, Melcore, Gemini, IPCT • Systems of Insights: Data Visualization, Ad Hoc, Cognitive Computing, Presentation, Cohort Explorer
  16. Data Governance • Data Stewardship • Data Portal • Data Profiling and Quality • Data Standardization • Compliance • Metadata and Business Glossary • Master Data Management
  17. Platform diagram: source systems (Genomic, Radiology, Labs, Epic / Clarity, legacy systems) flow into a data repository / data lake (big data, structured and NoSQL) under data management & operations (data discovery, profiling, standards / quality), feeding analytics & informatics (dashboards, KPIs, analytic reports, insight apps) and outcomes (discoveries, improved patient care, quality / performance improvements)
  18. Big Data – High Level
  19. Big Data Technical Architecture
  20. Our Hadoop Implementation
  21. Our Hadoop Implementation cont.
  22. Our Hadoop Implementation cont. • Average number of messages per day: 1,556,688 • Estimated amount of storage increase per day: 5.7 GB • Number of channels currently in use: 24 • Estimated daily message processing capacity: 4,320,000
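The figures on this slide can be sanity-checked with back-of-the-envelope arithmetic: current traffic uses roughly a third of the stated daily capacity, and per-message storage works out to a few kilobytes. The calculation below uses only the numbers from the deck; the derived quantities are my own arithmetic, not figures from the presentation.

```python
# Values taken from the slide.
messages_per_day = 1_556_688
capacity_per_day = 4_320_000
storage_per_day_gb = 5.7

# Derived, illustrative figures (not from the deck).
utilization = messages_per_day / capacity_per_day
bytes_per_message = storage_per_day_gb * 1024**3 / messages_per_day
tb_per_year = storage_per_day_gb * 365 / 1024

print(f"capacity utilization: {utilization:.0%}")        # 36%
print(f"storage per message:  {bytes_per_message:.0f} B")  # ~3932 B
print(f"storage growth/year:  {tb_per_year:.1f} TB")       # 2.0 TB
```

So the cluster has roughly 2.8x headroom on message throughput at these rates, and about 2 TB of new HL7 data per year before replication.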
  23. Our Hadoop Implementation cont. Medical Device Data Flow: medical devices (data source) feed Capsule and the Capsule DB (data capture); raw HL7 from Capsule flows through the Cloverleaf engine (integration hub), which cleanses and transforms it and, using the patient ID from Epic, produces validated HL7; a TCP-based data listener (Flume) ingests messages into the MDA big data lake (processing channels, HBase data loader, Sqoop), where Hive, Pig, and Hunk serve end users through access portals (analytics / visualization) on FIRE / Big Data
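The "cleanse & transform" hop in this flow hinges on pulling the patient identifier out of each raw HL7 v2 message. As a minimal sketch, the function below extracts PID-3 from a simplified, made-up message; the function name and sample message are hypothetical, not Capsule or Cloverleaf output.

```python
def extract_patient_id(hl7_message):
    """Return the patient identifier (PID-3) from an HL7 v2 message, or None."""
    # HL7 v2 segments are carriage-return separated; fields use '|',
    # components within a field use '^'.
    for segment in hl7_message.strip().split("\r"):
        fields = segment.split("|")
        if fields[0] == "PID" and len(fields) > 3:
            return fields[3].split("^")[0] or None
    return None

# Simplified, fabricated example message (not real device output).
raw = ("MSH|^~\\&|CAPSULE|ICU|EPIC|MDA|20160601||ORU^R01|123|P|2.3\r"
       "PID|1||100045^^^MDA||DOE^JANE")
print(extract_patient_id(raw))  # 100045
```

A real pipeline would use a proper HL7 parser and validate the ID against Epic before loading into HBase, but the segment/field structure being navigated is the same.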
  24. Our Hadoop Implementation cont. Developer workstation / sandbox → SVN (source control server) → Bamboo (build server) → HDP Dev cluster → HDP QA cluster → HDP Prod cluster. Development cycle: daily check-in / check-out; periodic integration & validation (build, unit test & notify on error). Deployment cycle: on dev lead approval, build, unit test, deploy & tag; on successful UAT & release approval, deploy per last successful build tag; smoke test before updating task status.
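The promotion gates in this pipeline can be modeled as a tiny state machine: code advances Dev → QA → Prod only when the approval named for that hop is present. This is an illustrative sketch of the gating logic only; the stage and gate names are mine, not Bamboo configuration.

```python
# Hypothetical model of the slide's promotion gates.
STAGES = ["dev", "qa", "prod"]
GATES = {
    ("dev", "qa"): "dev_lead_approval",   # build, unit test, deploy & tag
    ("qa", "prod"): "release_approval",   # deploy per last successful build tag
}

def promote(current, approvals):
    """Return the next stage if its gate is approved, else stay put."""
    if current == STAGES[-1]:
        return current  # prod is terminal
    nxt = STAGES[STAGES.index(current) + 1]
    return nxt if GATES[(current, nxt)] in approvals else current

print(promote("dev", {"dev_lead_approval"}))  # qa
print(promote("qa", set()))                   # qa (blocked until release approval)
```

Encoding the gates as data rather than branching code makes it easy to audit which approval unlocks which hop.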
  25. Lessons Learned – what went well (people & process) 1. It's complex 2. It's a journey 3. Leverage existing strengths 4. Collaborate openly 5. Learn from experts 6. One cluster – multiple use cases 7. Follow best practices
  26. Next Steps 1. Continue to expand / evolve our platform 2. Ingest more data and data types 3. Identify high-value use cases 4. Develop / train people with new skills
  27. Train People with New Skills • Accessing data • Computing data • Visualizing data • Insights & cognitive computing
