Abstract: There is an urgent need in pediatric ICUs to collect, store, and transform healthcare data in order to make accurate and timely predictions about patient outcomes and treatment recommendations. We are heavily invested in using open source big data stacks to achieve this goal and help our young ones. In this talk I will highlight how we manage structured and unstructured high-frequency data generated by a disparate set of devices and systems, and how we have created data pipelines to process the data and make it available to data scientists and app developers.
3. VPICU Mission
• Formed in 1998
• Assist doctors in making accurate decisions using
advanced computational techniques and AI based
on data
• Find similarities between a patient at bedside and
historical populations with known outcomes
• What kind of decisions:
• Diagnosis (sepsis, ARDS, etc.)
• Severity of illness
• Physiologic trajectory
• Re-admissions
• Treatment recommendations https://youtu.be/f2OqkRhj5uI
THE LAURA P. AND LELAND K. WHITTIER VIRTUAL PICU
5. But where is pediatric healthcare?
Source: http://www.lorienpratt.com/are-machine-learning-and-big-data-all-about-just-advertising-and-marketing/
6. Brief History of IoT in the ICU
• 1940s – 1990s:
• Several patient monitoring devices/systems
• Creation of EMR with data aggregation
• LAN/Web based connectivity
• HIPAA (Health Insurance Portability and
Accountability Act of 1996)
• Health Level 7 (HL7) standard
• 2000s:
• Clinical information systems
• Philips CareVue, GE Centricity
• RDBMS based data storage
• Sharing of high level information via
proprietary standards
• EHR (Cerner, Epic), Meaningful Use
8. Remaining Challenges
• Interoperability:
• Medical Device industry still uses proprietary data formats and protocols
• Some EHR vendors do not adhere to open standards
• Liability:
• HIPAA privacy and security regulations have led to a protectionist culture
• Hospitals are reluctant to share data to address research needs
• Velocity/Volume of Big Data:
• High frequency waveforms generate petabytes of data
• Storage on traditional platforms is not scalable
• Skills: Healthcare IT has a skills gap when it comes to big data technologies
18. On-Prem Infrastructure
Spark Data Pipeline
• Data cleansing
• Data munging
• Data tagging
• Data aggregation
• Data de-identification
• Outputs ML dataset
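The stages listed on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not the team's Spark code: the record fields (`patient_id`, `time`, `label`, `value`), the `CANONICAL` label map, the hourly aggregation window, and the salted-hash ID scheme are all hypothetical choices made for the example.

```python
import hashlib
from collections import defaultdict
from datetime import datetime

# Hypothetical map from raw charted labels to one canonical variable name.
CANONICAL = {"hr": "heart_rate", "pulse": "heart_rate", "heart rate": "heart_rate"}

def cleanse(rows):
    """Data cleansing: drop rows whose value is missing or not numeric."""
    cleaned = []
    for r in rows:
        try:
            cleaned.append({**r, "value": float(r["value"])})
        except (TypeError, ValueError):
            continue
    return cleaned

def munge(rows, admit_times):
    """Data munging: convert absolute timestamps to whole hours since admission."""
    out = []
    for r in rows:
        delta = r["time"] - admit_times[r["patient_id"]]
        out.append({**r, "hour": int(delta.total_seconds() // 3600)})
    return out

def tag(rows):
    """Data tagging: attach a canonical variable name to every row."""
    tagged = []
    for r in rows:
        key = r["label"].strip().lower()
        tagged.append({**r, "variable": CANONICAL.get(key, key)})
    return tagged

def aggregate(rows):
    """Data aggregation: mean value per (patient, hour, variable)."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[(r["patient_id"], r["hour"], r["variable"])].append(r["value"])
    return [
        {"patient_id": p, "hour": h, "variable": v, "value": sum(vals) / len(vals)}
        for (p, h, v), vals in sorted(buckets.items())
    ]

def deidentify(rows, salt):
    """De-identification (illustrative only): swap the ID for a salted hash.

    A real de-identification process involves far more than this step.
    """
    out = []
    for r in rows:
        digest = hashlib.sha256((salt + r["patient_id"]).encode()).hexdigest()
        out.append({"pid": digest[:12], "hour": r["hour"],
                    "variable": r["variable"], "value": r["value"]})
    return out

def run_pipeline(rows, admit_times, salt="demo-salt"):
    """Run the stages in the order listed on the slide."""
    rows = tag(munge(cleanse(rows), admit_times))
    return deidentify(aggregate(rows), salt)

# Made-up sample records for one patient admitted at 08:00.
rows = [
    {"patient_id": "A1", "time": datetime(2017, 1, 1, 8, 10), "label": "HR", "value": "120"},
    {"patient_id": "A1", "time": datetime(2017, 1, 1, 8, 40), "label": "pulse", "value": "130"},
    {"patient_id": "A1", "time": datetime(2017, 1, 1, 9, 5), "label": "hr", "value": None},
    {"patient_id": "A1", "time": datetime(2017, 1, 1, 9, 30), "label": "Heart Rate", "value": "110"},
]
admit = {"A1": datetime(2017, 1, 1, 8, 0)}
dataset = run_pipeline(rows, admit)
```

Each stage is a small function over a list of records, so the same shape could be carried over to Spark transformations; that mapping is left open here.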
23. How much data is enough?
Performance (on the same holdout set) increases with the amount of data available for training
24. Conclusions
• Our goal is to save lives by making timely and accurate predictions
• More data and better algorithms can lead to breakthroughs
• Data science/Machine Learning has a great potential in our space
• Our team continues to make progress in using big data technologies
to enable research and development
25. VPICU Research Team
Clinical Researchers
• Randall Wetzel, MD
• Roby Khemani, MD
• Sareen Shah, MD
Computer Scientists - CHLA
• Melissa Aczon, PhD - Senior Data Scientist
• Brett Bailey - Software Engineer
• Alysia Flynn, PhD - Data Ninja
• Alec Gunney - Data Scientist
• Long Van Ho - Data Scientist
• David Ledbetter - Senior Data Scientist
• Mohit Mehra – Lead Data Engineer
• Mike Reilly - Infrastructure
• Paul Vee - Senior Program Manager
Human-computer interaction, visualization
Jeff Heer Stanford
Diana Maclean Stanford
Pia Pal Stanford
Machine learning, similarity search
Ben Marlin UMass Amherst
Artificial intelligence, probabilistic models
Christian Shelton UC Riverside
Busra Celikkaya UC Riverside
Large-scale statistical analysis
Amy Braverman NASA JPL, UCLA
Data systems and software architecture
Dan Crichton NASA JPL
Chris Mattmann NASA JPL, USC
28. DS Challenges
• How do we generate features
• Irregular, sparsely sampled time-series measurements (HR vs BP vs
oxygenation vs medications)
• Missing data (should we impute or not, unit conversion)
• Noise in the data (fidelity in human charted vs machine generated)
• Hard NLP problems (ambiguity: MAP = mean arterial pressure or mean airway
pressure?; synonyms: heart rate = hr = pulse)
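One common way to handle irregular, sparsely sampled measurements is to place them on a regular time grid, carry the last observation forward, and keep a mask of which bins were actually observed, so a model can tell real values from imputed ones. The sketch below assumes that approach; the function name `align_to_grid` and the choice of last-observation-carried-forward are illustrative, not something the slides specify.

```python
def align_to_grid(samples, step, n_steps):
    """Place irregular (time, value) samples on a regular grid.

    Returns (values, observed). `values` carries the last observation
    forward (None before the first one); `observed` marks the bins that
    contain a real measurement.
    """
    values = [None] * n_steps
    observed = [False] * n_steps
    for t, v in sorted(samples):
        i = int(t // step)
        if 0 <= i < n_steps:
            values[i] = v          # the latest sample in a bin wins
            observed[i] = True
    last = None
    for i in range(n_steps):
        if observed[i]:
            last = values[i]
        else:
            values[i] = last       # imputed by carrying forward
    return values, observed

# Heart rate charted at minutes 5, 70 and 200, aligned to hourly bins.
hr_values, hr_observed = align_to_grid([(5, 120), (70, 118), (200, 110)], step=60, n_steps=5)
```

Keeping the `observed` mask next to the values leaves the "impute or not" decision open: downstream code can use the filled values, the mask, or both.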
Editor's Notes
Our mission is to assist doctors and clinicians in making accurate decisions based on DATA
- Lots of big data companies
But they are focused on customer analytics, etc., and not on saving lives
Why is pediatric healthcare not on this map?
- 1940s – 1990s: lots of specialized medical devices
- 2000s: much larger EMRs storing lots of data
- 2008 – 2015: steady growth in the adoption rate
BUT there are still challenges remaining
How are we addressing some of the challenges?
Hadoop plays a big part in data engineering
1. We chose Hadoop because of the promise of creating a single distributed data repository
2. A three-year journey with a small cluster, resulting in a data lake that houses different sources
Current Hadoop Stack:
HDP 2.5/Ambari 2.2
2 Name nodes/4 Data nodes
2 Edge nodes (MySQL, other)
Zeppelin, PySpark integration
Adding libraries such as matplotlib is pretty easy
The goal is to produce a dataset that machine learning code can ingest directly.
This dataset is in a de-identified “long format”, where all variables are stacked at each time interval
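As a rough illustration of what "long format" can mean, the sketch below stacks a wide table (one column per variable) into one row per patient, time interval, and variable. The column names `pid` and `hour` and the decision to skip missing values are assumptions made for the example, not the actual schema.

```python
def to_long(wide_rows, id_keys=("pid", "hour")):
    """Stack every non-key, non-missing column into its own row."""
    long_rows = []
    for row in wide_rows:
        ids = {k: row[k] for k in id_keys}
        for name, value in row.items():
            if name in id_keys or value is None:
                continue
            long_rows.append({**ids, "variable": name, "value": value})
    return long_rows

# Two hourly intervals for one de-identified patient (made-up values).
wide = [
    {"pid": "p1", "hour": 0, "heart_rate": 120, "sbp": None},
    {"pid": "p1", "hour": 1, "heart_rate": 118, "sbp": 85},
]
long_rows = to_long(wide)
```

A long layout like this stays compact when most variables are not measured in most intervals, which fits sparsely sampled clinical data.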
Since putting PHI in the cloud is a challenge, we use the AWS cluster primarily for de-identified, anonymized data
Using Hortonworks HDC to run Spark jobs to ingest and transform de-identified/anonymized data
- This is a really important slide as far as we are concerned.
- 50% training + 25% validation + 25% test
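A 50/25/25 split like the one noted above could be sketched as follows. Splitting at the patient level (so that no patient appears in more than one subset) and the fixed random seed are assumptions of this example; the notes do not say how the split was actually drawn.

```python
import random

def split_patients(patient_ids, seed=0):
    """Shuffle unique patient IDs and cut them 50% / 25% / 25%."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)   # seeded for a reproducible split
    n_train = len(ids) // 2
    n_val = len(ids) // 4
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# 100 hypothetical de-identified patient IDs.
splits = split_patients([f"p{i}" for i in range(100)])
```

When the number of patients is not divisible by four, the remainder falls into the test subset in this sketch.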