“Big Data”
For
Medicine & Health Care -
An Introductory Tutorial
Frank W Meissner, MD, RDMS, RDCS
FACP, FACC, FCCP, FASNC, CPHIMS, CCDS
Diplomate- Subspecialty Board of Advanced Heart Failure & Transplant Cardiology
Diplomate - Certification Board of Cardiovascular Computed Tomography
Certified Professional Health Information and Management Systems
Diplomate- Subspecialty Board of Cardiovascular Diseases
Diplomate - Subspecialty Board of Critical Care Medicine
Diplomate - Certification Board of Nuclear Cardiology
Diplomate - American Board of Forensic Medicine
Diplomate- American Board of Internal Medicine
Diplomate - National Board of Echocardiography
Certified Cardiac Device Specialist - Physician
Big Data - Definition
(With Apologies to Douglas Adams)
“Big Data - You just won't believe how vastly, hugely,
mind-bogglingly big it is.”
The Hitchhiker’s Guide to the Galaxy
Seriously,
Big Data A Real Definition
Big data is an evolving term that describes any
voluminous amount of structured, semi-structured and
unstructured data that has the potential to be mined for
information
Although big data doesn't refer to any specific quantity,
the term is often used when speaking about petabytes
(PB) and exabytes (EB) of data
1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1000 terabytes
1 EB = 1000^6 bytes = 10^18 bytes = 1,000,000,000,000,000,000 B = 1000
petabytes = 1 million terabytes = 1 billion gigabytes
Data Source/Streams
Big Data Analytics
‘Big Data’ - An Operational Definition
Big Data is
High Volume
High Speed
High Variety
High Veracity
The data demand new types and forms of information processing to
support decision making, insight discovery, and process
optimization
A Proto-Typical Big Data Project
(With More Apologies to Douglas Adams)
"O Deep Thought computer," he said,
"the task we have designed you to perform is this.
We want you to tell us...." he paused, "The Answer."
"The Answer?" said Deep Thought. "The Answer to what?"
"Life!" urged Fook.
"The Universe!" said Lunkwill.
"Everything!" they said in chorus.
Deep Thought paused for a moment's reflection.
"Tricky," he said finally.
"But can you do it?"
Again, a significant pause.
"Yes," said Deep Thought, "I can do it."
"There is an answer?" said Fook with breathless excitement.
"Yes," said Deep Thought. "Life, the Universe, and Everything.
There is an answer. But, I'll have to think about it."
The 3-Dimensions of Big Data
Volume, Velocity, Variety
Data Validity/Veracity
The 4th Dimension of Big Data
Raw data may not be valid
May be incomplete (missing attributes or values)
May be ‘noisy’ (contains outliers or errors)
May be inconsistent (Invalid data, e.g., state/zip code mismatch )
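The three failure modes above can be sketched as simple per-record checks. This is a minimal illustration, not a validated cleaning pipeline: the field names, the plausible heart-rate range, and the small ZIP-prefix lookup are all assumptions made up for the example.

```python
# Hypothetical required fields for a patient record (assumption for illustration).
REQUIRED_FIELDS = ("patient_id", "state", "zip", "heart_rate")

# Illustrative subset only: first ZIP digit -> some of the states that use it.
ZIP_PREFIX_STATES = {
    "1": {"NY", "PA", "DE"},
    "9": {"CA", "OR", "WA", "AK", "HI"},
}

# Assumed plausibility bounds for heart rate in beats per minute.
HEART_RATE_RANGE = (20, 300)


def check_record(record):
    """Return a list of validity problems found in one record."""
    problems = []

    # Incomplete: a required attribute is missing or empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"incomplete: missing {field}")

    # Noisy: a value is present but falls outside the plausible range.
    heart_rate = record.get("heart_rate")
    low, high = HEART_RATE_RANGE
    if heart_rate is not None and not (low <= heart_rate <= high):
        problems.append("noisy: heart_rate outside plausible range")

    # Inconsistent: state and ZIP code disagree with each other.
    state, zip_code = record.get("state"), record.get("zip")
    if state and zip_code:
        allowed = ZIP_PREFIX_STATES.get(zip_code[0])
        if allowed is not None and state not in allowed:
            problems.append("inconsistent: state/zip mismatch")

    return problems
```

A record such as {"patient_id": "p2", "state": "CA", "zip": "10001", "heart_rate": 900} would be flagged as both noisy and inconsistent, while a record with an empty state is flagged only as incomplete.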
Data Variety
Aggregating structured and unstructured data
in preparation for data analysis
Nontrivial & complex task
As in all informatics efforts, standards for data
exchange are essential
Data Velocity
Salient Issue #1 - How often to sample your
data
Salient Issue #2 - How much can you afford to
pay for data sampling
Answers to #1 & #2 define data velocity
Data Volume
Not just the magnitude of storage
Wide variety of data also essential driver for
the ‘Big’ in Big Data
So Volume & Variety inexorably intertwined
In fact Data Volume is directly proportional to
data Variety & Velocity, i.e., specify Variety of
data sources & Velocity of data streams =>
Data Volume Requirements
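The relationship above can be written as a back-of-the-envelope estimate: each data source (Variety) contributes its sampling rate (Velocity) times its record size. The sketch below assumes constant sampling rates and fixed record sizes; the two example streams are hypothetical numbers, not measurements.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def annual_volume_bytes(sources):
    """Estimate bytes stored per year.

    `sources` is a list of (samples_per_second, bytes_per_sample) pairs,
    one pair per data stream.
    """
    return sum(rate * size * SECONDS_PER_YEAR for rate, size in sources)


# Hypothetical streams for a single monitored patient (assumed values).
example_sources = [
    (250, 2),  # one waveform channel sampled at 250 Hz, 2 bytes per sample
    (1, 16),   # one vital-signs record per second, 16 bytes each
]

example_volume = annual_volume_bytes(example_sources)
```

Doubling a stream's sampling rate doubles its contribution, and every added stream adds its own term to the sum, which is the sense in which Volume follows from Variety and Velocity.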
By 2015 Average Hospital
Generates 2/3 Petabyte of Patient Data Per Year
Predictable
‘Big Data’ Challenges
Analysis,
Capture,
Curation,
Search,
Sharing,
Storage,
Transfer,
Visualization,
Privacy Violations
Knowledge Discovery
Data Warehouse vs Big Data
Data Warehouse
Predefined & Structured Data
Non-operational relational data-base
On Line Analytical Processing of Data
Conventional SQL Query Tools
Exploratory Statistical Analysis
Data Visualization Techniques
K-nearest neighbor analysis
Decision Trees & Association Rules
Construction of Genetic Algorithms & Neural Networks
Knowledge Discovery
Via
The Data Warehouse
Big Data Approach
Undefined & Unstructured Data
Non-relational databases via the Hadoop Distributed File
System
Massively distributed data processing via Hadoop
(open-source Java-based programming framework
for processing large datasets in a distributed
computing environment) (Currently version 0.23)
Economical - traditional data storage $5 per
gigabyte - Hadoop storage $0.25 per gigabyte
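As a worked example of the per-gigabyte figures above, the sketch below prices one petabyte at both rates. It takes the slide's two prices at face value and uses decimal units (1 PB = 1,000,000 GB); the one-petabyte workload is an assumption chosen only to keep the arithmetic round.

```python
GB_PER_PB = 1_000_000  # decimal units: 10^15 bytes / 10^9 bytes


def storage_cost(gigabytes, dollars_per_gb):
    """Total storage cost in dollars at a flat per-gigabyte price."""
    return gigabytes * dollars_per_gb


traditional_cost = storage_cost(GB_PER_PB, 5.00)  # price quoted for traditional storage
hadoop_cost = storage_cost(GB_PER_PB, 0.25)       # price quoted for Hadoop storage
cost_ratio = traditional_cost / hadoop_cost
```

At these prices one petabyte costs $5,000,000 on traditional storage and $250,000 on Hadoop, a twentyfold difference.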
Knowledge Discovery
Data Warehouse vs Big Data
Other Open Source Tools
Avro - data serialization system
Cassandra - scalable multi-master database (critical design feature: no single
point of failure)
Chukwa - data collection system for managing large distributed systems
HBase - scalable distributed database supporting structured data storage of large
tables
Hive - data warehouse infrastructure providing data summarization & ad hoc
query capabilities
Mahout - scalable machine learning & data mining library
Pig - high-level data-flow language and execution framework for parallel
computation
ZooKeeper - high performance coordination service for distributed applications
Big Data System Architecture
Q: Why Hadoop?
A: Bigger Slice of the Info-Pie!
Classical Relational Data Model
Hadoop Data Model
Flat File Structure any Format
No data schema
Files automatically partitioned into defined blocks
Classical Distributed
Database Model
Transactional &
State Dependent
Atomicity
Consistency
Isolation
Durability
Hadoop Distributed
Database Model
Database “Job”
Job Divided into Tasks
Map-Reduce Computing Model
Every Task either a Map
or
Reduce
Hadoop Computing Framework
Two conceptual layers
Hadoop Distributed File System
File broken into definable blocks
Stored on minimum of 3 servers for fault tolerance
Execution engine (MapReduce)
Reduces file requests into smaller requests
Optimizes scalable use of CPU resources
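The file system layer described above (files broken into blocks, each block stored on three servers) can be illustrated with a toy model. This is a sketch under stated assumptions: the round-robin placement is a simplification chosen for clarity and is not the placement policy HDFS actually uses, and the function and node names are hypothetical.

```python
def split_into_blocks(data, block_size):
    """Break a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes.

    Simplified round-robin placement (an assumption for illustration).
    Returns a dict mapping block index -> list of node names.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_replicas(placement, failed_node):
    """Count the replicas of each block that remain after one node fails."""
    return {
        block: sum(1 for node in holders if node != failed_node)
        for block, holders in placement.items()
    }
```

With four nodes and three replicas per block, losing any single node still leaves at least two copies of every block, which is the fault-tolerance property the slide refers to.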
A Simple Example: Word Count
Count Each Occurrence of a Single Word in a Dataset
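The word-count example can be sketched as a single-process simulation of the map, shuffle, and reduce steps. This is not Hadoop code: a real job would run the map and reduce tasks on many nodes, and the function names here are chosen for illustration only.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of document strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For the two documents "big data is big" and "data is data", word_count returns counts of 2 for "big", 3 for "data", and 2 for "is". Each document can be mapped independently, which is what lets the framework spread the work across a cluster.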
A More Complex Task Join Databases
The network functions here like any peer-to-peer distributed file sharing
system such as that seen with the BitTorrent protocol
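One common way to express a join in the map-reduce style is a reduce-side join: the map step tags each row with its source table and emits it under the join key, and the reduce step pairs up rows that share a key. The sketch below is a single-process illustration with hypothetical table contents; it is not taken from the deck's diagram.

```python
from collections import defaultdict


def map_tagged(tag, rows, key):
    """Map: emit (join_key, (tag, row)) so the reducer knows each row's source."""
    for row in rows:
        yield row[key], (tag, row)


def reduce_join(tagged_rows):
    """Reduce: combine every left row with every right row for one key."""
    left = [row for tag, row in tagged_rows if tag == "left"]
    right = [row for tag, row in tagged_rows if tag == "right"]
    return [{**l, **r} for l in left for r in right]


def mapreduce_join(left_rows, right_rows, key):
    """Inner join of two lists of dicts on `key`, in map-shuffle-reduce form."""
    groups = defaultdict(list)
    for join_key, tagged in map_tagged("left", left_rows, key):
        groups[join_key].append(tagged)
    for join_key, tagged in map_tagged("right", right_rows, key):
        groups[join_key].append(tagged)
    joined = []
    for join_key in sorted(groups):
        joined.extend(reduce_join(groups[join_key]))
    return joined
```

Joining a patients table with a lab-results table on a shared patient_id field yields one combined row per matching pair, and keys that appear in only one table produce no output, as in an SQL inner join.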
A Generalized Schema
MapReduce Generalized Flow Schema
Hadoop Cluster
Hadoop File System (HDFS) building block of the computing cluster
HDFS breaks incoming files into blocks and stores with triple
redundancy across the network
Computation on the block occurs at the storage node
The well-known SETI@home project serves as an easily
understandable example of this computing model
File Characteristics
‘Write Once’ files - original input data not modified -
triple redundantly stored
Input data streamed into HDFS - processed by
MapReduce - any results stored back in HDFS
Obviously, HDFS is not a general-purpose file system
HDFS System Architecture
MapReduce
Programming Model Enabling Massive Distributed Parallel Computations
Originally proprietary Google Technology
Map() procedure performs filtering and sorting
Reduce() procedure performs summary operation
The model was inspired by, but is not strictly analogous to, the map & reduce
functions of functional programming
The power of the model lies in the parallel execution capability that is
its essential design feature
Some have criticized the problem set approachable by this technique
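The difference between the functional map and reduce functions and the MapReduce procedures can be seen side by side. In the sketch below, the functional version transforms a flat list and folds it into one value, while the MapReduce-style version filters records, emits key-value pairs, sorts them by key, and summarizes each key separately. The record layout (ward name, heart rate) is a hypothetical example.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def functional_style(values):
    """Functional map then reduce: square each value, then sum into one number."""
    squared = map(lambda v: v * v, values)
    return reduce(lambda acc, v: acc + v, squared, 0)


def mapreduce_style(records):
    """MapReduce-style: one summary per key rather than one overall result.

    `records` is a list of (ward, heart_rate) tuples; heart_rate may be None.
    """
    # Map: filter out unusable records and emit (key, value) pairs.
    pairs = [(ward, rate) for ward, rate in records if rate is not None]
    # Sort by key so that equal keys are adjacent, as groupby requires.
    pairs.sort(key=itemgetter(0))
    # Reduce: summarize the values for each key (here, the maximum).
    return {
        ward: max(rate for _, rate in group)
        for ward, group in groupby(pairs, key=itemgetter(0))
    }
```

The per-key grouping is what allows the reduce work to be divided among machines: each key's values can be summarized independently of every other key.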
Data Architecture Designs
Hadoop File System (HDFS)
Data storage component of the open-source Apache Hadoop project
Stores any type of data - structured, semi-structured, & unstructured,
e.g., email, social data, XML data, videos, audio files, photos, GPS, satellite images,
sensor data, spreadsheets, web log data, mobile data, RFID tags, pdf docs
A massively distributed file system optimized for parallel processing
Data Staging Platform
Minimally intrusive addition of Hadoop to enterprise architecture
Employing data processing power of Hadoop with structured data
Process Data
Data Architecture Designs
Processing Structured & Unstructured Data
Process Data
Global Archiving of all Data
Total Global Data Storage
Data Architecture Designs
Preserving the Classical Data Model: Processing Structured & Unstructured Data, Access via EDW
Embracing the Future Data Model: Processing Structured & Unstructured Data, Access via Hadoop
Data Architecture Designs
High-Yield Areas for Use
Pharmacological Research
Genomic and Genetic Research
Psychiatry / Behavioral Health
Novel Sensors & Sensor Analysis Algorithms
Epidemiological Research
Much Talked About - Little Concrete
Actionable Effects
Conclusion
“Things have never been more like the way they are today in history.”
Dwight D Eisenhower
“Things are more like they are now than they’ve ever been before.”
Gerald Ford
“Those who cannot remember the past are condemned to repeat it.”
George Santayana
Random Smattering of Articles
Predicting Breast Cancer Survivability Using Data Mining Techniques Bellaachia A & Guven
E. Age 2006, 58:10-110.
A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce
framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20,
no. 9, pp.1297–1303, 2010.
R. C. Taylor, “An overview of the Hadoop/MapReduce/HBase framework and its current
applications in bioinformatics,” BMC Bioinformatics, vol. 11, no. 12, article S1, 2010.
J. D. Osborne, J. Flatow, M. Holko et al., “Annotating the human genome with disease
ontology,” BMC Genomics, vol. 10, supplement 1, article S6, 2009.
B. Giardine, C. Riemer, R. C. Hardison et al., “Galaxy: a platform for interactive large-scale
genome analysis,” Genome Research, vol. 15, no. 10, pp. 1451–1455, 2005.
Steinberg GB, Church BW, McCall CJ, Scott AB, Kalis BP. Novel predictive models for
metabolic syndrome risk: a "big data" analytic approach. Am J Manag Care. 2014 Jun
1;20(6):e221-8.
Vaitsis C, Nilsson G, Zary N. Big data in medical informatics: improving education through
visual analytics. Stud Health Technol Inform. 2014;205:1163-7.
Ross MK, Wei W, Ohno-Machado L. "Big data" and the electronic health record. Yearb Med
Inform. 2014 Aug 15;9(1):97-104. doi: 10.15265/IY-2014-0003.

BigDataInMedicine.pptx

  • 1. “Big Data” For Medicine & Health Care - An Introductory Tutorial Frank W Meissner, MD, RDMS, RDCS FACP, FACC, FCCP, FASNC, CPHIMS, CCDS Diplomate- Subspecialty Board of Advanced Heart Failure & Transplant Cardiology Diplomate - Certification Board of Cardiovascular Computed Tomography Certified Professional Health Information and Management Systems Diplomate- Subspecialty Board of Cardiovascular Diseases Diplomate - Subspecialty Board of Critical Care Medicine Diplomate - Certification Board of Nuclear Cardiology Diplomate - American Board of Forensic Medicine Diplomate- American Board of Internal Medicine Diplomate - National Board of Echocardiography Certified Cardiac Device Specialist - Physician
  • 2. Big Data - Definition (With Apologies to Douglas Adams) “Big Data - You just won't believe how vastly, hugely, mind-bogglingly big it is.” The Hitchhiker’s Guide to the Galaxy
  • 3. Seriously, Big Data A Real Definition Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes (PB) and exabytes (EB) of data 1 PB = 1000000000000000 B = 10^15 bytes = 1000 terabytes 1 EB = 1000^6 bytes = 10^18 bytes = 1000000000000000000 B = 1000 petabytes = 1 million terabytes = 1 billion gigabytes
  • 5. ‘Big Data’ - An Operational Definition Big Data is High Volume High Speed High Variety High Veracity THE Data demands new types and forms of info processing to support decision support, insight discovery, and process optimization
  • 6. A Proto-Typical Big Data Project (With More Apologies to Douglas Adams) “O Deep Thought computer," he said, "the task we have designed you to perform is this. We want you to tell us...." he paused, "The Answer." "The Answer?" said Deep Thought. "The Answer to what?" "Life!" urged Fook. "The Universe!" said Lunkwill. "Everything!" they said in chorus. Deep Thought paused for a moment's reflection. "Tricky," he said finally. "But can you do it?" Again, a significant pause. "Yes," said Deep Thought, "I can do it." "There is an answer?" said Fook with breathless excitement. "Yes," said Deep Thought. "Life, the Universe, and Everything. There is an answer. But, I'll have to think about it."
  • 7. The 3-Dimensions of Big Data Volume, Velocity, Variety
  • 8. Data Validity/Veracity The 4th Dimension of Big Data Raw data may not be valid May be incomplete (missing attributes or values) May be ‘noisy’ (contains outliers or errors) May be inconsistent (Invalid data, e.g., state/zip code mismatch )
  • 9. Data Variety Aggregating structured and unstructured data in preparation for data analysis Nontrivial & complex task As in all Informatics efforts standards for data exchange are essential & vital
  • 10. Data Velocity Salient Issue #1 - How often to sample your data Salient Issue #2 - How much can you afford to pay for data sampling Answers to #1 & #2 define data velocity
  • 11. Data Volume Not just the magnitude of storage Wide variety of data also essential driver for the ‘Big’ in Big Data So Volume & Variety inexorably intertwined In fact Data Volume is directly proportional to data Variety & Velocity, i.e., specify Variety of data sources & Velocity of data streams => Data Volume Requirements
  • 12. By 2015 Average Hospital Generates 2/3 Petabyte of Patient Data Per Year
  • 14. Knowledge Discovery Data Warehouse vs Big Data Data Warehouse Predefined & Structured Data Non-operational relational data-base On Line Analytical Processing of Data Conventional SQL Query Tools Exploratory Statistical Analysis Data Visualization Techniques K-nearest neighbor analysis Decision Trees & Association Rules Construction of Genetic Algorithms & Neural Network
  • 16. Big Data Approach Undefined & UnStructured Data Non relational data-bases via Hadoop Distributed File System Massively Distributed Data Processing VIA Hadoop (open-source Java-based programming framework for processing large datasets in a distributed computing environment) (Currently version 0.23) Economical - traditional data storage $5 per gigabyte - Hadoop storage $0.25 per gigabyte Knowledge Discovery Data Warehouse vs Big Data
  • 17. Other Open Source Tools Avro - data serialization system Cassandra - scalable multi-master database (critical design feature no single points of failure) Chukwa - data collection system for managing large distributed systems Hbase - scalable distributed database supporting structured data storage of large tables Hive - data warehouse infrastructure providing data summarization & ad hoc query capacities Mahout - scalable machine learning & data mining library PIG - high-level data-flow language and execution framework for parallel computation ZooKeeper - high performance coordination service for distributed applications
  • 18. Big Data System Architecture
  • 19. Q: Why Hadoop? A: Bigger Slice of the Info- Pie!
  • 21. Hadoop Data Model Flat File Structure any Format No data schema Files automatically partitioned into defined blocks
  • 22. Classical Distributed Database Model Transactional & State Dependent Atomicity Consistency Isolation Durability
  • 23. Hadoop Distributed Database Model Database “Job” Job Divided into Tasks Map-Reduce Computing Model Every Task either a Map or Reduce
  • 24. Hadoop Computing Framework Two conceptual layers Hadoop Distributed File System File broken into definable blocks Stored on minimum of 3 servers for fault tolerance Execution engine (MapReduce) Reduces file requests into smaller requests Optimizes scalable use of CPU resources
  • 25. A Simple Example: Word Count Count Each Occurrence of a Single Word in a Dataset
  • 26. A More Complex Task Join Databases The network functions here like any peer-peer distributed file sharing system such as that seen with the bit- torrent protocol
  • 27. A Generalized Schema MapReduce Generalized Flow Schema
  • 28. Hadoop Cluster Hadoop File System (HDFS) building block of the computing cluster HDFS breaks incoming files into blocks and stores with triple redundancy across the network Computation on the block occurs at the storage node The Well Known SETI@home project serves as easily understandable example of this computing model
  • 29. File Characteristics ‘Write Once’ files - original input data not modified - triple redundantly stored Input data streamed into HDFS - processed by MapReduce - any results stored back in HDFS Obviously HDFS not general purpose file system
  • 31. MapReduce Programming Model Enabling Massive Distributed Parallel Computations Originally proprietary Google Technology Map() procedure performs filtering and sorting Reduce() procedure performs summary operation Model was inspired by, but is not strictly analogous to, the functional programming map & reduce functions The power of the model lies within the multi-threading capability that is its essential design feature Some have criticized the problem set approachable by this technique
  • 32. Data Architecture Designs Hadoop (HDFS) Hadoop File System data storage component of open source Apache Hadoop Project Stores any type of data - structured, semi-structured, & unstructured, e.g., email, social data, XML data, videos, audio files, photos, GPS, satellite images, sensor data, spreadsheets, web log data, mobile data, RFID tags, pdf docs A Massively Distributed File System Optimized for Parallel Processing
  • 33. Minimally intrusive addition of Hadoop to enterprise architecture Data Staging Platform Employing data processing power of Hadoop with structured data Process Data Data Architecture Designs
  • 34. Processing Structured & Unstructured Data Process Data Global Archiving of all Data Total Global Data Storage Data Architecture Designs
  • 35. Processing Structured & Unstructured Data Access via EDW Processing Structured & Unstructured Data Access via Hadoop Preserving The Classical Data Model Embracing The Future Data Model Data Architecture Designs
  • 36. High Yield Areas 4 Use Pharmacological Research Genomic and Genetic Research Psychiatry / Behavioral Health Novel Sensors & Sensor Analysis Algorithms Epidemiological Research Much Talked About - Little Concrete Actionable Effects
  • 37. Conclusion “Things have never been more like the way they are today in history.” Dwight D Eisenhower “Things are more like they are now than they’ve ever been before.” Gerald Ford “Those who cannot remember the past are condemned to repeat it.” George Santayana
  • 38. Random Smattering of Articles Predicting Breast Cancer Survivability Using Data Mining Techniques Bellaachia A & Guven E. Age 2006, 58:10-110. A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010. R. C. Taylor, “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics,” BMC Bioinformatics, vol. 11, no. 12, article S1, 2010. J. D. Osborne, J. Flatow, M. Holko et al., “Annotating the human genome with disease ontology,” BMC Genomics, vol. 10, supplement 1, article S6, 2009. B. Giardine, C. Riemer, R. C. Hardison et al., “Galaxy: a platform for interactive large-scale genome analysis,” Genome Research, vol. 15, no. 10, pp. 1451–1455, 2005. Steinberg GB, Church BW, McCall CJ, Scott AB, Kalis BP. Novel predictive models for metabolic syndrome risk: a "big data" analytic approach. Am J Manag Care. 2014 Jun 1;20(6):e221-8. Vaitsis C, Nilsson G, Zary N. Big data in medical informatics: improving education through visual analytics. Stud Health Technol Inform. 2014;205:1163-7. Ross MK, Wei W, Ohno-Machado L. "Big data" and the electronic health record. Yearb Med Inform. 2014 Aug 15;9(1):97-104. doi: 10.15265/IY-2014-0003.

Editor's Notes

  1. This talk will discuss basic concepts to allow understanding of the basic features of ‘big data’ and its analysis.
  2. According to the author Douglas Adams, Big Data is vastly, hugely, mind-bogglingly big. Of course, the sophisticated members of my audience will recognize that Adams was referring to the Universe itself in this quote, not to that subset of the universe called ‘Big Data.’
  3. This is a more conventional definition of Big Data. However, it doesn’t alter one whit the mind-boggling character of Big Data conveyed by the previous slide's definition.
  4. This slide represents the most common data streams independent of knowledge domain that are incorporated into a ‘Big Data’ project.
  5. This is the conventional and most commonly articulated ‘definition’ of Big Data.
  6. The central story idea in The Hitchhiker's Guide to the Galaxy revolves around the Earth as the universe's most advanced computation device, designed by trans-dimensional beings to answer the Big Data query noted above. Current plans and expectations for Big Data, now early in its hype cycle, seem only slightly less ambitious than answering the ultimate question.
  7. This illustration envisions Big Data as a data tsunami that feeds upon itself in an ever-expanding cycle of ever greater velocities, varieties and quantities of data.
  8. “There is a fifth dimension beyond that which is known to man. It is a dimension as vast as space and as timeless as infinity. It is the middle ground between light and shadow, between science and superstition, and it lies between the pit of man's fears and the summit of his knowledge. This is the dimension of imagination. It is an area which we call the Twilight Zone.” While there is no 5th dimension to Big Data, its now classical 3-dimensional representation is often augmented with a 4th dimension: data validity/veracity. For those of us interested in applying Big Data analytic products to the scientific practice of medicine, this is the most important of all data dimensions. As noted in this slide, raw data can be deficient in a multitude of ways: it may be incomplete, noisy, or inconsistent.
  9. Certainly the most exciting and potentially the most important dimension in the Big Data Tsunami is the free-form mixture of structured and unstructured data elements. The grandest unrealized challenge within the formal Grand Challenges of Medical Informatics (Sittig DF. Grand challenges in medical informatics? J Am Med Inform Assoc. 1994 Sep-Oct; 1(5): 412–413.) has been the field's struggle with the stubborn preference of medical practitioners for unstructured and often idiosyncratic formulations of diagnostic findings, hypotheses, diagnoses, and case summations. The full-fledged pursuit of a unified controlled medical vocabulary, which has obsessed the field literally since its inception, seems doomed given the expanding and ever accelerating volume of new knowledge and biomedical concepts discovered every calendar year. Thus an analytical methodology able to deal efficiently with unstructured but relevant and germane data seems to make the unified controlled medical vocabulary grand challenge, if not irrelevant, at least theoretically manageable. The third line in this slide emphasizes that, unlike the futile and hopeless dream that one can formally structure all clinical data input, all that is required for a ‘Big Data’ analysis is that the interface for data exchange is well formulated and follows a relevant, pre-agreed data standard. Conceptually the difference can be visualized by analogy with a fax machine. Rather than trying to specify all possible fax transmission messages by type with a unified nomenclature, all that is required is that fax transmission messages conform to a unified interface standard, so that (A)Fax_machine can exchange text data of any conceptual type with (B)Fax_machine without prior knowledge of what type of content is being exchanged.
  10. This slide emphasizes that articulation and specification of sampling frequency, coupled with an accurate estimate of the costs associated with data sampling and storage, are critical planning factors prior to developing and implementing a ‘Big Data’ project. As such, in Project Management terms, specifying data velocity is in essence determining the scope of your project.
  11. This slide emphasizes that while Data Volume is conceptualized as an independent element of the Data Tsunami, in fact Data Volume appears to be a linear function of the other two dimensions, i.e., if one can accurately specify the sources of the data streams while simultaneously specifying the velocity of those data streams, then the data volume requirements for the project are uniquely and deterministically defined.
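As a rough sketch of the relationship described in note 11, the yearly storage requirement can be estimated from each stream's sampling rate and sample size. The stream names and numbers below are made-up placeholders chosen only to show the arithmetic; they are not figures from the talk.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 3600


@dataclass
class DataStream:
    """One data source; all values used below are illustrative assumptions."""
    name: str
    samples_per_second: float  # velocity: how often the source is sampled
    bytes_per_sample: int      # size of one sample from this source


def yearly_volume_bytes(streams):
    """Volume = sum over streams of (sampling rate x sample size x time)."""
    return sum(
        s.samples_per_second * s.bytes_per_sample * SECONDS_PER_YEAR
        for s in streams
    )


# Hypothetical streams for a single monitored bed.
streams = [
    DataStream("ecg_waveform", samples_per_second=250, bytes_per_sample=2),
    DataStream("vital_signs", samples_per_second=1, bytes_per_sample=64),
]

total = yearly_volume_bytes(streams)
print(f"{total / 1e9:.1f} GB per year")
```

Doubling either the sampling rate of a stream or the number of streams of that kind doubles its contribution, which is the sense in which volume follows from variety and velocity.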
  12. It has been estimated that by next year the average hospital in the US will generate a total of 2/3 petabyte of patient data of all types (predominately video data), emphasizing the necessity of deploying ‘Big Data’ tools and techniques to tame the data tsunami that is threatening to wash away the foundations of US healthcare.
  13. Just as there are Grand Challenges for the field of medical informatics, there remain predictable challenges for ‘Big Data.’ The elephant in the room here seems to me to be the potential for privacy violation and compromise of HIPAA-mandated privacy laws and regulations, as well as of the bedrock ethical principle that patient-provider confidentiality is central to the medical encounter and must be preserved and safeguarded. Set against these legal and ethical mandates is the fact that, for the average layman, direct knowledge of ‘Big Data’ programs will probably be limited to highly and recently publicized NSA programs such as Stellarwind and PRISM. As such, ‘Big Data’ programs within the medical domain have to be meticulous and proactive in defining and describing their safeguards, so that data accumulation, manipulation, and aggregation can occur while privacy and anonymity are guaranteed.
  14. This slide details the classical data warehouse approach to knowledge discovery.
  15. This flow diagram is taken from my own paper on knowledge discovery via use of the data warehouse (Bothner U, Meissner FW. Wissen aus medizinischen Datenbanken nutzen. Dt Arztebl 1998;95: A-1336-1338 [Heft 20]). In many ways the exploration of ‘Big Data’ is identical in terms of the analytical tools used to analyze the data set. Specifically, all the online analytical tools mentioned in the previous slide have been used for the analysis of Big Data sets. However, one critical difference characterizes the analysis of Big Data sets: the analysis is done over the entire set of data, rather than over extracted data subsets. As such, any statistical analysis is done over the entire universe of discourse, rather than over samples as in conventional statistical analysis.
  16. The principal take-home message from this slide is the enormous cost efficiency of the Hadoop Distributed File System.
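To make the cost argument concrete, the per-gigabyte prices quoted on slide 16 can be applied to the 2/3 petabyte per year estimate from slide 12. This is only back-of-the-envelope arithmetic under those quoted figures, using decimal units as on slide 3; it ignores replication overhead, hardware refresh, and staffing.

```python
GB_PER_PETABYTE = 1_000_000  # decimal units: 1 PB = 10^15 B, 1 GB = 10^9 B

# Slide 12 estimate: 2/3 PB of patient data per hospital per year.
yearly_gb = (2 / 3) * GB_PER_PETABYTE

# Slide 16 prices, in US dollars per gigabyte.
traditional_cost = yearly_gb * 5.00
hadoop_cost = yearly_gb * 0.25

print(f"Traditional storage: ${traditional_cost:,.0f}")
print(f"Hadoop storage:      ${hadoop_cost:,.0f}")
print(f"Ratio:               {traditional_cost / hadoop_cost:.0f}x")
```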
  17. The analysis of ‘Big Data’ is facilitated by open source tools and techniques which contribute to its cost effectiveness. The tools discussed above are used with Hadoop to provide a full-featured computing environment.
  18. The relationships between these tools & the Hadoop Distributed File System are made explicit in this block diagram.
  19. This slide emphasizes that in the current ‘new data’ world, the vast majority of data is unstructured and resists organization and analysis by relational database techniques.
  20. In terms of compare and contrast, consider the Relational database Data model as illustrated above.
  21. Now consider the data model for Hadoop. Instead of a relational structure to the data model, i.e., each data element is characterized in relationship to other data elements and all are related to a data element key field; the Hadoop model is intrinsically flat and no predefined relationships are mandated on the data prior to data manipulation. The data is partitioned into defined blocks that are then distributed in a decentralized storage & computation schema.
  22. This slide illustrates the classical distributed database model. Conceptually, database operations are visualized as state dependent processes with a limited behavioral repertoire (insert data field, update data field, delete data field) with a final commit behavior once the data field manipulation is completed in the absence of error. In case of error or failure, the database state is returned to its pre-operation state.
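The commit-or-roll-back behavior described in note 22 can be illustrated with a small sketch using Python's built-in `sqlite3` module. SQLite is a single-node database, so this shows only the transactional (ACID) behavior, not distribution; the table and patient names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO patients (id, name) VALUES (1, 'Patient A')")
conn.commit()

try:
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("UPDATE patients SET name = 'Patient B' WHERE id = 1")
        # Violates the primary-key constraint, so the whole transaction fails.
        conn.execute("INSERT INTO patients (id, name) VALUES (1, 'Duplicate')")
except sqlite3.IntegrityError:
    pass  # the database is back in its pre-operation state

rows = conn.execute("SELECT id, name FROM patients").fetchall()
print(rows)  # [(1, 'Patient A')]
```

The update that preceded the failing insert is undone along with it, which is the atomicity property listed on the slide.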
  23. The Hadoop distributed processing model is completely different. Each operation is conceptualized as a ‘job’, and the MapReduce framework divides each job into tasks. Every task is either a Map task, which processes one block of the input and emits intermediate key-value pairs, or a Reduce task, which combines all the intermediate values that share a key into a final result. The job is complete when the last Reduce task has finished.
  24. While this slide might have been clearer placed before the previous one, the present order was chosen to allow comparison and contrast with the relational distributed database model. In any case, at the highest level of system analysis the Hadoop computing framework consists of two layers. The first is the Hadoop Distributed File System (HDFS), which is responsible for breaking even the largest data sets into definable and uniform blocks. Additionally, HDFS is responsible for writing each block with, at a minimum, triple redundancy. The other layer of the framework is the MapReduce execution engine, which takes the data file blocks and breaks file-sized processing requests into smaller so-called task requests. The MapReduce engine not only breaks the work into smaller tasks, it also tracks those tasks. In this way, optimal use is made of the cluster's CPU resources. To reiterate and for emphasis, Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
  25. Let us consider MapReduce in the setting of a simple word count task. In this scenario, input files are read from the Hadoop Distributed File System and divided into splits of a defined size. Each split is handed to a mapper, so the original task becomes several smaller tasks that each cover only part of the input (two halves in this illustration). Each mapper emits one key-value pair (word, 1) for every word it encounters. The framework then sorts and shuffles these intermediate pairs so that all pairs sharing a key arrive at the same reducer; for the sake of illustration this yields groups of different sizes, one per distinct word. Each reducer sums the values in its group, producing the key-value pairs (fruit type, number of instances within the input data set). These key-value pairs then represent the final program output. Now imagine this process occurring over a petabyte data set and one can get a feel for the power of the MapReduce model.
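The word count flow can be sketched in a few lines of plain Python. This is a single-process toy, assuming two tiny hard-coded splits in place of HDFS blocks; it is not Hadoop's API, and the function names are chosen for this illustration only.

```python
from collections import defaultdict


def map_words(split):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle step: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the ones emitted for a single word."""
    return key, sum(values)


# Two input splits standing in for two HDFS blocks (made-up data).
splits = ["apple banana apple", "cherry banana apple"]

mapped = [pair for split in splits for pair in map_words(split)]
grouped = shuffle(mapped)
word_counts = dict(reduce_counts(k, v) for k, v in sorted(grouped.items()))
print(word_counts)  # {'apple': 3, 'banana': 2, 'cherry': 1}
```

In a real cluster each call to `map_words` would run on the node that stores the split, and each key group would be reduced on whichever node the shuffle assigns it to.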
  26. Here is a more complicated MapReduce task. The goal is to take elements of two different datasets and join them into an integrated dataset. As noted, the network functions much like any peer-to-peer distributed sharing system, such as those built on the BitTorrent protocol. The difference is that in addition to sharing the data across the network, operations on the data are performed at the same network nodes that function as storage nodes.
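One common way to express such a join in the MapReduce style is a so-called reduce-side join: the map step tags each record with its source, the shuffle brings records with the same key together, and the reduce step pairs them up. The sketch below assumes two small invented datasets keyed by a hypothetical patient ID and runs in a single process; it is an illustration of the idea rather than the method shown on the slide.

```python
from collections import defaultdict

# Two hypothetical datasets sharing a patient ID as the join key.
patients = [(1, "Patient A"), (2, "Patient B")]
lab_results = [(1, "HbA1c 6.1"), (2, "LDL 130"), (1, "LDL 95")]


def map_tagged(records, tag):
    """Map step: emit (join_key, (source_tag, payload)) for each record."""
    return [(key, (tag, payload)) for key, payload in records]


def reduce_join(key, tagged_values):
    """Reduce step: pair every patient row with every lab row for one key."""
    names = [v for tag, v in tagged_values if tag == "patient"]
    labs = [v for tag, v in tagged_values if tag == "lab"]
    return [(key, name, lab) for name in names for lab in labs]


# Shuffle: bring all records with the same key together.
groups = defaultdict(list)
for key, value in map_tagged(patients, "patient") + map_tagged(lab_results, "lab"):
    groups[key].append(value)

joined = [row for key in sorted(groups) for row in reduce_join(key, groups[key])]
for row in joined:
    print(row)
```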
  27. Another way to look at MapReduce is as a 5-step parallel and distributed computation. (1) Prepare the Map() input: the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor will work on, and provides that processor with all the input data associated with that key value. (2) Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values A1…C8. (3) "Shuffle" the Map output to the Reduce processors: the MapReduce system designates Reduce processors, assigns the A1…C8 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value. (4) Run the user-provided Reduce() code: Reduce() is run exactly once for each A1…C8 key value produced by the Map step. (5) Produce the final output: the MapReduce system collects all the Reduce output and sorts it by A1…C8 to produce the final outcome. These five steps can be logically thought of as running in sequence, with each step starting only after the previous step is completed, although in practice they can be interleaved as long as the final result is not affected. MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted. In summary: in the "Map" step, each worker node applies the "map()" function to the local data and writes the output to temporary storage, while a master node ensures that only one of the redundant copies of the input data is processed. In the "Shuffle" step, worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node. In the "Reduce" step, worker nodes process each group of output data, per key, in parallel.
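The five steps can be sketched as one small generic driver. This is a toy that runs on a single machine with a thread pool standing in for worker nodes; `run_mapreduce`, the file names, and the readings are all hypothetical and chosen only to make the steps visible.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(inputs, map_fn, reduce_fn, workers=4):
    """Toy single-machine MapReduce; `inputs` maps input key -> input value."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Steps 1-2: hand each input key to a worker and run map_fn once per key.
        mapped = list(pool.map(lambda item: map_fn(*item), inputs.items()))

        # Step 3: shuffle, so that all values for one output key sit together.
        groups = defaultdict(list)
        for pairs in mapped:
            for out_key, value in pairs:
                groups[out_key].append(value)

        # Step 4: run reduce_fn once per output key, again in parallel.
        keys = sorted(groups)
        reduced = list(pool.map(lambda k: reduce_fn(k, groups[k]), keys))

    # Step 5: collect the reduce output, ordered by output key.
    return dict(zip(keys, reduced))


# Hypothetical input: heart-rate readings per patient, stored in two log files.
logs = {
    "ward_a.log": [("p1", 72), ("p2", 88)],
    "ward_b.log": [("p1", 95)],
}

peak_rates = run_mapreduce(
    logs,
    map_fn=lambda filename, readings: readings,  # readings are already (key, value)
    reduce_fn=lambda patient, rates: max(rates),
)
print(peak_rates)  # {'p1': 95, 'p2': 88}
```

Only `map_fn` and `reduce_fn` change from one job to the next; the driver, like the framework it imitates, owns the distribution, shuffling, and collection of results.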
  28. In addition to my appeal to the BitTorrent file sharing protocol as a means to understand MapReduce and the Hadoop File System, I am encouraging the audience to recall the SETI@home project, which was probably the first well-known example of massively parallel computing to which most laypersons have been exposed. In a similar way to the SETI system, Hadoop distributes data blocks within the Hadoop file sharing/information processing cluster, resulting in a massively parallel effort to process large data sets in the search for simple comparisons across those data sets, e.g., returning a list of similar books ordered by customers who have bought the book you just bought on amazon.com. This is a search result we all now take for granted, but conceptually we can now understand how it occurs in real time, without implementation of truly impossible relational database structures.
  29. Order is conferred on this very ad hoc file system in two ways: through triple redundancy, and by ensuring that all input files are ‘write once’ files, i.e., no modifications to input files are allowed, to ensure absolute data integrity.
  30. This slide illustrates the high-level systems architecture of HDFS. The name node is a single node in the computing cluster that is responsible for keeping track of the file system metadata. It additionally keeps a list of all the blocks within the HDFS as well as a list of all data nodes that host these blocks. I conceptualize the name node as analogous to a Domain Name Server on a TCP/IP network. Since it is a single point of failure in the system, it is provisioned with a resilient, highly available server. The data nodes form a shared-nothing cluster of computers capable of executing the workload components of the system.
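The division of labor between name node and data nodes can be sketched as follows. The block size is tiny on purpose, the node names are invented, and the round-robin placement is a deliberate simplification for illustration, not Hadoop's actual placement policy, which also takes rack topology into account.

```python
BLOCK_SIZE = 8    # bytes; tiny on purpose so the example stays readable
REPLICATION = 3   # triple redundancy, as described on the slides


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Partition a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, data_nodes, replication=REPLICATION):
    """Return name-node-style metadata: block index -> list of hosting nodes."""
    if len(data_nodes) < replication:
        raise ValueError("need at least as many data nodes as replicas")
    return {
        block_id: [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
        for block_id in range(num_blocks)
    }


data = b"write once, read many times"
nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data node names

blocks = split_into_blocks(data)
block_map = place_blocks(len(blocks), nodes)  # what the name node would track

for block_id, hosts in block_map.items():
    print(block_id, blocks[block_id], hosts)
```

The name node holds only `block_map`, never the blocks themselves, which is why losing it makes the data unreachable even though every block still exists on three data nodes.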
  31. A reiteration and summarization of the past several slides.
  32. Hadoop can be integrated into an enterprise-wide information system in various system configurations. This slide contrasts the independent Enterprise Data Warehouse with a standalone Hadoop file system.
  33. Hadoop can be integrated with the EDW (enterprise data warehouse) as a highly efficient distributed storage and data processing system for use with existing structured data sources.
  34. Additionally, while leaving the enterprise data warehouse as the sole vehicle for analysis of data, Hadoop can be used to process unstructured as well as structured data and add it to the EDW. Alternatively, it can be used as an efficient data archive in which all enterprise data is archived and stored via Hadoop nodes.
  35. In this configuration, the EDW remains the single point of entry to all the available data, but Hadoop can be utilized by conventional analytical programs for the analysis of large data sets using defined tools. The final data architecture design utilizes Hadoop as the sole point of contact for all enterprise-wide data and data analytics. The point of these last few slides was to emphasize the flexibility of the Hadoop system as well as to dispel the false dichotomy of either EDW or Hadoop; in fact, Hadoop plays well with others.
  36. This slide shows both current and projected areas of Big Data effort in the fields of biomedicine. Of course, given the enormous combinatorial complexity of genomics research, the application of Big Data techniques seems axiomatic. Additionally, given the financial resources and development costs involved in drug research, simulation and advanced analysis systems have the potential to dramatically reduce drug development costs. By way of analogy, the advent of modern ‘supercomputers’ was necessitated by treaty obligations that prevented all atomic weapons testing. Once the need for high-speed weapons effects simulations became a national priority, high-speed computing efforts became the focus of a technological revolution. Less obviously, given that this type of computing (highly distributed, massively parallel) was pioneered by consumer-driven web-based enterprises that were trying to understand ‘individual consumer choices’, psychiatric and behavioral health analysis and applications seem as axiomatic as genomics or pharmacological applications. Epidemiological research, by reason of the potential size of its data sets, also promises to yield significant insights from this computing methodology. Novel sensor analysis seems to me a long-term benefit of this type of computational capacity. For example, while heart rate variability analysis has been a tool of cardiology for as long as my career, it has always been utilized in the isolated clinical case. Having massive amounts of heart rate data linked to personal activity logs and temporal data promises to yield dramatic insights into sudden cardiac death, chronotropic dependences of AMI, neurohumoral and temporal factors dictating the onset of atrial fibrillation, relationships between exercise and the onset of cardiac disease, etc.
  37. While real results will be derived from this powerful new set of data manipulations, the reality is that we are on the ascending limb of the hype curve, and it is too soon to prognosticate whether this is an evolutionary or revolutionary change in computing methodology.