Big Data In Medicine

“Big Data”
For
Medicine & Health Care -
An Introductory Tutorial
Frank W Meissner, MD, RDMS, RDCS
FACP, FACC, FCCP, FASNC, CPHIMS, CCDS
Diplomate - Subspecialty Board of Advanced Heart Failure & Transplant Cardiology
Diplomate - Certification Board of Cardiovascular Computed Tomography
Certified Professional Health Information and Management Systems
Diplomate - Subspecialty Board of Cardiovascular Diseases
Diplomate - Subspecialty Board of Critical Care Medicine
Diplomate - Certification Board of Nuclear Cardiology
Diplomate - American Board of Forensic Medicine
Diplomate - American Board of Internal Medicine
Diplomate - National Board of Echocardiography
Certified Cardiac Device Specialist - Physician

Big Data - Definition
(With Apologies to Douglas Adams)
“Big Data -
You just won't believe how vastly,
hugely,
mind-bogglingly big it is.”
The Hitchhiker’s Guide to the Galaxy

Seriously,
Big Data - A Real Definition
Big data is an evolving term that describes any
voluminous amount of structured, semi-structured and
unstructured data that has the potential to be mined for
information
Although big data doesn't refer to any specific quantity,
the term is often used when speaking about petabytes
(PB) and exabytes (EB) of data
1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1000 terabytes
1 EB = 1000^6 bytes = 10^18 bytes = 1,000,000,000,000,000,000 B = 1000
petabytes = 1 million terabytes = 1 billion gigabytes

Data Source/Streams
Big Data Analytics

‘Big Data’ - An Operational Definition
Big Data is
High Volume
High Speed
High Variety
High Veracity
The data demand new types and forms of information processing to
enable decision support, insight discovery, and process
optimization

A Prototypical Big Data Project
(With More Apologies to Douglas Adams)
"O Deep Thought computer," he said,
"the task we have designed you to perform is this.
We want you to tell us...." he paused, "The Answer."
"The Answer?" said Deep Thought. "The Answer to what?"
"Life!" urged Fook.
"The Universe!" said Lunkwill.
"Everything!" they said in chorus.
Deep Thought paused for a moment's reflection.
"Tricky," he said finally.
"But can you do it?"
Again, a significant pause.
"Yes," said Deep Thought, "I can do it."
"There is an answer?" said Fook with breathless excitement.
"Yes," said Deep Thought. "Life, the Universe, and Everything.
There is an answer. But, I'll have to think about it."

The 3 Dimensions of Big Data
Volume, Velocity, Variety

Data Validity/Veracity
The 4th Dimension of Big Data
Raw data may not be valid
May be incomplete (missing attributes or values)
May be ‘noisy’ (contains outliers or errors)
May be inconsistent (invalid data, e.g., state/ZIP code mismatch)
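The three failure modes listed on this slide can be sketched as simple checks. This is a minimal illustration, not a production data-quality pipeline: the record fields, the partial state/ZIP prefix table, and the outlier threshold are all assumptions made up for the example.

```python
from statistics import median

# Hypothetical and deliberately incomplete table of ZIP prefixes per state.
STATE_ZIP_PREFIXES = {"TX": ("75", "76", "77", "78", "79"), "NM": ("87", "88")}
# Hypothetical set of fields every record is expected to carry.
REQUIRED_FIELDS = ("patient_id", "state", "zip", "heart_rate")


def find_incomplete(records):
    """Return indexes of records with a missing or empty required field."""
    return [
        i for i, rec in enumerate(records)
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS)
    ]


def find_inconsistent(records):
    """Return indexes of records whose ZIP code does not match the state."""
    flagged = []
    for i, rec in enumerate(records):
        prefixes = STATE_ZIP_PREFIXES.get(rec.get("state"))
        zip_code = rec.get("zip") or ""
        if prefixes is not None and zip_code and not zip_code.startswith(prefixes):
            flagged.append(i)
    return flagged


def find_noisy(records, field="heart_rate", threshold=5.0):
    """Return indexes of outliers, measured in median-absolute-deviation units."""
    values = [(i, rec[field]) for i, rec in enumerate(records)
              if rec.get(field) is not None]
    if not values:
        return []
    center = median(v for _, v in values)
    mad = median(abs(v - center) for _, v in values)
    if mad == 0:
        return []
    return [i for i, v in values if abs(v - center) / mad > threshold]


records = [
    {"patient_id": 1, "state": "TX", "zip": "79901", "heart_rate": 72},
    {"patient_id": 2, "state": "TX", "zip": "87501", "heart_rate": 68},    # ZIP/state mismatch
    {"patient_id": 3, "state": "NM", "zip": "88001", "heart_rate": None},  # missing value
    {"patient_id": 4, "state": "NM", "zip": "87101", "heart_rate": 700},   # implausible value
    {"patient_id": 5, "state": "TX", "zip": "75001", "heart_rate": 80},
]

print("incomplete:", find_incomplete(records))
print("inconsistent:", find_inconsistent(records))
print("noisy:", find_noisy(records))
```

The median-based outlier test is only one possible choice; it is used here because a single extreme value does not distort the median the way it distorts a mean.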

Data Variety
Aggregating structured and unstructured data
in preparation for data analysis
Nontrivial & complex task
As in all informatics efforts, standards for data
exchange are essential

Data Velocity
Salient Issue #1 - How often to sample your
data
Salient Issue #2 - How much can you afford to
pay for data sampling
Answers to #1 & #2 define data velocity

Data Volume
Not just the magnitude of storage
A wide variety of data is also an essential driver of
the ‘Big’ in Big Data
So Volume & Variety are inextricably intertwined
In fact, Data Volume is directly proportional to
data Variety & Velocity, i.e., specifying the Variety of
data sources & the Velocity of data streams determines the
Data Volume requirements
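The proportionality stated on this slide can be made concrete with a rough estimate: yearly volume is the sum, over every kind of data stream, of number of sources times sampling rate times sample size. The stream names and figures below are invented for illustration and are not taken from the presentation.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
BYTES_PER_TB = 10**12


@dataclass
class DataStream:
    name: str
    sources: int               # how many devices or feeds produce this stream
    samples_per_second: float  # velocity of each source
    bytes_per_sample: int      # size of one sample


def yearly_volume_bytes(streams):
    """Sum sources * rate * sample size * seconds per year over all streams."""
    return sum(
        s.sources * s.samples_per_second * s.bytes_per_sample * SECONDS_PER_YEAR
        for s in streams
    )


# Hypothetical hospital streams: more stream types (variety) or faster
# sampling (velocity) both raise the storage requirement.
streams = [
    DataStream("bedside monitors", sources=200, samples_per_second=1,
               bytes_per_sample=500),
    DataStream("imaging studies", sources=1, samples_per_second=0.01,
               bytes_per_sample=50_000_000),
]

total = yearly_volume_bytes(streams)
print(f"Estimated volume: {total / BYTES_PER_TB:.1f} TB per year")
```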

By 2015 Average Hospital
Generates 2/3 Petabyte of Patient Data Per Year

Predictable
‘Big Data’ Challenges
Analysis,
Capture,
Curation,
Search,
Sharing,
Storage,
Transfer,
Visualization,
Privacy Violations

Knowledge Discovery
Data Warehouse vs Big Data
Data Warehouse
Predefined & Structured Data
Non-operational relational database
Online Analytical Processing (OLAP) of Data
Conventional SQL Query Tools
Exploratory Statistical Analysis
Data Visualization Techniques
K-nearest neighbor analysis
Decision Trees & Association Rules
Construction of Genetic Algorithms & Neural Networks

Knowledge Discovery
Via
The Data Warehouse

Knowledge Discovery
Data Warehouse vs Big Data
Big Data Approach
Undefined & Unstructured Data
Non-relational databases via the Hadoop Distributed File
System
Massively distributed data processing via Hadoop
(an open-source, Java-based programming framework
for processing large datasets in a distributed
computing environment) (Currently version 0.23)
Economical - traditional data storage $5 per
gigabyte - Hadoop storage $0.25 per gigabyte
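As a rough worked example, the per-gigabyte prices on this slide can be applied to the 2/3 petabyte per year quoted earlier for an average hospital. The sketch assumes decimal units (1 PB = 10^6 GB) and ignores everything except raw storage price.

```python
GB_PER_PB = 10**6  # decimal units: 1 PB = 1,000,000 GB


def storage_cost(petabytes, dollars_per_gb):
    """Raw storage cost in dollars for a given volume and unit price."""
    return petabytes * GB_PER_PB * dollars_per_gb


yearly_pb = 2 / 3  # figure quoted on the hospital data slide
traditional = storage_cost(yearly_pb, 5.00)
hadoop = storage_cost(yearly_pb, 0.25)

print(f"Traditional storage: ${traditional:,.0f}")
print(f"Hadoop storage:      ${hadoop:,.0f}")
print(f"Ratio:               {traditional / hadoop:.0f}x")
```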

Other Open Source Tools
Avro - data serialization system
Cassandra - scalable multi-master database (critical design feature: no single point of failure)
Chukwa - data collection system for managing large distributed systems
HBase - scalable distributed database supporting structured data storage of large tables
Hive - data warehouse infrastructure providing data summarization & ad hoc query capabilities
Mahout - scalable machine learning & data mining library
Pig - high-level data-flow language and execution framework for parallel computation
ZooKeeper - high performance coordination service for distributed applications

Q: Why Hadoop?
A: Bigger Slice of the Info-Pie!

Classical Relational Data Model

Hadoop Data Model
Flat file structure, any format
No data schema
Files automatically partitioned into defined blocks

Classical Distributed
Database Model
Transactional &
State Dependent
Atomicity
Consistency
Isolation
Durability

Hadoop Distributed
Database Model
Database “Job”
Job Divided into Tasks
Map-Reduce Computing Model
Every Task either a Map
or
Reduce

Hadoop Computing Framework
Two conceptual layers
Hadoop Distributed File System
File broken into definable blocks
Each block stored on 3 servers by default for fault tolerance
Execution engine (MapReduce)
Reduces file requests into smaller requests
Optimizes scalable use of CPU resources

A Simple Example: Word Count
Count Each Occurrence of a Single Word in a Dataset
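The word count task can be sketched in plain Python to show the three phases of the model: map, group by key, reduce. This runs in a single process and only imitates what Hadoop distributes across a cluster; the function names are invented for the illustration and are not Hadoop APIs.

```python
import re
from collections import defaultdict


def map_words(document):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    for word in re.findall(r"[a-z']+", document.lower()):
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce task: collapse all the 1s emitted for a word into a total."""
    return word, sum(counts)


def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    pairs = (pair for doc in documents for pair in map_words(doc))
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(pairs).items())


documents = ["the answer to life", "the universe and the answer"]
print(word_count(documents))
```

Each call to map_words touches only its own document and each call to reduce_counts touches only one word, which is what allows the real framework to run them on different machines.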

A More Complex Task: Join Databases
The network functions here like any peer-to-peer distributed file sharing
system, such as that seen with the BitTorrent protocol
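The diagram for this slide is not in the transcript. One common way to express a join in the MapReduce model is a reduce-side join, sketched below; it may not be the exact scheme the slide depicted. The table contents and field names are hypothetical.

```python
from collections import defaultdict


def map_patients(row):
    """Map task for the patient table: key by patient_id, tag the source."""
    return row["patient_id"], ("patient", row)


def map_labs(row):
    """Map task for the lab table: key by patient_id, tag the source."""
    return row["patient_id"], ("lab", row)


def reduce_join(tagged_rows):
    """Reduce task: pair every patient row with every lab row for one key."""
    patients = [row for tag, row in tagged_rows if tag == "patient"]
    labs = [row for tag, row in tagged_rows if tag == "lab"]
    return [{**p, **lab} for p in patients for lab in labs]


def join(patients, labs):
    """Inner join of two tables on patient_id, in map / group / reduce steps."""
    grouped = defaultdict(list)
    emitted = [map_patients(r) for r in patients] + [map_labs(r) for r in labs]
    for key, value in emitted:
        grouped[key].append(value)
    joined = []
    for key in sorted(grouped):
        joined.extend(reduce_join(grouped[key]))
    return joined


patients = [
    {"patient_id": 1, "name": "A"},
    {"patient_id": 2, "name": "B"},
]
labs = [
    {"patient_id": 1, "test": "K", "value": 4.1},
    {"patient_id": 1, "test": "Na", "value": 140},
    {"patient_id": 3, "test": "K", "value": 5.0},
]

for row in join(patients, labs):
    print(row)
```

Tagging each row with its source table lets a single reduce step tell the two inputs apart after they have been grouped under the same key.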

A Generalized Schema
MapReduce Generalized Flow Schema

Hadoop Cluster
Hadoop Distributed File System (HDFS) is the building block of the computing cluster
HDFS breaks incoming files into blocks and stores them with triple
redundancy across the network
Computation on a block occurs at the node that stores it
The well-known SETI@home project serves as an easily
understandable example of this computing model
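Block splitting and triple-redundant storage can be sketched as follows. The round-robin placement is a deliberate simplification chosen for the example; the actual HDFS placement policy is more involved and is not reproduced here. Node names and the tiny block size are hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_replicas(placement, failed_node):
    """Count the replicas of each block that remain if one node is lost."""
    return {
        block: sum(1 for node in holders if node != failed_node)
        for block, holders in placement.items()
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])

print(blocks)
print(placement)
print(surviving_replicas(placement, "node2"))
```

Because every block lives on three different nodes, the loss of any single node still leaves at least two copies of each block, and a map task can be scheduled on whichever surviving node holds the data.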

File Characteristics
‘Write Once’ files - original input data not modified -
triple redundantly stored
Input data streamed into HDFS - processed by
MapReduce - any results stored back in HDFS
HDFS is therefore not a general-purpose file system

MapReduce
Programming Model Enabling Massive Distributed Parallel Computations
Originally proprietary Google Technology
Map() procedure performs filtering and sorting
Reduce() procedure performs summary operation
The model was inspired by, but is not strictly analogous to, the map & reduce
functions of functional programming
The power of the model lies in the multi-threading capability that is
its essential design feature
Some have criticized the limited set of problems approachable by this technique
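For comparison, the functional programming map and reduce functions that inspired the model look like this in Python. The input values are arbitrary.

```python
from functools import reduce

values = [1, 2, 3, 4]

# map applies a function to each element independently.
squares = list(map(lambda x: x * x, values))

# reduce folds the whole sequence into a single summary value.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)
print(total)
```

The difference from MapReduce is that the functional reduce folds one sequence into one value, whereas a MapReduce Map() emits key/value pairs and Reduce() runs once per key over the values grouped under it.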

Data Architecture Designs
Hadoop (HDFS)
Hadoop File System: data storage component of the open-source Apache Hadoop Project
Stores any type of data - structured, semi-structured, & unstructured,
e.g., email, social data, XML data, videos, audio files, photos, GPS, satellite images,
sensor data, spreadsheets, web log data, mobile data, RFID tags, PDF docs
A Massively Distributed File System Optimized for Parallel Processing

Minimally intrusive addition of Hadoop to enterprise architecture
Data Staging Platform
Employing data processing power of Hadoop with structured data
Process Data

Processing Structured & Unstructured Data
Process Data
Global Archiving of all Data
Total Global Data Storage

Processing Structured & Unstructured Data, Access via EDW
Processing Structured & Unstructured Data, Access via Hadoop
Preserving the Classical Data Model
Embracing the Future Data Model

High-Yield Areas for Use
Pharmacological Research
Genomic and Genetic Research
Psychiatry / Behavioral Health
Novel Sensors & Sensor Analysis Algorithms
Epidemiological Research
Much Talked About - Little Concrete
Actionable Effects

Conclusion
“Things have never been more like the way they are today in history.”
Dwight D Eisenhower
“Things are more like they are now than they’ve ever been before.”
Gerald Ford
“Those who cannot remember the past are condemned to repeat it.”
George Santayana

Random Smattering of Articles
Predicting Breast Cancer Survivability Using Data Mining Techniques Bellaachia A & Guven
E. Age 2006, 58:10-110.
A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce
framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20,
no. 9, pp.1297–1303, 2010.
R. C. Taylor, “An overview of the Hadoop/MapReduce/HBase framework and its current
applications in bioinformatics,” BMC Bioinformatics, vol. 11, no. 12, article S1, 2010.
J. D. Osborne, J. Flatow, M. Holko et al., “Annotating the human genome with disease
ontology,” BMC Genomics, vol. 10, supplement 1, article S6, 2009.
B. Giardine, C. Riemer, R. C. Hardison et al., “Galaxy: a platform for interactive large-scale
genome analysis,” Genome Research, vol. 15, no. 10, pp. 1451–1455, 2005.
Steinberg GB, Church BW, McCall CJ, Scott AB, Kalis BP. Novel predictive models for
metabolic syndrome risk: a "big data" analytic approach. Am J Manag Care. 2014 Jun
1;20(6):e221-8.
Vaitsis C, Nilsson G, Zary N. Big data in medical informatics: improving education through
visual analytics. Stud Health Technol Inform. 2014;205:1163-7.
Ross MK, Wei W, Ohno-Machado L. "Big data" and the electronic health record. Yearb Med
Inform. 2014 Aug 15;9(1):97-104. doi: 10.15265/IY-2014-0003.
