These slides use concepts from my (Jeff Funk's) course, Analyzing Hi-Tech Opportunities, to analyze how Big Data is becoming economically feasible for health care. They describe how the costs of sensors, data processing, data storage, and data analysis are falling; how new and better forms of storage and algorithms are being implemented; and what this means for sustainable health care. Together, these changes are enabling a move toward personalized health care.
1. BIG DATA
DIGITAL HEALTH
REVOLUTION
Alex A0135681
Henri A0135487
Zheng A0121892
Pham A0095804
Yin A0119974
Kavitha A0110143
For information on other technologies, see http://www.slideshare.net/Funk98/presentations
21. SENSORS IN THE FUTURE - BioMEMS and Microsystems
● Decreasing size
● Better and smaller communication chips and algorithms
● Micro-supercapacitors
● These advances will facilitate the arrival of new implantable chips
● Allows for unobtrusive, personalized medicine
● Allows for more tailored treatment
● Will require more data analysis and more processing power
23. The storage medium used now receives more focus than the quantity of storage used; storage is no longer one-size-fits-all. The "data deluge" is fundamentally changing the way that storage is approached.
HARDWARE
Introduction
24. Key characteristics of big data infrastructure:
● Provide real-time or near-real-time responses
● Handle huge data volumes that grow rapidly
● High processing/IOPS performance
● Very large capacity
HARDWARE
What's Key to Efficient Data Processing?
25. KEY DIFFERENTIATOR
● Big data is largely unstructured.
● Unstructured data is immutable: it is written once and read many times, not updated in place.
● Traditional file systems have built-in functions to handle inserts and updates.
● For immutable data, this machinery creates a lot of overhead in performance, in the I/Os required to access data, and in the ability to scale.
HARDWARE
WHY DO WE NEED A DIFFERENT APPROACH?
Fig. Annual growth of unstructured data
26. ● Objects live in one large, scalable pool of storage
● Stores metadata - information about the object
● An object ID is stored to locate the data
● Objects are immutable
● No file-system hierarchy (see the toy sketch below)
Products:
● Scality's RING architecture
● Dell DX
● EMC's Atmos
HARDWARE
OBJECT STORAGE - Choice of Storage
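A minimal toy model of the object-storage idea above, assuming nothing about any vendor's API: the `ObjectStore` class, its methods, and the metadata fields are all illustrative, not the interface of Scality RING, Dell DX, or EMC Atmos.

```python
import hashlib
import json

class ObjectStore:
    """Toy flat object store: immutable blobs located by an object ID."""

    def __init__(self):
        self._objects = {}  # one large flat pool; no directory hierarchy

    def put(self, data: bytes, metadata: dict) -> str:
        # Content-derived object ID; because objects are immutable,
        # re-putting identical bytes is a harmless no-op.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = (data, json.dumps(metadata))
        return object_id

    def get(self, object_id: str):
        data, meta = self._objects[object_id]
        return data, json.loads(meta)

store = ObjectStore()
oid = store.put(b"...scan bytes...", {"patient": "anon-001", "modality": "CT"})
data, meta = store.get(oid)  # located by ID alone: no path, no update-in-place
```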
28. ● Access times: SSDs exhibit virtually no access time.
● Random I/O performance: an SSD delivers at least 6,000 IO/s, 15 times faster than an HDD (~400 IO/s) (see the arithmetic below).
● Reliability: SSDs are 4-10 times more reliable.
HARDWARE
Storage Medium: Solid-State Drive (SSD) or Hard Disk Drive (HDD)?
SSD
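A quick back-of-the-envelope check of the slide's IOPS figures; the one-million-read workload is a hypothetical chosen only to make the difference tangible.

```python
ssd_iops = 6_000   # slide: "at least 6,000 IO/s"
hdd_iops = 400     # slide: "~400 IO/s"

print(ssd_iops / hdd_iops)             # 15.0 -> the quoted 15x advantage

reads = 1_000_000                      # hypothetical random-read workload
print(reads / ssd_iops / 60, "min")    # ~2.8 minutes on SSD
print(reads / hdd_iops / 60, "min")    # ~41.7 minutes on HDD
```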
29. REAL-TIME APPLICATIONS OF SSD
● Read-intensive video-on-demand (VOD) and image-retrieval applications
● Emerging applications (big data / Hadoop / cloud)
HARDWARE
Fig. Comparison of boot times using SSD & HDD
30. 2011: throughput 250 MB/s, capacity 512 GB
2014: data transfer 1,000 MB/s, capacity 4 TB, standard 2.5-inch form factor
Further scale-down of flash lithography leads to continued performance gains and greater capacity points.
HARDWARE
Solid-State Drives (SSDs) & Moore's Law
Fig 1. HDD areal density follows Moore's Law
Fig 2. Average price comparison of SSD vs. HDD
33. RAID (REDUNDANT ARRAY OF INDEPENDENT DISKS)
● Originally designed for small-capacity disks.
● As capacity increases, restoring a failed drive takes longer.
● To shorten these longer rebuild cycles, RAID systems ship with faster processors, leading to high energy consumption.
REPLICATION
● Copies add cost: typically 133% or more additional storage is needed for each additional copy.
● The storage system gets more expensive as the amount of data increases.
HARDWARE
Limitations of Traditional Approaches
34. How Does it Work?
● Information dispersal algorithms (IDAs) separate data into unrecognizable slices of information.
● The slices are then dispersed to storage nodes in disparate storage locations.
● Dispersal can be implemented locally or distributed.
● Only a pre-defined subset of the slices from the dispersed storage nodes is needed to fully retrieve all of the data (see the sketch below).
HARDWARE
Information Dispersal - a Better Approach?
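A minimal sketch of the dispersal idea, using a 2-of-3 XOR code so that any two slices reconstruct the data. Production IDAs use Reed-Solomon-style erasure codes for general k-of-n parameters, so treat this as an illustration of the property, not the algorithm vendors ship.

```python
def disperse(data: bytes) -> dict:
    """Split data into 3 slices such that ANY 2 of them reconstruct it."""
    if len(data) % 2:
        data += b"\x00"                      # pad to an even length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return {"A": a, "B": b, "P": parity}     # each slice goes to its own node

def reconstruct(slices: dict) -> bytes:
    """Recover the original data from any two of the three slices."""
    if "A" in slices and "B" in slices:
        a, b = slices["A"], slices["B"]
    elif "A" in slices:                      # node B lost: B = A xor P
        a = slices["A"]
        b = bytes(x ^ y for x, y in zip(a, slices["P"]))
    else:                                    # node A lost: A = B xor P
        b = slices["B"]
        a = bytes(x ^ y for x, y in zip(b, slices["P"]))
    return a + b

slices = disperse(b"patient record 42")
del slices["B"]                              # simulate losing one storage node
assert reconstruct(slices).rstrip(b"\x00") == b"patient record 42"
```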
35. ● It is resilient against natural disasters and technological failures, such as drive failures, system crashes, and network failures.
● Data can still be accessed in real time even if there are multiple simultaneous failures across a string of hosting devices, servers, or networks.
● Reliability of five nines or more is guaranteed with overhead as low as 20%, as opposed to three copies requiring 200% overhead (worked numbers below).
HARDWARE
Benefits of Information Dispersal
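The 20% vs. 200% figures follow directly from the overhead formula for a k-of-n dispersal, where n slices are stored and each slice is 1/k of the data's size. The 10-of-12 parameterization below is an assumption consistent with the slide's ~20% figure, not a number taken from it.

```python
def overhead(n_slices: int, k_needed: int) -> float:
    # raw bytes stored / useful bytes - 1, for a k-of-n dispersal
    return n_slices / k_needed - 1.0

print(overhead(3, 1))    # 2.0 -> three full replicas = 200% overhead
print(overhead(12, 10))  # 0.2 -> a 10-of-12 dispersal = ~20% overhead
```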
37. When looking at the number of years without data loss at a 99.99999% confidence level, information dispersal doesn't even appear on the chart: even for a large storage amount like 524,000 TB, the expected time without data loss is not within anyone's lifetime (theoretically over 79 million years).
HARDWARE
Cost Savings from IDA in Petabyte Storage over RAID and Replication
40. How do we match a huge dataset to ICD-10?
ALGORITHMS
Dealing with Huge Data
41. ICD-10 Clinical Modifications (ICD-10-CM): 69,823 codes
● 3-7 characters
● Character 1 is alpha
● Character 2 is numeric
● Characters 3-7 can be alpha or numeric
ICD-10 Procedure Coding System (ICD-10-PCS): 76,000 codes
● 7 characters
● Each one can be alpha or numeric
● Numbers 0-9; letters A-H, J-N, P-Z (see the validator sketch below)
ALGORITHMS
ICD-10 Introduction
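A small pair of validators built only from the character rules on this slide. Note that real ICD-10-CM codes are often written with a decimal point after the third character, which this compact form ignores, and the example codes are shape checks rather than clinically verified entries.

```python
import re

# ICD-10-CM: 3-7 chars; char 1 alpha, char 2 numeric, chars 3-7 alphanumeric.
ICD10_CM = re.compile(r"^[A-Z][0-9][A-Z0-9]{1,5}$")
# ICD-10-PCS: exactly 7 chars, each 0-9 or A-H, J-N, P-Z (no I or O,
# so codes are never confused with the digits 1 and 0).
ICD10_PCS = re.compile(r"^[0-9A-HJ-NP-Z]{7}$")

print(bool(ICD10_CM.match("E119")))      # True  (compact form of E11.9)
print(bool(ICD10_PCS.match("0DQ68ZZ")))  # True  (valid 7-character shape)
print(bool(ICD10_PCS.match("0DQ68IO")))  # False (I and O are excluded)
```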
42. Fig. Huge non-standard data sources, characterized by the 4 Vs (volume, velocity, variety, veracity), feed data feature selection and huge multi-character mapping databases, which in turn feed data analytics: machine-learning and image-retrieval systems.
ALGORITHMS
Why We Need Big Data
44. Diagnosis is a relatively straightforward machine-learning problem. Clinical decision making is highly suited to rule-based systems because of the nature of the data, such as ICD-10 codes and medications (see the sketch below).
ALGORITHMS
Machine Learning in Medical Diagnosis
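A minimal sketch of the rule-based flavor of decision support described above. The ICD-10 codes are real, but the rules and medication lists are invented for illustration and are not clinical guidance.

```python
def flag_followup(record: dict) -> list:
    """Apply simple hand-written rules over coded patient data."""
    flags = []
    codes = set(record.get("icd10", []))
    meds = set(record.get("medications", []))

    # Rule 1: diabetes coded (E11.9) but no glucose-lowering medication on file.
    if "E119" in codes and not meds & {"metformin", "insulin"}:
        flags.append("diabetes coded but no glucose-lowering medication")

    # Rule 2: hypertension (I10) plus an NSAID -> flag potential interaction.
    if "I10" in codes and "ibuprofen" in meds:
        flags.append("NSAID may blunt antihypertensive therapy")

    return flags

patient = {"icd10": ["E119", "I10"], "medications": ["ibuprofen"]}
print(flag_followup(patient))   # both rules fire for this synthetic record
```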
47. *ImageCLEF medical - a competition on medical image processing
Two main tasks:
● Image-based retrieval
● Case-based retrieval
Source: http://www.imageclef.org/
Fig. Number of images in the ImageCLEF database
ALGORITHMS
Database of the ImageCLEF Medical Competition
48. • This is the classic medical retrieval task.
• Similar to query by image example: given the query image, find the most similar images (see the sketch below).
Source: http://www.imageclef.org/
ALGORITHMS
Image-Based Retrieval Algorithm
Performance = difficulty × accuracy
Fig. Retrieval performance: mean average precision vs. number of images
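A minimal query-by-example sketch: rank database images by cosine similarity between feature vectors. How the vectors are produced (color histograms, CNN embeddings, ...) is left abstract, and the 8-dimensional random vectors below are purely synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
db_features = rng.random((1000, 8))   # 1,000 "images", 8-dim feature vectors
query = rng.random(8)                 # feature vector of the query image

# Normalize so a dot product equals cosine similarity.
db_norm = db_features / np.linalg.norm(db_features, axis=1, keepdims=True)
q_norm = query / np.linalg.norm(query)

scores = db_norm @ q_norm             # cosine similarity per database image
top10 = np.argsort(scores)[::-1][:10] # indices of the 10 most similar images
print(top10, scores[top10])
```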
49. • This is a more complex task, closer to the clinical workflow.
• A case description is provided, with patient demographics, limited symptoms, and test results including imaging studies (but not the final diagnosis).
• The goal is to retrieve cases, including images, that might best suit the provided case description (see the sketch below).
Source: http://www.imageclef.org/
ALGORITHMS
Case-Based Retrieval Algorithm
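A minimal sketch of the textual side of case-based retrieval, using TF-IDF and cosine similarity. The three case descriptions are invented one-liners; a real system would also fold in demographics and the image features discussed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cases = [  # synthetic case descriptions, not real patient data
    "65yo male, chest pain on exertion, abnormal ECG, elevated troponin",
    "8yo female, fever and productive cough, infiltrate on chest x-ray",
    "54yo male, polyuria and fatigue, HbA1c 9.1, no retinopathy",
]
query = "60 year old man with exertional chest pain and ECG changes"

vec = TfidfVectorizer()
case_matrix = vec.fit_transform(cases)          # TF-IDF vector per stored case
scores = cosine_similarity(vec.transform([query]), case_matrix)[0]

best = int(scores.argmax())
print(best, scores[best])   # case 0 ranks first for this query
```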
50.
                   Manual calculation   Software and algorithms
Speed              Slow                 Fast
Accuracy           Hard to maintain     Precise
Level of study     Quite hard           Easy to learn
Solution level     Shallow              Deep
Machine learning   No                   Yes
Result             Hard to explain      Explainable through visualization
ALGORITHMS
Manual Calculation vs. Software and Algorithms
53. ● More data can be gathered to identify patterns and interactions
● Doctors will use it for diagnosis and decision-making
● Health care costs will decrease
● Individual patient care will improve
TECHNOLOGICAL FUSION
CONCLUSION