1. “Big Data” for Medicine & Health Care - An Introductory Tutorial
Frank W Meissner, MD, RDMS, RDCS
FACP, FACC, FCCP, FASNC, CPHIMS, CCDS
Diplomate- Subspecialty Board of Advanced Heart Failure & Transplant Cardiology
Diplomate - Certification Board of Cardiovascular Computed Tomography
Certified Professional Health Information and Management Systems
Diplomate- Subspecialty Board of Cardiovascular Diseases
Diplomate - Subspecialty Board of Critical Care Medicine
Diplomate - Certification Board of Nuclear Cardiology
Diplomate - American Board of Forensic Medicine
Diplomate- American Board of Internal Medicine
Diplomate - National Board of Echocardiography
Certified Cardiac Device Specialist - Physician
2. Big Data - Definition
(With Apologies to Douglas Adams)
“Big Data - you just won't believe how vastly, hugely, mind-bogglingly big it is.”
The Hitchhiker’s Guide to the Galaxy
3. Seriously, Big Data - A Real Definition
Big data is an evolving term that describes any voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information.
Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes (PB) and exabytes (EB) of data.
1 PB = 10^15 bytes = 1,000,000,000,000,000 B = 1,000 terabytes
1 EB = 1000^6 bytes = 10^18 bytes = 1,000,000,000,000,000,000 B = 1,000 petabytes = 1 million terabytes = 1 billion gigabytes
5. ‘Big Data’ - An Operational Definition
Big Data is
High Volume
High Speed
High Variety
High Veracity
data that demands new types and forms of information processing to support decision making, insight discovery, and process optimization
6. A Proto-Typical Big Data Project
(With More Apologies to Douglas Adams)
“O Deep Thought computer," he said,
"the task we have designed you to perform is this.
We want you to tell us...." he paused, "The Answer."
"The Answer?" said Deep Thought. "The Answer to what?"
"Life!" urged Fook.
"The Universe!" said Lunkwill.
"Everything!" they said in chorus.
Deep Thought paused for a moment's reflection.
"Tricky," he said finally.
"But can you do it?"
Again, a significant pause.
"Yes," said Deep Thought, "I can do it."
"There is an answer?" said Fook with breathless excitement.
"Yes," said Deep Thought. "Life, the Universe, and Everything.
There is an answer. But, I'll have to think about it."
8. Data Validity/Veracity
The 4th Dimension of Big Data
Raw data may not be valid
May be incomplete (missing attributes or values)
May be ‘noisy’ (contains outliers or errors)
May be inconsistent (invalid data, e.g., state/zip code mismatch)
9. Data Variety
Aggregating structured and unstructured data
in preparation for data analysis
Nontrivial & complex task
As in all informatics efforts, standards for data exchange are essential & vital
10. Data Velocity
Salient Issue #1 - How often to sample your data
Salient Issue #2 - How much can you afford to pay for data sampling
Answers to #1 & #2 define data velocity
11. Data Volume
Not just the magnitude of storage
Wide variety of data also essential driver for
the ‘Big’ in Big Data
So Volume & Variety are inexorably intertwined
In fact, Data Volume is directly proportional to data Variety & Velocity; i.e., specifying the Variety of data sources & the Velocity of data streams determines the Data Volume requirements (a back-of-envelope sketch follows)
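As a purely illustrative back-of-envelope calculation (the stream names, sampling rates, and record sizes below are invented assumptions, not figures from this talk), fixing Variety and Velocity does deterministically fix Volume:

```python
# Toy estimate: data Volume as a function of Variety (which streams exist)
# and Velocity (how often each stream is sampled).
# All figures below are hypothetical, for illustration only.

streams = {                     # name: (samples per day, bytes per sample)
    "telemetry_ecg":   (86_400, 512),      # e.g., one summary record per second
    "lab_results":     (200,    2_048),
    "imaging_reports": (50,     4_096),
}

DAYS = 365
REPLICATION = 3                 # HDFS-style triple redundancy

raw = sum(rate * size for rate, size in streams.values()) * DAYS
print(f"Raw volume/year:   {raw / 1e9:.1f} GB")
print(f"Stored (x{REPLICATION}):        {raw * REPLICATION / 1e9:.1f} GB")
```

Doubling either the number of streams (Variety) or the sampling rates (Velocity) doubles the storage requirement, which is exactly the proportionality this slide asserts.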
12. By 2015 the Average Hospital Generates 2/3 of a Petabyte of Patient Data Per Year
14. Knowledge Discovery
Data Warehouse vs Big Data
Data Warehouse
Predefined & Structured Data
Non-operational relational database
On Line Analytical Processing of Data
Conventional SQL Query Tools
Exploratory Statistical Analysis
Data Visualization Techniques
K-nearest neighbor analysis
Decision Trees & Association Rules
Construction of Genetic Algorithms & Neural Networks
16. Big Data Approach
Undefined & Unstructured Data
Non-relational databases via the Hadoop Distributed File System
Massively Distributed Data Processing via Hadoop (an open-source, Java-based programming framework for processing large datasets in a distributed computing environment) (currently version 0.23)
Economical - traditional data storage $5 per gigabyte vs. Hadoop storage $0.25 per gigabyte
17. Other Open Source Tools
Avro - data serialization system
Cassandra - scalable multi-master database (critical design feature: no single point of failure)
Chukwa - data collection system for managing large distributed systems
HBase - scalable distributed database supporting structured data storage of large tables
Hive - data warehouse infrastructure providing data summarization & ad hoc query capabilities
Mahout - scalable machine learning & data mining library
Pig - high-level data-flow language and execution framework for parallel computation
ZooKeeper - high performance coordination service for distributed applications
24. Hadoop Computing Framework
Two conceptual layers
Hadoop Distributed File System
Files are broken into definable blocks
Each block is stored on a minimum of 3 servers for fault tolerance
Execution engine (MapReduce)
Breaks large file-processing requests into smaller task requests
Optimizes scalable use of CPU resources
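As a conceptual aid only, here is a minimal Python sketch of the storage layer's two ideas, block splitting and triple replication. It imitates the concept rather than the real NameNode/DataNode protocol, and the block size, file size, and server names are invented for illustration:

```python
# Toy sketch of the HDFS storage idea: split a file into fixed-size
# blocks and place each block on 3 distinct servers for fault tolerance.
# Conceptual only; real HDFS uses a NameNode/DataNode protocol.
import random

BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB, a classic HDFS default
SERVERS = [f"datanode-{i}" for i in range(8)]
REPLICAS = 3

def split_into_blocks(file_size: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs covering the whole file."""
    return [(off, min(BLOCK_SIZE, file_size - off))
            for off in range(0, file_size, BLOCK_SIZE)]

def place_blocks(blocks):
    """Assign each block to REPLICAS distinct servers."""
    return {i: random.sample(SERVERS, REPLICAS) for i, _ in enumerate(blocks)}

blocks = split_into_blocks(200 * 1024 * 1024)   # a hypothetical 200 MB file
for block_id, nodes in place_blocks(blocks).items():
    print(f"block {block_id}: stored on {nodes}")
```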
25. A Simple Example: Word Count
Count Each Occurrence of a Single Word in a Dataset
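A minimal sketch of the word-count job in the Hadoop Streaming style (Python reading stdin and writing stdout) may make the pattern concrete; this is a conceptual stand-in for the usual Java MapReduce job, not production code:

```python
# Minimal word-count sketch in the Hadoop Streaming style: a mapper that
# emits (word, 1) pairs and a reducer that sums counts per word.
import sys
from itertools import groupby

def mapper(lines):
    """Map: emit (word, 1) for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce: sum counts per word; input must arrive sorted by key,
    which is what Hadoop's shuffle phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    mapped = sorted(mapper(sys.stdin))   # in-memory sort stands in for the shuffle
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```

Run as, say, echo "to be or not to be" | python wordcount.py (the filename is assumed), this prints each distinct word with its count; in real Hadoop the sort/shuffle happens across the cluster rather than in one process.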
26. A More Complex Task: Joining Databases
The network functions here like any peer-to-peer distributed file sharing system, such as that seen with the BitTorrent protocol
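Below is a toy "reduce-side join" sketch in Python; the table contents and field names are invented for illustration. The map step tags each record with its source table, the shuffle groups records by the join key, and the reduce step pairs rows across the two sources:

```python
# Sketch of a reduce-side join: map each record to (join_key, (source, row)),
# shuffle by key, then pair rows from the two sources in the reducer.
from collections import defaultdict

patients = [("p1", "Smith"), ("p2", "Jones")]             # (patient_id, name)
labs     = [("p1", "HbA1c 6.1"), ("p1", "LDL 130"),
            ("p2", "HbA1c 7.4")]                           # (patient_id, result)

# Map phase: tag every record with its source table.
mapped = [(pid, ("patients", name)) for pid, name in patients] + \
         [(pid, ("labs", res)) for pid, res in labs]

# Shuffle phase: group all tagged records by the join key.
by_key = defaultdict(list)
for key, tagged in mapped:
    by_key[key].append(tagged)

# Reduce phase: cross-match the two sources per key.
for key, rows in sorted(by_key.items()):
    names   = [v for src, v in rows if src == "patients"]
    results = [v for src, v in rows if src == "labs"]
    for name in names:
        for res in results:
            print(key, name, res)
```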
28. Hadoop Cluster
Hadoop File System (HDFS) is the building block of the computing cluster
HDFS breaks incoming files into blocks and stores them with triple redundancy across the network
Computation on a block occurs at the storage node
The well-known SETI@home project serves as an easily understandable example of this computing model
29. File Characteristics
‘Write once’ files - original input data is never modified and is stored with triple redundancy
Input data is streamed into HDFS, processed by MapReduce, and any results are stored back in HDFS
Obviously, HDFS is not a general-purpose file system
31. MapReduce
Programming Model Enabling Massive Distributed Parallel Computations
Originally proprietary Google technology
Map() procedure performs filtering and sorting
Reduce() procedure performs a summary operation
The model was inspired by, but is not strictly analogous to, the map & reduce functions of functional programming (see the miniature below)
The power of the model lies in the multi-threading capability that is its essential design feature
Some have criticized the limited problem set approachable by this technique
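The functional-programming ancestry is easy to see in miniature: Python's built-in map() and functools.reduce() compose the same transform-then-summarize pattern, though MapReduce itself adds keys, shuffling, and distribution on top of it:

```python
# The functional-programming ancestry in miniature: transform every item,
# then fold the results into one summary value.
from functools import reduce

values = [3, 1, 4, 1, 5, 9]
squared = map(lambda x: x * x, values)            # "Map": transform each item
total   = reduce(lambda a, b: a + b, squared)     # "Reduce": summarize
print(total)                                      # 133
```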
32. Data Architecture Designs
Hadoop File System (HDFS) - the data storage component of the open source Apache Hadoop Project
A massively distributed file system optimized for parallel processing
Stores any type of data - structured, semi-structured, & unstructured, e.g., email, social data, XML data, videos, audio files, photos, GPS, satellite images, sensor data, spreadsheets, web log data, mobile data, RFID tags, pdf docs
33. Data Architecture Designs
Minimally intrusive addition of Hadoop to the enterprise architecture
Data Staging Platform - employing the data processing power of Hadoop to process structured data
35. Data Architecture Designs
Processing structured & unstructured data with access via the EDW - preserving the classical data model
Processing structured & unstructured data with access via Hadoop - embracing the future data model
36. High Yield Areas for Use
Pharmacological Research
Genomic and Genetic Research
Psychiatry / Behavioral Health
Novel Sensors & Sensor Analysis Algorithms
Epidemiological Research
Much Talked About - Little in the Way of Concrete, Actionable Effects
37. Conclusion
“Things have never been more like the way they are today in history.”
Dwight D Eisenhower
“Things are more like they are now than they’ve ever been before.”
Gerald Ford
“Those who cannot remember the past are condemned to repeat it.”
George Santayana
38. Random Smattering of Articles
Bellaachia A, Guven E. Predicting breast cancer survivability using data mining techniques. Age 2006;58:10-110.
McKenna A, Hanna M, Banks E, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 2010;20(9):1297-1303.
Taylor RC. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics 2010;11(Suppl 12):S1.
Osborne JD, Flatow J, Holko M, et al. Annotating the human genome with disease ontology. BMC Genomics 2009;10(Suppl 1):S6.
Giardine B, Riemer C, Hardison RC, et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Research 2005;15(10):1451-1455.
Steinberg GB, Church BW, McCall CJ, Scott AB, Kalis BP. Novel predictive models for metabolic syndrome risk: a "big data" analytic approach. Am J Manag Care 2014;20(6):e221-8.
Vaitsis C, Nilsson G, Zary N. Big data in medical informatics: improving education through visual analytics. Stud Health Technol Inform 2014;205:1163-7.
Ross MK, Wei W, Ohno-Machado L. "Big data" and the electronic health record. Yearb Med Inform 2014;9(1):97-104. doi: 10.15265/IY-2014-0003.
Editor's Notes
This talk will discuss the concepts needed to understand the basic features of ‘big data’ and its analysis.
According to the author Douglas Adams, Big Data is vastly, hugely, mind-bogglingly big. Of course, the sophisticated members of my audience will recognize that Adams was referring to the Universe itself in this quote, not to that subset of the universe called ‘Big Data.’
This is a more conventional definition of Big Data. However, it doesn’t alter one whit the mind-boggling characteristic of Big Data captured in the previous slide’s definition.
This slide represents the most common data streams independent of knowledge domain that are incorporated into a ‘Big Data’ project.
This is the conventional and most commonly articulated ‘definition’ of Big Data.
The central story idea in The Hitchhiker’s Guide to the Galaxy revolves around the Earth as the universe’s most advanced computation device, designed by trans-dimensional beings to answer the Big Data query noted above. Current plans and expectations for Big Data, now early in its hype cycle, seem only slightly less ambitious than answering the ultimate question.
This illustration envisions Big Data as a data tsunami that feeds upon itself in an ever-expanding cycle of ever greater velocities, varieties, and quantities of data.
“There is a fifth dimension beyond that which is known to man. It is a dimension as vast as space and as timeless as infinity. It is the middle ground between light and shadow, between science and superstition, and it lies between the pit of man's fears and the summit of his knowledge. This is the dimension of imagination. It is an area which we call the Twilight Zone.”
While there is no 5th dimension to Big Data, its now-classical 3-dimensional representation is oftentimes augmented with the superposition of a 4th dimension, and to those of us interested in applying Big Data analytic products to the scientific practice of medicine it is the most important of all data dimensions: data validity/veracity. As noted in this slide, raw data can be incomplete in a multitude of ways.
Certainly the most exciting and potentially the most important dimension in the Big Data Tsunami is the free form mixture of structured and unstructured data elements.
The grandest unrealized challenge within the formal Grand Challenges of Medical Informatics (Sittig DF. Grand challenges in medical informatics? J Am Med Inform Assoc. 1994 Sep-Oct;1(5):412-413.) has been the field's struggle with the stubborn insistence of medical practitioners on using unstructured and oftentimes idiosyncratic formulations of diagnostic findings, hypotheses, diagnoses, and case summations.
The full-fledged pursuit of a unified controlled medical vocabulary, which has obsessed the field literally since its inception, seems doomed given the expanding and ever-accelerating volume of new knowledge and biomedical concepts discovered every calendar year.
Thus an analytical methodology able to deal efficiently with unstructured but relevant and germane data seems to make the unified-controlled-medical-vocabulary grand challenge, if not irrelevant, at least theoretically manageable.
The third line in this slide emphasizes that, unlike the futile and hopeless dream that one can formally structure all clinical data input, all that is required for a ‘Big Data’ analysis is that the interface for data exchange be well formulated and follow a relevant, pre-agreed data standard.
Conceptually the difference can be visualized by analogy with a fax machine. Rather than trying to specify all possible fax transmission messages by type with a unified nomenclature, all that is required is that fax transmission messages conform to a unified interface standard, so that fax machine A can exchange text data of any conceptual type with fax machine B without pre-knowledge of what type of content is being exchanged.
This slide emphasizes that articulation and specification of sampling frequency coupled with an accurate estimate of the costs associated with data sampling and storage are critical planning factors prior to developing and implementing a ‘Big Data’ project.
As such, in project management terms, specifying data velocity is in essence determining the scope of your project.
This slide emphasizes that while Data Volume is conceptualized as an independent element of the Data Tsunami, in fact Data Volume appears to be a linear function of the other two dimensions; i.e., if one can accurately specify the source of the data streams while simultaneously specifying the velocity of those data streams, then the data volume requirements for the project are uniquely and deterministically defined.
It has been estimated that by next year the average US hospital will generate a total of 2/3 of a petabyte of patient data of all types (predominantly video data), emphasizing the necessity of deploying ‘Big Data’ tools and techniques to tame the data tsunami that threatens to wash away the foundations of US healthcare.
Just as there are Grand Challenges for the field of medical informatics, there remain predictable challenges for ‘Big Data.’
The elephant in the room here seems to me to be the potential for privacy violations and compromise of HIPAA-mandated privacy laws and regulations, as well as of the bedrock ethical principle that patient-provider confidentiality is central to the medical encounter and must be preserved and safeguarded.
Set against these legal and ethical mandates has to be our understanding that, for the average layman, direct knowledge of ‘Big Data’ programs will probably be limited to the highly and recently publicized NSA programs such as Stellarwind and PRISM.
As such, ‘Big Data’ programs within the medical domain have to be meticulous & proactive in defining and describing their safeguards, so that data accumulation/manipulation/aggregation can occur at the same time that privacy and anonymity are guaranteed.
This slide details the classical data warehouse approach to knowledge discovery.
This flow diagram is taken from my own paper on knowledge discovery via use of the data warehouse (Bothner U, Meissner FW. Wissen aus medizinischen Datenbanken nutzen. Dt Arztebl 1998;95:A-1336-1338 (Heft 20)).
In many ways the exploration of ‘Big Data’ is identical in terms of the analytical tools involved in the analysis of the data set. Specifically, all the online analytical tools mentioned in the previous slide have been used for the analysis of Big Data sets.
However, one critical difference characterizes the analysis of Big Data sets. The analysis is done over the entire set of data, rather than over extracted data subsets. As such, any statistical analysis is done over the entire universe of discourse, rather than utilizing sampling sets as is done in conventional statistical analysis.
The principal take-home from this slide is the enormous cost efficiency of the Hadoop Distributed File System.
The analysis of ‘Big Data’ is facilitated by open source tools and techniques, which contribute to its cost effectiveness. The tools discussed above are used with Hadoop to provide a full-featured computing environment.
The relationships between these tools & the Hadoop Distributed File System are made explicit in this block diagram.
This slide emphasizes that in the current ‘new data’ world, the vast majority of data is unstructured and resistant to relational database techniques for organizing and analyzing the data.
In terms of compare and contrast, consider the relational database data model as illustrated above.
Now consider the data model for Hadoop. Instead of a relational structure to the data model, i.e., each data element characterized in relationship to other data elements and all related to a key field, the Hadoop model is intrinsically flat, and no predefined relationships are mandated on the data prior to data manipulation. The data is partitioned into defined blocks that are then distributed in a decentralized storage & computation schema.
This slide illustrates the classical distributed database model. Conceptually, database operations are visualized as state dependent processes with a limited behavioral repertoire (insert data field, update data field, delete data field) with a final commit behavior once the data field manipulation is completed in the absence of error. In case of error or failure, the database state is returned to its pre-operation state.
The Hadoop distributed database model is completely different. Each database operation is conceptualized as a ‘job’ with each job being divided into tasks by the Map-Reduce function. With each iteration of the job, either the task is reduced to a mapping function and the database job is concluded, or the task is further reduced to another sub-task and the process repeated until the task set is reduced to a mapping function.
While this slide might have been clearer placed before the last slide, this order was selected to allow compare and contrast with the relational distributed database model.
But in any case, at the highest level of system analysis, the Hadoop computing framework consists of the Hadoop Distributed File System, which is responsible for breaking even the largest data sets into definable and uniform computational chunks. Additionally, HDFS is responsible for establishing, at a minimum, triple redundancy on the data write operation.
The other layer of the framework is the MapReduce execution engine, which takes the data file blocks and further reduces file-sized manipulation requests into smaller, so-called task requests. The MapReduce function not only breaks the large data chunks into smaller tasks, it also tracks the tasks. In this way, optimal and maximal use of network CPU resources occurs.
To reiterate and for emphasis, Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Let us consider the performance of MapReduce in the setting of a simple word-count task. In this scenario, input files of a defined size are received from the Hadoop Distributed File System for processing. Once they arrive at the MapReduce engine, the mapper reduces the input data set into two smaller sets, each with half of the data instances seen in the original input; i.e., the mapper function divides the original task into two tasks that each contain 50% of the original data set. Once this has occurred, the mapping function attempts to map to a single type of datum within the processed dataset. If this operation fails to yield a data set of unitary elements, the data set is sorted and randomly shuffled. For the sake of illustration, this operation has resulted in different-sized sets of unitary elements; but in the real processing of large quantities of elements, such a sort-and-shuffle operation must take place many times until a unitary-element result occurs.
At this point, the sets are reduced to key-value pairs (fruit type, # of instances within the input data set). The key-value pairs then represent the final program outputs.
Now imagine this process occurring over a Petabyte data set and one can get a feel for the power of the MapReduce function.
Here is a more complicated MapReduce task. The goal is to take elements of two different datasets and join them into an integrated dataset.
As noted, the network functions much like any peer-to-peer distributed sharing system, such as those built on the BitTorrent protocol. The difference is that, in addition to sharing the data across the network, operations on the data are performed at the same network nodes that function as storage nodes.
Another way to look at MapReduce is as a 5-step parallel and distributed computation:
Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.
Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values A1.
"Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the A1…C8 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
Run the user-provided Reduce() code – Reduce() is run exactly once for each A1…C8 key value produced by the Map step.
Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by A1…C8 to produce the final outcome.
These five steps can be logically thought of as running in sequence – each step starts only after the previous step is completed – although in practice they can be interleaved as long as the final result is not affected.
MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
In summary,
"Map" step: Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
In addition to my appeal to the BitTorrent file sharing protocol as a means to understand MapReduce & the Hadoop File System, I am encouraging the audience to recall the SETI@home project, which was probably the first well-known example of massively parallel computing most laypersons have been exposed to.
In a similar way to the SETI system, Hadoop distributes data blocks within the Hadoop file sharing/information processing cluster, resulting in a massively parallel effort to process large data sets in the search for simple comparisons across those data sets, e.g., returning a list of similar books ordered by customers who have bought the book you just bought on amazon.com. This is a search result we all now take for granted, but conceptually we can now understand how it occurs in real time, without implementation of truly impossible relational database structures.
One of the ways that order is conferred on this very ad hoc file system is through both triple redundancy and ensuring all input files are ‘write once’ files, i.e., no modifications to input files are allowed, which ensures absolute data integrity.
This slide illustrates the high level systems architecture of HDFS.
The name node is a single node in the computing cluster that is responsible for keeping track of the file system metadata. It additionally keeps a list of all the blocks within HDFS as well as a list of all data nodes that host those blocks. I conceptualize the name node as analogous to a Domain Name Server in the TCP/IP protocol. Since it is a single point of failure in the system, it is provisioned with a resilient, highly available server.
The data nodes form a shared-nothing cluster of computers capable of executing the workload components of the system.
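To make that division of labor concrete, here is a toy rendering (all file, block, and node names are invented for illustration) in which the name node holds only metadata mapping blocks to the data nodes that store them, much like the DNS analogy above:

```python
# Toy rendering of the name node's role: it stores only metadata (which
# data nodes hold which blocks), never the data itself.
namenode = {
    "report.txt": {                 # file -> block -> replica locations
        "blk_0": ["datanode-1", "datanode-4", "datanode-7"],
        "blk_1": ["datanode-2", "datanode-4", "datanode-6"],
    }
}

def locate(filename: str, block: str) -> list[str]:
    """Ask the 'name node' where a block's replicas live."""
    return namenode[filename][block]

print(locate("report.txt", "blk_0"))   # the client then reads from any replica
```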
A reiteration and summarization of the past several slides.
Hadoop can be integrated into an enterprise-wide information system in various system configurations.
This slide contrasts the independent Enterprise Data Warehouse with a standalone Hadoop file system.
Hadoop can be integrated with the EDW (enterprise data warehouse) as a highly efficient distributed storage and data processing system for use with existing structured data sources.
Additionally, while leaving the enterprise data warehouse as the sole vehicle for analysis of data, Hadoop can function to add and process unstructured as well as structured data for the EDW.
Alternatively it can be used as an efficient data archive in which all enterprise data is archived and stored via Hadoop nodes.
In this configuration, the EDW remains the single point of entry to all the available data but Hadoop can be utilized by conventional analytical programs for the purpose of analysis of large data sets utilizing defined tools.
The final data architectural design utilizes Hadoop as the sole point of contact for all enterprise wide data and data analytics.
The point of these last few slides was to emphasize the flexibility of the Hadoop system, as well as to defeat the false dichotomy of either EDW or Hadoop.
In fact, Hadoop plays well with others.
This slide demonstrates both current and projected areas of Big Data efforts in the fields of Biomedicine.
Of course, given the enormous combinatorial complexity of Genomics research the application of Big Data techniques seems axiomatic.
Additionally, given the financial resources and development costs related to drug research, simulation and advanced analysis systems have the potential to dramatically reduce drug development costs. By way of analogy, the advent of modern ‘supercomputers’ was necessitated by treaty obligations that prevented all atomic weapons testing. Once the need for high-speed weapons-effects simulations became a national priority, high-speed computing became the focus of a technological revolution.
Not as obvious: but given that this type of computing (highly distributed, massively parallel) was pioneered by consumer-driven, web-based enterprises trying to understand ‘individual consumer choices,’ psychiatric and behavioral health analyses and applications seem as axiomatic as genomic or pharmacological applications.
Epidemiological research, by reason of the potential size of its data sets, also promises to yield significant insights from this computing methodology.
Novel sensor analysis seems to me a long-term benefit of this type of computational capacity. For example, while heart rate variability analysis has been a tool of cardiology for as long as my career, it has always been utilized in the isolated clinical case. Having massive amounts of heart rate data linked to personal activity logs and temporal data promises to yield dramatic insights into the areas of sudden cardiac death, chronotropic dependences of AMI, neurohumoral and temporal factors dictating the onset of atrial fibrillation, relationships between exercise and the onset of cardiac disease, etc.
While real results will be derived from this powerful new set of data manipulations, the reality is that we are on the ascending limb of the hype curve, and it is too soon to prognosticate whether this is an evolutionary or revolutionary change in computing methodology.