Lies, Damn Lies, and Big Data
Applications, Limitations, Misconceptions
Brian Bissett
Senior Member
Institute of Electrical and Electronics Engineers (IEEE)
May 23, 2015
Overview
What is Big Data
Common Attributes of Big Data
Challenges of Working with Big Data
Validity Space
Outliers
Variance
Correlation and Causality
Summary
What is “Big Data”?
Depends on whom you ask…
Gartner – defined by the “three Vs”: Volume, Velocity, and Variety.
Oracle - the derivation of value from traditional
relational database-driven business decision
making, augmented with new sources of
unstructured data.
Intel – organizations generating a median of 300 terabytes of data a week.
What is “Big Data”?
Microsoft - the process of applying serious computing
power—the latest in machine learning and artificial
intelligence—to seriously massive and often highly
complex sets of information.
The Method for an Integrated Knowledge Environment (MIKE) project argues that big data is not a function of size but of complexity. (A high degree of permutations and interactions within a data set defines big data.)
National Institute of Standards and Technology (NIST) -
big data “exceed(s) the capacity or capability of current
methods and systems.”
The Current 8 V’s of Big Data
Volume
Velocity
Variety
Value – is this worth something to someone?
Validity – is this correct?
Viability – can this stand independently?
Variability – is the same result reported consistently?
Verifiability – do we know where this came from?
The 5 P’s for Biomedical Big Data
Evidence-Based, Outcome-Driven, and Affordable Health Care will Require the Five P’s:
Predictive
Precise
Preventive
Personalized
Patient-Centric
The Cancer Genome Atlas (TCGA)
Challenges of Dealing with Big Data
Management – In 10 Years at Zettabyte Levels!
Infrastructure
Performance Analytics – TBD.
Unstructured – Lacks any Meaningful Standards.
Data Visualization – Humans see in 3D Only.
Navigation – Siloed Data is Difficult to Access.
Missing Data – Average of 30% in Health IT (HIT) Data.
Incorrect Data – Average of 25% - 30%.
The Three C’s (Challenges)
Collection
– is it worth saving?
– Value = Actionable
Consolidation
– Clean it up! "Not Collected Here"
Consumption
– Easy to Add Processors
– Difficult to move Data.
Transactions: Real Time & Queued
Real Time – must be done ASAP
– Retail: Credit Card Transactions
– Security: Is the Passenger on the “No Fly List”?
– NICS Checks for Firearms Purchases
– Stock Purchases
Queued – Everything else that can wait
– Traffic Data, process images from Traffic Cameras to
determine speed and volume.
– Daily Customer Counts
– Daily or Monthly Volume for Stock Transactions
When are the Conclusions Drawn
from Big Data Most Accurate?
Big Data is most reliable when working in Two- and sometimes Three-Dimensional Matrices.
Where the Question to be answered is Boolean (yes/no).
Where the Data Acquired is known to be of Good
Quality.
Example: Traffic Data at Checkpoint
– Record: Number of Cars, Time, Maybe Speed
– Derive: Is Traffic Flowing without Delay?
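The checkpoint example reduces to a Boolean derivation from simple counts. The sketch below is a toy version; the cutoff speed and the observations are invented purely for illustration:

```python
# Toy version of the checkpoint example: derive a Boolean answer
# ("is traffic flowing without delay?") from simple recorded counts.
# The threshold and the observations are invented for illustration.
FREE_FLOW_MIN_MPH = 45  # hypothetical cutoff for "no delay"

observations = [  # (hour, cars_counted, avg_speed_mph)
    (8, 410, 32),
    (11, 150, 58),
    (17, 460, 28),
]

for hour, cars, speed in observations:
    flowing = speed >= FREE_FLOW_MIN_MPH  # the Boolean conclusion
    print(f"{hour:02d}:00  {cars} cars at {speed} mph -> flowing: {flowing}")
```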
Big Data = Big Problems
The More Excess Data there is relative to Real Signal, the More Spurious Relationships appear.
Source: N.N. Taleb
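This effect is easy to demonstrate. The sketch below is a minimal simulation (the sample size, variable counts, and correlation threshold are arbitrary choices): it draws pure-noise variables and counts how many pairs clear a modest correlation threshold as the number of variables grows.

```python
# Minimal simulation of the point above: pure-noise variables, compared
# pairwise, produce more and more "strong" correlations as the number of
# variables grows. Every correlation found here is spurious by construction.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100  # observations per variable

for n_vars in (10, 100, 500):
    data = rng.standard_normal((n_samples, n_vars))
    corr = np.corrcoef(data, rowvar=False)      # n_vars x n_vars correlations
    upper = corr[np.triu_indices(n_vars, k=1)]  # each unique pair once
    spurious = np.sum(np.abs(upper) > 0.3)      # arbitrary "interesting" cutoff
    print(f"{n_vars:4d} noise variables -> {spurious} pairs with |r| > 0.3")
```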
Outliers: Goldmine or Nuisance
An Outlier can either be a Goldmine (the needle in
the haystack sought) or a Nuisance (an artifact to be
ignored)
Example: Lipinski’s Rule of 5 (Ro5)
16% of oral drugs violate at least one of the criteria,
and 6% fail two or more.
Billion Dollar Drugs that have failed the Ro5 criteria:
Lipitor, Singulair
Outliers: Goldmine or Nuisance
Example: Nuisance Outlier
The speed of the Motorcycle in no
way reflects the true speed of the
Traffic.
No rigid mathematical definition exists of what constitutes an Outlier, or of when an Outlier may be omitted from an analysis.
Mahalanobis Distance – the distance between a data point and the centroid (multivariate mean) of a data set, scaled by the data’s covariance. (Commonly used to flag outliers in Linear Regression.)
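As a concrete illustration, the sketch below flags an outlier by its Mahalanobis distance from the sample centroid. The data, the cluster parameters, and the chi-squared cutoff are all hypothetical choices for this example:

```python
# Sketch: flag outliers by Mahalanobis distance from the sample centroid.
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(loc=[2.0, 3.0], scale=0.2, size=(30, 2))  # well-behaved cluster
X = np.vstack([X, [9.0, 1.0]])                           # one suspicious point

centroid = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Under multivariate normality, squared Mahalanobis distances follow a
# chi-squared distribution with p degrees of freedom; the 97.5th
# percentile is a common rule-of-thumb cutoff.
cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))

for point in X[-3:]:  # check the last few points, including the suspicious one
    d = mahalanobis(point, centroid, cov_inv)
    print(point.round(2), f"d = {d:.2f}", "OUTLIER" if d > cutoff else "ok")
```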
Outliers – Bonedigger and Milo
Bonedigger the lion and Milo the sausage dog are inseparable. The friendship between an 11-pound wiener dog and a 500-pound lion is believed to be the only one of its kind.
Melanoma Example
Dealing with Variance
Impossible to Positively Discern without Biopsy
Melanoma ~ 80% Diagnostic Rate
with Current Image Algorithms
Because Melanoma can present in all Colors, Shapes, Granularities, and Textures, more Data is unlikely to improve Current Diagnostic Image Algorithms.
Sensitivity – Rule out Condition when Negative
= true positives/(true positives + false negatives)
80% Sensitive Test will Detect 8 out of 10 Cancers.
Specificity – Rule in Condition when Positive
= true negatives / (true negatives + false positives)
95% Specific Test -> False Positive rate of 5%
Sensitivity and Specificity are inversely related: tuning a test to raise one typically lowers the other.
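Both formulas above are trivial to compute from confusion-matrix counts. The sketch below uses invented counts chosen to reproduce the 80%/95% figures on this slide:

```python
# Sensitivity and specificity from confusion-matrix counts (invented data).
def sensitivity(tp, fn):
    # Fraction of actual positives the test detects ("rule out when negative").
    return tp / (tp + fn)

def specificity(tn, fp):
    # Fraction of actual negatives the test clears ("rule in when positive").
    return tn / (tn + fp)

# Hypothetical screen of 1,000 patients, 100 of whom have melanoma:
tp, fn = 80, 20    # 80% sensitive: detects 8 of 10 cancers
tn, fp = 855, 45   # 95% specific: 5% false-positive rate

print(f"sensitivity = {sensitivity(tp, fn):.0%}")  # 80%
print(f"specificity = {specificity(tn, fp):.0%}")  # 95%
```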
Variance – The Batch Effect
Batch Effects are endemic to High-Throughput technologies: they arise when measurements are affected by laboratory conditions, reagent lots, and personnel differences.
Pharmaceutical Mergers – particularly troubling when merging data sets from different labs.
Normalization for Batch Effects is extremely difficult (see the sketch below).
“What level is your pain on a scale from 1 to 10?”
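One naive way to see what batch normalization is trying to do is to standardize each batch against its own mean and spread. The sketch below does exactly that on invented numbers; real corrections (ComBat and similar methods) model batch effects far more carefully:

```python
# Simplistic per-batch standardization on invented measurements.
import numpy as np

measurements = np.array([10.2, 10.5, 9.9, 13.1, 13.4, 12.8])  # same assay
batches      = np.array(["A", "A", "A", "B", "B", "B"])        # two reagent lots

corrected = measurements.copy()
for b in np.unique(batches):
    mask = batches == b
    # center each batch on its own mean, rescale by its own spread
    corrected[mask] = (measurements[mask] - measurements[mask].mean()) \
                      / measurements[mask].std()

print(corrected.round(2))  # batch-level offset is gone; within-batch order kept
```

Note what the pain-scale question illustrates: when the “instrument” is a person, there is no reagent lot to normalize against.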
Qualitative Variance
Massachusetts General Hospital / Harvard Medical School investigated discrepancy rates in the interpretation of Radiology Films.
Each radiologist reviewed 60 examinations: 30 previously interpreted by themselves and 30 interpreted by their peers.
Interobserver Disagreement Rate = 26%.
Intraobserver Disagreement Rate = 32%.
Radiologists agreed with other Radiologists more than with themselves.
Correlation vs. Causation
Correlation is easy to prove, and the strength of a Correlation is easy to quantify:
R² = 1.0 – Perfect Correlation.
R² = 0.0 – No Correlation.
Causation is nearly Impossible to Prove.
US Spending on Science, Space, and Technology correlates Nearly Perfectly (R² = 0.99208) with Suicides by Hanging, Strangulation and Suffocation.
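Computing the correlation itself is a one-liner, which is exactly the trap. In the sketch below the two series are invented, but both simply trend upward over time, so they correlate almost perfectly despite having nothing to do with each other:

```python
# Two invented series that share nothing but an upward trend over ten years.
import numpy as np

spending = 18.0 + 0.9 * np.arange(10) + np.array(
    [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1])
suicides = 5400 + 210 * np.arange(10) + np.array(
    [30, -40, 10, 50, -20, 40, -60, 20, 0, -10])

r = np.corrcoef(spending, suicides)[0, 1]
print(f"r = {r:.4f}, R^2 = {r*r:.4f}")  # near 1.0: a textbook spurious correlation
```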
Bradford Hill Causality Criteria
Strong – a Five- or Ten-Fold Increase
Consistent – Holds across Populations and Time
Specific – A Link (a location, mechanism, etc.)
Temporal – the Cause Precedes the Effect
Gradient – Association Increases with Exposure
Plausible – A Credible Mechanism Exists
Coherent – Experimental Evidence Supports the Association
Analogous – Similar Behavior in Analogous Situations
Big Data Governance Does not Exist
No laws specifically address the utilization of big data.
Concerns about citizen privacy and business liability have yet to be addressed.
Critical Challenge to the Federal Government.
Federal Agencies that Utilize Big Data do so on an
ad-hoc basis.
Little guidance exists on using petabytes of private citizen data for predictive analytics.
– The governing statutes remain the Privacy Act of 1974 and HIPAA (1996).
Hierarchy of Evidence
(Figure: the evidence pyramid – Systematic Reviews and Meta-Analyses at the top; Expert Opinion at the bottom.)
Big Data = Observational Study
Data is not Collected to Examine a Specific
Problem using a Protocol.
The Treatment Group and the Control Group are
outside the control of the Investigator.
Groups Differing in Outcome are identified and
compared on the basis of a supposed causal
attribute.
Longitudinal - repeated observations of the same
variables over long periods of time.
Summary
The World is Accumulating a Lot of Data.
Nobody Agrees on What “Big” is.
On Average, 30% of the Data is Incorrect.
On Average, 30% of the Data is Missing.
Correlation is the Easy Part.
Bradford Hill gives Guidance on Proving Causation.
There is a Hierarchy of Evidence, and Expert Opinion and Big Data sit at the bottom of it.
Selected Publications
Automated Data Analysis with Excel
– Softcover: 442 Pages
– Chapman & Hall (June 2007)
– Second Edition Coming in 2016
– ISBN: 1-58488-885-7
Practical Pharmaceutical Laboratory
Automation
– Hardcover: 464 pages
– Publisher: CRC Press (May 2003)
– ISBN: 0-8493-1814-9
References
Bickerton GR, Paolini GV, Besnard J, Muresan S, Hopkins AL. Quantifying the chemical beauty of
drugs. Nat Chem. 2012;4:90–98. doi: 10.1038/nchem.1243.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3524573/
The Big Data Conundrum: How to Define It? http://www.technologyreview.com/view/519851/the-
big-data-conundrum-how-to-define-it/
Abujudeh HH, Boland GW, Kaewlai R, et al. Abdominal and Pelvic Computed Tomography (CT) Interpretation: discrepancy rates among experienced radiologists. Eur Radiol. 2010;20(8):1952-7.
Maryam Ramezani, Alireza Karimian, and Payman Moallem. Automatic Detection of Malignant
Melanoma using Macroscopic Images. J Med Signals Sens. 2014 Oct-Dec; 4(4): 281–290. PMCID:
PMC4236807

Editor's Notes

  • #3 Lies, Damn Lies, and Big Data: How to Best Utilize Data to Drive Decisions. Brian Bissett, Senior Member, Institute of Electrical and Electronics Engineers. Big Data is hailed as the solution to many problems in industry. In many respects this is a fallacy, because it only takes a small amount of erroneous data to corrupt the usefulness of a large dataset. While Big Data can be extremely useful in predicting patterns for the masses, such as traffic patterns and peak usage hours for a utility, its usefulness begins to diminish in situations where quality is more important than quantity. In addition, the underlying assumption of Big Data (that the behavior of the masses is the correct course of action) is not always true. The audience will gain an appreciation for how best to utilize data to drive decisions. Common fallacies will be addressed, including the notion that Big Data sets are always superior to smaller data sets. The limitations of big data sets, the importance of quality data, effective display of quantitative information, boundary conditions, and the evaluation of quantitative and qualitative factors will all be discussed.
  • #6 Variability – At the doctor: “What is your level of pain from 1 to 10?”
  • #8 1 ZB = 1 billion TB, or 10^21 bytes. How do you store it? How do you visualize more than 3 parameters? How do you effectively query something with so many parameters?
  • #14 Lane splitting.
  • #19 Interobserver: between two different radiologists. Intraobserver: disagreeing with oneself. Pathologists are worse, with a 40% discrepancy rate with each other.