Lies, Damn Lies, and Big Data
1. Lies, Damn Lies, and Big Data
Applications, Limitations, Misconceptions
Brian Bissett
Senior Member
Institute of Electrical and Electronics Engineers (IEEE)
5/23/2015
2. Overview
What is Big Data
Common Attributes of Big Data
Challenges of Working with Big Data
Validity Space
Outliers
Variance
Correlation and Causality
Summary
3. What is “Big Data”?
Depends who you ask. . . .
Gartner – defined by the “three Vs”: Volume,
Velocity and Variety.
Oracle - the derivation of value from traditional
relational database-driven business decision
making, augmented with new sources of
unstructured data.
Intel – organizations generating a median of 300
terabytes of data a week.
4. What is “Big Data”?
Microsoft - the process of applying serious computing
power—the latest in machine learning and artificial
intelligence—to seriously massive and often highly
complex sets of information.
The Method for an Integrated Knowledge Environment
(MIKE) project argues that big data is not a function of
the size but of complexity. (A high degree of
permutations and interactions within a data set defines
big data.)
National Institute of Standards and Technology (NIST) -
big data “exceed(s) the capacity or capability of current
methods and systems.”
5. The Current 8 V’s of Big Data
Volume
Velocity
Variety
Value – is this worth something to someone?
Validity – is this correct?
Viability – can this stand independently?
Variability – is the same result reported consistently?
Verifiability – do we know where this came from?
6. The 5 P’s for Biomedical Big Data
Evidence Based, Outcome Driven, and Affordable
Health Care will Require the Five P’s:
Predictive
Precise
Preventive
Personalized
Patient-Centric
The Cancer Genome Atlas (TCGA)
7. Challenges of Dealing with Big Data
Management – In 10 Years at Zettabyte Levels!
Infrastructure
Performance Analytics – TBD.
Unstructured – Lacks any Meaningful Standards.
Data Visualization – Humans see in 3D Only.
Navigation – Siloed Data is Difficult to Access.
Missing Data – Average of 30% from HIT Data.
Incorrect Data – Average of 25% - 30%.
8. The Three C’s (Challenges)
Collection
– is it worth saving?
– Value = Actionable
Consolidation
– Clean it up! "Not Collected Here"
Consumption
– Easy to Add Processors
– Difficult to move Data.
9. Transactions: Real Time & Queued
Real Time – must be done ASAP
– Retail: Credit Card Transactions
– Security: Is Passenger on the “no fly list”
– NICS Checks for Firearms Purchases
– Stock Purchases
Queued – Everything else that can wait
– Traffic Data, process images from Traffic Cameras to
determine speed and volume.
– Daily Customer Counts
– Daily or Monthly Volume for Stock Transactions
10. When are the Conclusions Drawn
from Big Data Most Accurate?
Big Data is most reliable when working in Two and
sometimes Three Dimensional Matrices.
Where the Assumption to be derived is Boolean.
Where the Data Acquired is known to be of Good
Quality.
Example: Traffic Data at Checkpoint
– Record: Number of Cars, Time, Maybe Speed
– Derive: Is Traffic Flowing without Delay?
11. Big Data = Big Problems
More Excess Data as Compared to Real Signals =
More Spurious Relationships.
Source: N.N. Taleb
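The claim above can be illustrated with a small simulation. This is a minimal sketch, not the source's own analysis: the function names, the sample size of 20 observations, and the |r| > 0.5 cutoff are all assumptions chosen for illustration. Every variable is pure noise, so any pair that clears the cutoff is spurious by construction.

```python
import random
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious_pairs(n_variables, n_observations=20, threshold=0.5, seed=0):
    """Count pairs of pure-noise variables whose |r| exceeds the threshold.

    The threshold and sample size are illustrative assumptions.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_observations)]
               for _ in range(n_variables)]
    return sum(1 for a, b in combinations(columns, 2)
               if abs(pearson_r(a, b)) > threshold)


# More noise variables means more candidate pairs, so more chance "relationships".
for k in (10, 50, 100):
    print(k, "noise variables ->", count_spurious_pairs(k), "spurious pairs")
```

The number of candidate pairs grows roughly with the square of the number of variables, while the number of observations stays fixed, which is why the count of chance correlations climbs.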
12. Outliers: Goldmine or Nuisance
An Outlier can either be a Goldmine (the needle in
the haystack sought) or a Nuisance (an artifact to be
ignored)
Example: Lipinski’s Rule of 5 (Ro5)
16% of oral drugs violate at least one of the criteria,
and 6% fail two or more.
Billion Dollar Drugs that have failed the Ro5 criteria:
Lipitor, Singulair
13. Outliers: Goldmine or Nuisance
Example: Nuisance Outlier
The speed of the Motorcycle in no
way reflects the true speed of the
Traffic.
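A minimal sketch of why this outlier is a nuisance, using made-up checkpoint readings (the speeds below are hypothetical, not data from the talk): one fast motorcycle pulls the mean upward, while the median still describes the traffic.

```python
from statistics import mean, median

# Hypothetical speeds (mph) recorded at a checkpoint in congested traffic.
car_speeds = [12, 14, 11, 13, 15, 12, 14, 13]
motorcycle_speed = 45  # hypothetical lane-splitting motorcycle: the nuisance outlier

all_speeds = car_speeds + [motorcycle_speed]

print("mean without motorcycle:", mean(car_speeds))
print("mean with motorcycle:   ", round(mean(all_speeds), 1))
print("median with motorcycle: ", median(all_speeds))
```

With these numbers the median stays at 13 mph while the mean rises to about 16.6 mph, so a robust statistic is one way to keep such an artifact from distorting the conclusion.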
No rigid mathematical definition exists of what
constitutes an outlier, or when an Outlier may be
omitted from an analysis.
Mahalanobis Distance - distance between data point
and a multivariate space's centroid (overall mean).
(Commonly used in Linear Regression)
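The Mahalanobis distance described above can be sketched for the two-variable case. This is an illustrative implementation under stated assumptions: the function name `mahalanobis_2d` and the sample readings are hypothetical, and the 2x2 covariance matrix is inverted by hand to keep the example dependency-free. Where to draw the cutoff for calling a point an outlier remains a judgment call, as the slide notes.

```python
def mahalanobis_2d(point, data):
    """Mahalanobis distance from `point` to the centroid of 2-D `data`."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx = point[0] - mean_x
    dy = point[1] - mean_y
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, written out for the 2x2 case.
    d_squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d_squared ** 0.5


# Hypothetical paired readings, e.g. (speed, following gap).
readings = [(55, 30), (60, 34), (58, 31), (62, 36), (57, 32), (61, 35)]

print("typical point:", mahalanobis_2d((59, 33), readings))
print("unusual point:", mahalanobis_2d((59, 45), readings))
```

Unlike plain Euclidean distance, this measure accounts for the spread and correlation of the variables, so a point can be flagged even when each coordinate looks ordinary on its own.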
14. Outliers – Bonedigger and Milo
Bonedigger the lion and Milo the sausage dog are inseparable. The
friendship between an 11-pound wiener dog and a 500-pound lion is
the only one ever seen in the world.
16. Melanoma ~ 80% Diagnostic Rate
with Current Image Algorithms
Because Melanoma can present in all Colors, Shapes,
Granularities, and Textures; More Data is unlikely to
improve Current Diagnostic Image algorithms.
Sensitivity – Rule out Condition when Negative
= true positives/(true positives + false negatives)
80% Sensitive Test will Detect 8 out of 10 Cancers.
Specificity – Rule in Condition when Positive
=true negatives/(true negatives + false positives)
95% Specific Test -> False Positive rate of 5%
Sensitivity and specificity trade off: raising one generally lowers the other.
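The two formulas on this slide can be written out directly. A minimal sketch: the function names and the lesion counts are hypothetical, chosen only to reproduce the 80% and 95% figures quoted above.

```python
def sensitivity(true_positives, false_negatives):
    """Share of actual cases that the test detects."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Share of non-cases that the test correctly clears."""
    return true_negatives / (true_negatives + false_positives)


def false_positive_rate(false_positives, true_negatives):
    """Share of non-cases that the test wrongly flags (1 - specificity)."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical counts: 10 melanomas (8 detected), 100 benign lesions (95 cleared).
print(sensitivity(8, 2))          # 0.8
print(specificity(95, 5))         # 0.95
print(false_positive_rate(5, 95)) # 0.05
```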
17. Variance – The Batch Effect
High-throughput technologies.
Batch Effects occur when measurements are affected by
laboratory conditions, reagent lots, and personnel
differences.
Pharmaceutical Mergers - Particularly troubling
when merging data sets from different labs.
Normalization for Batch Effects is extremely
difficult.
“What level is your pain on a scale from 1 to 10?”
18. Qualitative Variance
Massachusetts General Hospital / Harvard Medical
School investigated discrepancy rates for the
interpretation of Radiology Films.
Each Radiologist reinterpreted 60 examinations: 30 previously
interpreted by themselves and 30 interpreted by their peers.
Interobserver Disagreement Rate = 26%.
Intraobserver Disagreement Rate = 32%.
Radiologists agreed with other Radiologists more
than themselves.
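A disagreement rate of this kind is simply the fraction of examinations whose two interpretations differ. The sketch below assumes each read is reduced to a single label; the function name and the sample labels are hypothetical and are not the study's data.

```python
def disagreement_rate(first_reads, second_reads):
    """Fraction of examinations whose two interpretations differ."""
    if len(first_reads) != len(second_reads):
        raise ValueError("both lists must cover the same examinations")
    differing = sum(1 for a, b in zip(first_reads, second_reads) if a != b)
    return differing / len(first_reads)


# Hypothetical labels for four examinations read twice.
original = ["normal", "mass", "normal", "cyst"]
reread = ["normal", "normal", "normal", "cyst"]

print(disagreement_rate(original, reread))  # 0.25
```

Comparing a radiologist's reread against their own earlier report gives the intraobserver rate; comparing it against a peer's report gives the interobserver rate.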
19. Correlation vs. Causation
A Correlation is Easy to Demonstrate.
The Strength of a Correlation is Easy to Quantify.
R² = 1.0 – Perfect Correlation.
R² = 0.0 – No Correlation.
Causation is nearly Impossible to Prove.
US Spending on Science, Space, and Technology
correlates Nearly Perfectly (R² = 0.99208) with
Suicides by Hanging, Strangulation and Suffocation.
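A short sketch of how easily a high value arises: any two series that both trend upward over the same years will score close to 1.0, whether or not they are related. The figures below are invented for illustration and are not the actual spending or mortality data; the function name is likewise an assumption.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (var_x * var_y)


# Invented yearly values for two unrelated quantities that both rise over time.
series_a = [18, 19, 20, 21, 23, 24]
series_b = [5400, 5700, 6000, 6200, 6600, 7000]

print("R^2 =", round(r_squared(series_a, series_b), 3))
```

The calculation says nothing about mechanism, which is why a near-perfect value is no evidence of causation.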
20. Bradford Hill Causality Proof
Strong – Five or Ten Fold Increase
Consistent – Populations or Time do not Affect
Specific – A Link (a location, mechanism, etc.)
Temporal – Cause Precedes Effect
Gradient - Association Increases with Exposure
Plausible – Association Easily Seen
Coherent – Experimental Evidence Supports
Analogous – Similar Behavior in Analogous Situations
21. Big Data Governance Does not Exist
No laws exist to address the utilization of big data.
Concerns about citizen privacy and business liability
have yet to be addressed.
Critical Challenge to the Federal Government.
Federal Agencies that Utilize Big Data do so on an
ad-hoc basis.
Little guidance exists on using petabyte sizes of
private citizen data for predictive analytics.
– Privacy Act of 1974 and HIPAA 1996.
23. Big Data = Observational Study
Data is not Collected to Examine a Specific
Problem using a Protocol.
The Treatment Group and the Control Group are
outside the control of the Investigator.
Groups Differing in Outcome are identified and
compared on the basis of a supposed causal
attribute.
Longitudinal - repeated observations of the same
variables over long periods of time.
24. Summary
The World is Accumulating a Lot of Data.
Nobody Agrees on What “Big” is.
On Average, 25% to 30% of the Data is Incorrect.
On Average, 30% of the Data is Missing.
Correlation is the Easy Part.
Bradford Hill gives Guidance on Proving Causation.
There is a Hierarchy of Evidence; Expert Opinion
and Big Data are at the bottom of it.
25. Selected Publications
Automated Data Analysis with Excel
– Softcover: 442 Pages
– Chapman & Hall (June 2007)
– Second Edition Coming in 2016
– ISBN: 1-58488-885-7
Practical Pharmaceutical Laboratory
Automation
– Hardcover: 464 pages
– Publisher: CRC Press (May 2003)
– ISBN: 0849318149
26. References
Bickerton GR, Paolini GV, Besnard J, Muresan S, Hopkins AL. Quantifying the chemical beauty of drugs. Nat Chem. 2012;4:90–98. doi: 10.1038/nchem.1243. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3524573/
The Big Data Conundrum: How to Define It? http://www.technologyreview.com/view/519851/the-big-data-conundrum-how-to-define-it/
Abujudeh HH, Boland GW, Kaewlai R, et al. Abdominal and pelvic computed tomography (CT) interpretation: discrepancy rates among experienced radiologists. Eur Radiol. 2010;20(8):1952–7.
Ramezani M, Karimian A, Moallem P. Automatic detection of malignant melanoma using macroscopic images. J Med Signals Sens. 2014 Oct-Dec;4(4):281–290. PMCID: PMC4236807
Editor's Notes
1:55 Lies, Damn Lies, and Big Data: How to Best Utilize Data to Drive Decisions
Brian Bissett, Senior Member, Institute of Electrical and Electronics Engineers
Big Data is hailed as the solution to many problems in industry. In many respects this is a fallacy because it only takes a small amount of erroneous data to corrupt the usefulness of a large dataset. While Big Data can be extremely useful in predicting patterns for the masses such as traffic patterns and peak usage hours for a utility, its usefulness begins to diminish in situations where quality is more important than quantity. In addition, the underlying assumption of Big Data that the behavior of the masses is the correct course of action is not always true. The audience will gain an appreciation for how to best utilize data to drive decisions. Common fallacies will be addressed including the notion that Big Data sets are always superior to smaller data sets. The limitations of big data sets, the importance of quality data, effective display of quantitative information, boundary conditions, and the evaluation of quantitative and qualitative factors will all be discussed.
Variability – Go to Doctor – What is your level of Pain from 1 to 10.
1 ZB = 1 Billion TB or 10^21 bytes
How do you store it?
How do you Visualize on more than 3 Parameters?
How do you effectively query on something with so many parameters?
Lanesplitting.
interobserver (between two different radiologists)
intraobserver (disagreeing with one’s self)
Pathologists are worse: a 40% discrepancy rate with each other.