Keynote at the Open Data Science Conference, San Francisco, Nov 2015. It outlines the evolution of Data Science, likening it to the evolution of alchemy into chemistry, and explains Intel's motivations for releasing the Trusted Analytics Platform as open source.
2. Data science: Philosopher's stone
Data Science has grown from a tongue-in-cheek epithet (see “rocket science”) into a real profession. Data Scientists now have great power in enterprises. We hold the Philosopher's Stone that transforms raw data into intelligence. But with great power comes great responsibility.
For Data Science to evolve into a peer of physical sciences like chemistry, our community needs to help it develop the essential character of a Science: openness, methodological consistency, a substantive body of knowledge, reuse, reproducibility, open research questions, ethics, and professional responsibility.
Our team at Intel has been working on these issues, helping to evolve Data Science from alchemy to chemistry.
5. Transmutation of Data into Value
[Figure: Things (50 Billion) + Data (35 ZB) = Value. Labels on the figure: Personalized, Ubiquitous, Innovation; Revenue Growth, Cost Savings, Margin Gain; New Ventures, Higher Productivity, Greater Efficiency, Better Products, Engaged Customers, New Solutions.]
8. Science friction
Data Science:
• Iterative error-prone drudgery
• One-off, ad hoc models in isolation
Analytics Processing:
• Single-threaded, single-node processing
• Proprietary, fixed-function solutions
Application Code:
• Monolithic architecture
• Legacy components
From data science to big data analytics: Less alchemy, more chemistry
9. Trusted Analytics Platform
An open source software project to accelerate the creation of cloud-native apps driven by big data analytics. TAP provides a shared environment for app developers to collaborate with data scientists, making it easier to use advanced analytics on big data in the cloud.
11. Model building services
• Hypothesis Selection: Define inferential or predictive hypothesis
• Data Preparation: Join, filter, and cleanse data sets
• Model Training: Use ML to find β
• Model Evaluation: Accuracy measures, cross-validation
• Model Deployment: Run in scoring engine, track concept drift
• Application Integration: Invoke model via APIs
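The Model Training and Model Evaluation steps can be illustrated with a minimal sketch. It assumes the simplest possible case: a one-variable linear model whose coefficients β = (b0, b1) are found by ordinary least squares, scored with k-fold cross-validation. The function names `fit_beta` and `cross_validate` and the synthetic data are hypothetical and are not part of TAP.

```python
import random

def fit_beta(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def cross_validate(xs, ys, k=5, seed=0):
    """Return the mean squared error averaged over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Train on everything except the held-out fold.
        b0, b1 = fit_beta([xs[i] for i in train], [ys[i] for i in train])
        # Score on the held-out fold only.
        mse = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out) / len(held_out)
        errors.append(mse)
    return sum(errors) / k

# Synthetic data, made up for illustration: y = 2 + 3x with no noise.
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x for x in xs]
b0, b1 = fit_beta(xs, ys)
cv_mse = cross_validate(xs, ys)
```

On real data the cross-validated error, not the training error, is the number to compare across candidate models before deployment.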
13. Case study: patient readmission prediction at Penn Medicine
LDA-derived medication features led to a 15% improvement in accuracy.
[Figure: medication feature pipeline]
• Raw Medication Lists: data are noisy and sparse (42,358 features)
• Cleaned Medication Lists (text processing methods, regular expressions): data are less noisy, but sparse (23,663 features)
• LDA-derived Features: data are neither noisy nor sparse (20 features)
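The cleaning stage (text processing methods, regular expressions) can be sketched as follows. The slides do not show the actual rules used, so the patterns below are hypothetical: they strip dose amounts and a few common form and route words so that different spellings of the same drug collapse into one feature. The sample strings are made up for illustration.

```python
import re

# Hypothetical patterns: dose amounts with units, and common form/route words.
DOSE = re.compile(r"\b\d+(\.\d+)?\s*(mg|mcg|g|ml|units?)\b")
FORM = re.compile(r"\b(tablet|tab|capsule|cap|oral|po|daily|bid|tid)s?\b")
NON_ALPHA = re.compile(r"[^a-z\s]")

def clean_medication(entry):
    """Normalize one raw medication string to a bare lower-case name."""
    text = entry.lower()
    text = DOSE.sub(" ", text)       # drop "40 mg", "20mg", ...
    text = FORM.sub(" ", text)       # drop "oral", "tablet", "po", ...
    text = NON_ALPHA.sub(" ", text)  # drop leftover digits and punctuation
    return " ".join(text.split())    # collapse whitespace

raw = [
    "Furosemide 40 MG Oral Tablet",
    "furosemide 20mg tab",
    "LISINOPRIL 10 mg PO daily",
    "Lisinopril  5mg",
]
cleaned = [clean_medication(r) for r in raw]
vocabulary = sorted(set(cleaned))
```

Here four distinct raw strings reduce to two vocabulary entries, which is the same kind of reduction the feature counts above describe, at toy scale.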
Penn Medicine wants to identify and stratify heart failure patients at risk of re-admission within 30 or 90 days of discharge.
• Patient phenotype approach to risk classification
• Use of patient medication history
• Applying unsupervised text analytics algorithms, such as Latent Dirichlet Allocation (LDA), to model the relationship between medications and medical conditions
• Using this model with patient health records to identify high-risk patient profiles
• Evaluating individual patient risk of re-admission for new and existing patients
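To show how LDA turns a sparse medication list into a handful of dense features, here is a toy sketch using collapsed Gibbs sampling in plain Python. The slides do not say which LDA implementation or hyperparameters were used; the function `lda_gibbs`, the values of `alpha` and `beta`, and the medication lists are all assumptions made for illustration. The case study derived 20 features, while this example uses 2 topics to stay small.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # tokens in doc d assigned to topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word v is assigned to topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k
    assignments = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed topic proportions per document: the LDA-derived features.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        features.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return features

# Made-up cleaned medication lists, one per patient.
docs = [
    ["furosemide", "lisinopril", "carvedilol"],
    ["furosemide", "carvedilol", "spironolactone"],
    ["metformin", "insulin", "glipizide"],
    ["metformin", "insulin", "lisinopril"],
]
features = lda_gibbs(docs, n_topics=2)
```

Each patient is now described by `n_topics` numbers that sum to 1, regardless of vocabulary size, which is why the resulting features are dense rather than sparse. These vectors could then feed a downstream risk classifier.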