Data Science: Philosopher's Stone

Philosopher’s Stone
Open Data Science Conference, San Francisco
November 2015
Vin Sharma
@ciphr | vin.sharma@intel.com

Data Science: Philosopher’s Stone
Data Science has grown from a tongue-in-cheek epithet (see “rocket science”) into a real profession. Data Scientists now have great power in enterprises. We hold the Philosopher's Stone that transforms raw data into intelligence. But with great power comes great responsibility.
For Data Science to evolve into a peer of physical sciences like chemistry, our community needs to help it develop the essential character of a science: openness, methodological consistency, a substantive body of knowledge, reuse, reproducibility, open research questions, ethics, and professional responsibility.
Our team at Intel has been working on these issues, helping Data Science evolve from alchemy to chemistry.

Transmutation of Data into Value
Things (50 billion) + Data (35 ZB) = Value (revenue growth, cost savings, margin gain)

Transmutation of Data into Value
Things (50 billion) + Data (35 ZB) = Value (revenue growth, cost savings, margin gain)
Figure labels (Value, Innovation): personalized, ubiquitous, new ventures, higher productivity, greater efficiency, better products, engaged customers, new solutions

Delays and Detours
Things (50 billion) + Data (35 ZB) = Value (revenue growth, cost savings, margin gain)
No trust, no insight, no proof:
• Fail to scale
• Fail to secure
• Fail to show ROI
• Lack of use cases
• Scarcity of skills
• Complexity of systems

Concept Solutions
Use cases by solution:
• IoT Developer Platform: maker solutions on Intel® Galileo and Intel® Edison
• Wearables Developer Platform: customer device usage analyses for fashion watch ODM
• Parkinson’s Research Platform: disease progression tracking via sensors
• Retail Analytics Solutions: RFID-based inventory tracking; social media based demand forecasting
• Power Distribution Analytics: grid overlay network data analysis
• Digital Oil Field: preventive maintenance for oil field assets
• Population Genomics: compare the anonymized genome data of a local patient with genome data in public data sets

Science Friction
Data Science:
• Iterative error-prone drudgery
• One-off, ad hoc models in isolation
Analytics Processing:
• Single-threaded, single-node processing
• Proprietary, fixed-function solutions
Application Code:
• Monolithic architecture
• Legacy components
From data science to big data analytics: Less alchemy, more chemistry

Trusted Analytics Platform
Trusted Analytics Platform (TAP) is an open source software project to accelerate the creation of cloud-native apps driven by big data analytics. TAP provides a shared environment for app developers to collaborate with data scientists, making it easier to use advanced analytics on big data in the cloud.

Trusted Analytics Platform
• Connectors: message brokers and queues (Kafka, RabbitMQ, MQTT, WS, REST…)
• Processors: stream and batch (Hadoop, Spark, GearPump…)
• Stores: polyglot persistence (HDFS, HBase, PostgreSQL, MySQL, Redis, MongoDB, InfluxDB, Objectivity, etc…)
• Models: develop, train, evaluate, deploy models as services (Intel, DataRobot, DL4J, H2O)
• Runtimes: polyglot app runtime (Python, R, Java, Scala, Go…); develop, test, push applications; manage lifecycle
• Manage: orchestration, telemetry, security
• Users: Data Scientist (develop, deploy), App Developer, System Operator
• Deployment: Infrastructure (IaaS), Appliance

Model Building Services
• Data Preparation: join, filter, and cleanse data sets
• Model Evaluation: accuracy measures, cross-validation
• Application Integration: invoke model via APIs
• Hypothesis Selection: define inferential or predictive hypothesis
• Model Training: use ML to find β
• Model Deployment: run in scoring engine, track concept drift
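A minimal sketch of three of the steps above (training to find β, evaluation by cross-validation, and a drift check after deployment), assuming a one-feature linear model and synthetic data. The function names, the fold assignment, and the 2x error threshold for drift are illustrative assumptions, not part of TAP.

```python
import random


def fit_beta(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def mean_squared_error(beta, xs, ys):
    """Average squared difference between predictions and observations."""
    b0, b1 = beta
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, k=5):
    """k-fold CV: train on k-1 folds, score on the held-out fold, average."""
    scores = []
    for fold in range(k):
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        test_x = [x for i, x in enumerate(xs) if i % k == fold]
        test_y = [y for i, y in enumerate(ys) if i % k == fold]
        beta = fit_beta(train_x, train_y)
        scores.append(mean_squared_error(beta, test_x, test_y))
    return sum(scores) / k


def drift_detected(beta, recent_xs, recent_ys, baseline_mse, tolerance=2.0):
    """Flag drift when error on recent data exceeds tolerance * baseline.

    The threshold rule is a hypothetical choice for illustration.
    """
    return mean_squared_error(beta, recent_xs, recent_ys) > tolerance * baseline_mse


# Synthetic data (illustrative only): y = 2 + 3x plus Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 0.5) for x in xs]

beta = fit_beta(xs, ys)                 # model training: find beta
cv_mse = cross_validate(xs, ys, k=5)    # model evaluation

# After deployment, simulate data from a changed relationship: y = 5 + x.
drifted_ys = [5 + x + rng.gauss(0, 0.5) for x in xs]
same = drift_detected(beta, xs, ys, cv_mse)
shifted = drift_detected(beta, xs, drifted_ys, cv_mse)
```

Here `same` checks the model against the data it was trained on, and `shifted` checks it against data whose underlying relationship has changed. A production scoring engine would more likely monitor a rolling window of recent predictions.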

Case Study: Patient Readmission Prediction at Penn Medicine
LDA-derived medication features led to 15% improvement in accuracy
• Raw medication lists: 42,358 features; data are noisy and sparse
• Cleaned medication lists (text processing methods, regular expressions): 23,663 features; data are less noisy, but sparse
• LDA-derived features: 20 features; data are neither noisy nor sparse
Penn Medicine wants to identify and stratify heart failure patients
at risk of re-admission within 30 or 90 days of discharge.
• Patient phenotype approach to risk classification
• Use of patient medication history
• Applying unsupervised text analytics algorithms, such as
Latent Dirichlet Allocation (LDA), to model relationship
between medications and medical conditions
• Using this model with patient health records to identify high-
risk patient profiles
• Evaluating individual patient risk of re-admission for new and
existing patients
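The two feature-reduction steps described in this case study (regular-expression cleaning, then LDA) can be sketched roughly as follows. This is not Penn Medicine's pipeline: the medication strings are made up, the cleaning rules and the `clean_medication` and `lda_gibbs` helpers are hypothetical, and the toy run uses 2 topics rather than the 20 LDA-derived features reported on the slide. LDA is fitted here with a plain collapsed Gibbs sampler so the example needs only the standard library.

```python
import random
import re


def clean_medication(entry):
    """Lowercase, drop dose/unit tokens, punctuation and form words."""
    text = entry.lower()
    text = re.sub(r"\d+(\.\d+)?\s*(mg|mcg|ml|g|units?)\b", " ", text)  # doses
    text = re.sub(r"[^a-z\s]", " ", text)  # leftover digits and punctuation
    text = re.sub(r"\b(tablet|tab|capsule|cap|oral|daily)\b", " ", text)
    return " ".join(text.split())


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic mixes."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]              # counts per doc
    topic_word = [[0] * n_vocab for _ in range(n_topics)]   # counts per topic
    topic_total = [0] * n_topics
    assignments = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                doc_topic[d][k] -= 1   # remove the token's current assignment
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k  # resample and restore the counts
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


# Made-up medication lists, one per hypothetical patient.
raw_lists = [
    ["Furosemide 40 MG Tablet", "Lisinopril 10mg tab, oral", "Carvedilol 25 mg"],
    ["Metformin 500 mg tablet", "Insulin glargine 10 units", "Lisinopril 20 mg"],
    ["Furosemide 20 mg", "Carvedilol 12.5 mg tablet", "Spironolactone 25 mg"],
    ["Metformin 1000 mg", "Glipizide 5 mg tab", "Insulin glargine 20 units daily"],
]
docs = [[tok for entry in meds for tok in clean_medication(entry).split()]
        for meds in raw_lists]
features = lda_gibbs(docs, n_topics=2)  # one short dense vector per patient
```

Each row of `features` is a dense, low-dimensional description of one patient's medication list, which is the kind of input a downstream readmission classifier could use in place of thousands of sparse medication indicators.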

Vin Sharma / @ciphr / vin.sharma@intel.com
