THE DATA UNIFICATION IMPERATIVE
ANDY PALMER | CO-FOUNDER, TAMR
BACKGROUND
Career is a mashup of:
start-ups + enterprise
customer + vendor
data + application
technical + business
HEALTHCARE INVESTMENTS
HUGE INVESTMENT IN ENTERPRISE IT & BIG DATA
Companies invested $3-4 trillion in IT over the last 20+ years
And now are investing billions in “Big Data” and Analytics 3.0...
DIRTY LITTLE SECRET: DATA VARIETY IN ENTERPRISE
Most investments oriented towards
some “silo” in the enterprise
● application
● function
● division
● geography
Data tied to these investments
is extremely siloed
BIG DATA & ANALYTICS NEED CLEAN + UNIFIED DATA
“Consider the more than $44 billion projected by Gartner to be spent on big data in
2014. The vast majority of it — $37.4 billion — is going to IT services. Enterprise software
only accounts for about a tenth. The disproportionate spending on services is a sign of
immaturity in how we manage data.” - Mahesh S. Kumar, Harvard Business Review
TACKLING THE ENTERPRISE DATA SILO PROBLEM
All are necessary but not sufficient to truly address next-gen challenges
● Democratized visualization and modeling - radical consumption heterogeneity
● Semantic Web / Linked Data - radical source heterogeneity
● Provenance for data to improve reliability
● Rapid iteration/change requires reproducibility from source
● Desire for longitudinal data across many entities
● Need for automated data quality / assurance
Traditional approaches...
● Standardization - worth trying
● Aggregation - yes - but actually makes the problem worse
● Top-down modeling (MDM/ETL) - ok for app-specific or well-defined data
THE MYTH OF THE SINGLE TECH VENDOR SOLUTION
“Use my brand and data unification will just happen!”
REALLY?
HEALTHCARE/BIOPHARMA IS THE FRONT LINE
The diversity of data and
decentralized nature of healthcare
and specifically biopharmaceutical
research make our industry the
place where next gen data
management will develop.
TABULAR DATA IS KEY ASSET
But it’s messy ...
CURATION AT SCALE
Hiring More Data Scientists Makes the Problem Worse
Reality / Enterprise Reality / Goal
• Manual data collection
and preparation
• Long lead time to
analyses
• Limited individual view
on variety of data
• Extensive rework
• No cohesive view of
data efforts
• Expertise across
organization underutilized
NEW TOOLS ARE NECESSARY
New transformation tools are necessary… but not sufficient to
solve the enterprise data variety problem
Unified View
A few sources...
Thousands of sources
SOLUTION: BOTTOM-UP, PROBABILISTIC DATA MODELING & “COLLABORATIVE CURATION”
Time to embrace the reality of extreme data variety
across the entire enterprise - “Unified Data”
Back to the future
● 1990s web: probabilistic search / website connection
● 2020s enterprise: probabilistic data source connection
& curation
Requires a bottom-up, probabilistic and collaborative
approach to data (complements deterministic)
● Rules for transformation are necessary but not sufficient
to solve the broad problem of enterprise-wide integration
● Mix of 80% probabilistic & 20% deterministic
● Iteratively and systematically engage data experts
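A minimal sketch of how deterministic rules and probabilistic matching might be combined, with uncertain matches routed to data experts. This is an illustration, not Tamr's implementation: the rule table, the target attribute list, the thresholds and the use of string similarity as a stand-in for a learned match score are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical deterministic rules: exact, hand-written column mappings.
RULES = {"cust_id": "customer_id", "dob": "date_of_birth"}

# Hypothetical unified schema that sources are mapped into.
TARGET_ATTRIBUTES = ["customer_id", "date_of_birth", "postal_code", "email_address"]


def similarity(a: str, b: str) -> float:
    """Crude stand-in for a learned match score, in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attribute(source_name: str, accept: float = 0.85, review: float = 0.5):
    """Return (target, score, decision) for one source column name."""
    # Deterministic path: a known rule always wins.
    if source_name in RULES:
        return RULES[source_name], 1.0, "rule"
    # Probabilistic path: pick the best-scoring target attribute.
    best = max(TARGET_ATTRIBUTES, key=lambda t: similarity(source_name, t))
    score = similarity(source_name, best)
    if score >= accept:
        return best, score, "auto"
    if score >= review:
        # Plausible but uncertain: queue the question for a data expert.
        return best, score, "ask_expert"
    return None, score, "unmatched"


if __name__ == "__main__":
    for name in ["cust_id", "Postal_Code", "email", "xyz"]:
        print(name, match_attribute(name))
```

In a sketch like this, expert answers to the "ask_expert" cases would be fed back as new rules or training examples, which is where the iterative engagement of data experts comes in.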
CORE OF TAMR
Machine Learning with Human Insight
Identify sources, understand relationships and curate the massive variety of siloed data
[Diagram labels]
● Structured and Semi-structured Data Sources
● Data Inventory
● Advanced Algorithms & Machine Learning
● Expert Input
● Collaborative Curation
● Data Experts (Source owners)
● Data Stewards and Curators
● Expert Directory
● Integrated Data & Metadata
● APIs
● Systems, Tools, Data Scientists
FORTUNE 5 BIOPHARMA
Challenges
• 7k+ scientists
• Decentralized organization
• Assay data in spreadsheets
• 30k+ tables
• 100k+ unique attributes
• Error detection in units
Tamr Unified View
Thousands of
Potential Sources
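One way "error detection in units" could be approached for assay values is sketched below. It is a simple heuristic chosen for illustration, not the method used in this engagement: it assumes that measurements of the same attribute should cluster, and flags values that sit roughly a power of 1000 away from the median (as would happen if nM were recorded as µM). The function name and tolerance are hypothetical.

```python
import math
from statistics import median


def flag_unit_suspects(values, tolerance=0.15):
    """Flag values that differ from the median by roughly a power of 1000.

    Returns a list of (index, value, steps), where steps is the number of
    factor-of-1000 jumps separating the value from the median.
    """
    positive = [v for v in values if v > 0]
    if not positive:
        return []
    mid = median(positive)
    suspects = []
    for index, value in enumerate(values):
        if value <= 0:
            continue  # log scale is undefined here; handle separately
        # Distance from the median, measured in factor-of-1000 steps.
        steps = math.log10(value / mid) / 3
        nearest = round(steps)
        if nearest != 0 and abs(steps - nearest) <= tolerance:
            suspects.append((index, value, nearest))
    return suspects


if __name__ == "__main__":
    readings = [12.0, 15.0, 11.0, 14000.0, 13.0]
    print(flag_unit_suspects(readings))
```

Flagged values are candidates for expert review rather than automatic correction, since a genuine outlier can look the same as a unit mistake.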
SOLUTION OVERVIEW: CDISC CONVERSION
The Problem
• Clinical trial data reported in wide variety of
formats, ontologies and standards
• Underspecified attribute names, varying
qualities of annotation, duplicate data, etc…
The Solution
• A scalable, replicable way to automatically
unify and convert clinical trial data to CDISC
format.
Benefit
• Tamr technology solves common CDISC problems: schema mapping and expert sourcing
• Faster way to aggregate and report ongoing trial data for regulatory filings
• Simplified reporting for various agency ontologies
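The schema-mapping step can be illustrated with a small sketch that renames source columns to CDISC SDTM Demographics (DM) variable names and sets aside anything it cannot map for expert review. The synonym table and function names are hypothetical, only a handful of DM variables are shown, and values are passed through unchanged (no controlled-terminology conversion is attempted).

```python
# Small subset of CDISC SDTM Demographics (DM) variable names.
SDTM_DM = {"STUDYID", "USUBJID", "AGE", "SEX", "RACE", "COUNTRY"}

# Hypothetical synonyms as they might appear in source spreadsheets.
SYNONYMS = {
    "study": "STUDYID",
    "subject id": "USUBJID",
    "patient id": "USUBJID",
    "age": "AGE",
    "gender": "SEX",
    "sex": "SEX",
    "race": "RACE",
    "country": "COUNTRY",
}

# Guard against typos in the synonym table.
assert set(SYNONYMS.values()) <= SDTM_DM


def normalize(name: str) -> str:
    """Lower-case and collapse separators so 'Patient_ID' matches 'patient id'."""
    return " ".join(name.replace("_", " ").replace("-", " ").lower().split())


def to_sdtm(record: dict):
    """Split one source record into mapped SDTM fields and unresolved columns."""
    mapped, unresolved = {}, []
    for column, value in record.items():
        target = SYNONYMS.get(normalize(column))
        if target is None:
            unresolved.append(column)  # candidate question for an expert
        else:
            mapped[target] = value
    return mapped, unresolved


if __name__ == "__main__":
    row = {"Patient_ID": "001", "Gender": "F", "Wt (kg)": 61}
    print(to_sdtm(row))
```

The unresolved list is the hook for expert sourcing: each unmapped column becomes a question for whoever knows that source best.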
TAMR
Thank You

Tamr | Biogen data unification imperative
