Enterprise Data Unification in Practice
IHAB ILYAS
Professor, University of Waterloo
Co-founder, Tamr, Inc.
@ihabilyas
Top-Down Data Integration Limits Data Quality and Connectedness
<10%
Enterprise data is siloed . . .
. . . expensive to connect & curate
[Chart axes: # of sources, $]
The Consequences:
• Limited data available
• Missed opportunity
• Ballooning costs
Hiring More Data Experts Is Not the Answer
Reality / Enterprise Reality / Goal
• Manual data collection and preparation
• Long lead-time to analyses
• Limited individual view on variety of data
• Extensive rework
• No cohesive view of data efforts
• Expertise across organization is underutilized
Data Curation: Many Definitions and One Goal
Extract Value from Data
“For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”
The New York Times, August 2014
Exploding Big Data Variety Will Make the Problem Worse
[Figure: Radical Increase in Data Variety, 2000 to 2011. Source: IDC 2011 Digital Universe Study]
Corporate databases
Semi-structured data
JSON Sources
Increasingly valuable
Missing Capability: Connecting and curating in an automated way
The Core of Tamr: Machine Learning with Human Insight
Identify sources, understand relationships and curate the massive variety of siloed data
[Diagram labels:]
Structured and Semi-structured Data Sources
Collaborative Curation
Data Experts (Source owners)
Data Stewards and Curators
Data Inventory
APIs
Systems
Tools
Data Scientists
Advanced Algorithms & Machine Learning
Expert Input
Integrated Data & Metadata
Expert Directory
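The idea of combining machine scoring with expert input can be illustrated with a small sketch. This is not Tamr's algorithm: it assumes a plain string-similarity score from Python's standard library, and the function names and the two thresholds are hypothetical. Confident pairs are decided automatically, and only the uncertain middle band is routed to a data expert.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(pairs, accept=0.9, reject=0.5):
    """Split candidate record pairs into matches, non-matches, and expert questions.

    The accept/reject thresholds are illustrative assumptions, not tuned values.
    """
    matched, unmatched, ask_expert = [], [], []
    for a, b in pairs:
        score = similarity(a, b)
        if score >= accept:
            matched.append((a, b))      # confident match, no human needed
        elif score < reject:
            unmatched.append((a, b))    # confident non-match
        else:
            ask_expert.append((a, b))   # uncertain: route to a source owner
    return matched, unmatched, ask_expert

# Hypothetical supplier names, invented for illustration.
candidate_pairs = [
    ("Acme Corp", "ACME CORP"),
    ("Acme Corp", "Acme Corporation Inc"),
    ("Acme Corp", "Zenith Ltd"),
]
matched, unmatched, ask_expert = triage(candidate_pairs)
```

In this sketch only the second pair ends up in `ask_expert`; in a real system the expert's answer would typically be fed back to improve later scoring.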
Demo
Example Use Cases
Solution Overview: Sourcing & Supply Chain Spend Optimization
The Problem
• Part/supplier data in ERPs, life cycle management systems, and catalogs across departments
• Inaccurate data / incongruent naming conventions
The Solution
• Create a unified schema that leverages all relevant data sources, including parts, procurement, logistical, and vendor data
Benefit
• Discover opportunities to optimize purchases across different suppliers and lines of business
[Diagram labels: Tamr Unified View; Hundreds of Potential Sources]
Solution Overview: Customer Data Integration
The Problem
• Customer data stored in CRMs, data warehouses, back-office applications, and other enriching sources
• Complexity of unifying personal data / incongruent naming conventions / data sparseness / manual entry
The Solution
• Create a holistic and adaptive customer view by unifying disparate data sources across the enterprise
Benefits
• Apply a unified and enriched customer view across multiple channels / lines of business
• Discover hidden opportunities to improve upsell / cross-sell, reduce churn, and identify key opinion leaders (KOL) via enhanced segmentation/targeting
Solution Overview: Clinical Trials
The Problem
• Clinical trial data is reported in a wide variety of formats, ontologies and standards
• Underspecified attribute names, varying qualities of annotation, duplicate data, etc.
The Solution
• Unify attribute names to build a common clinical trial data model
Benefits
• Ability to cluster clinical trials based on drug, target or investigator
• Easier way to aggregate and report ongoing trial data
• Simplified reporting for various agency ontologies
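Unifying attribute names can be sketched as a fuzzy lookup against a target model. The example below is a minimal illustration, not the method used in the talk: the unified attribute list is hypothetical, and the matching relies only on `difflib.get_close_matches` from Python's standard library.

```python
from difflib import get_close_matches

# Hypothetical target attributes for a common clinical trial data model.
UNIFIED_ATTRIBUTES = ["drug_name", "target", "investigator", "trial_id"]

def normalize(name: str) -> str:
    """Lower-case a source attribute name and unify its separators."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")

def map_attribute(source_name: str, cutoff: float = 0.6):
    """Return the closest unified attribute, or None if nothing is close enough."""
    hits = get_close_matches(normalize(source_name), UNIFIED_ATTRIBUTES, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Source column names as they might appear in different trial reports (invented).
source_columns = ["Drug Name", "Investigators", "Trial-ID", "xyz"]
mapping = {col: map_attribute(col) for col in source_columns}
```

Columns that map to `None` (here `"xyz"`) are the ones a curator would need to review by hand.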
Solution Overview: Medical Instruments
The Problem
• Instruments perform experiments at thousands of labs and hospitals across the world
• Data stored in inconsistent formats and standards across various labs and hospitals
The Solution
• Build a unified view of instruments leveraging all available internal/external data sources
Benefit
• Ability to cluster analysis based on instrument, location and other attributes
Tamr Architecture: a Data Curation Stack
Demo
Questions?
@Tamr_Inc
www.tamr.com

Tamr | cdo-summit


Editor's Notes

  • #3 1) You need to make decisions based on a variety of data. 2) Your data is siloed and expensive to connect and curate. 3) Because of this, you typically use less than 10% of your available data. Background/notes: On average, large enterprises have 2,000+ data sources (examples: a backend system that does data management and tracking; for others, any Excel spreadsheet or application populated by data). According to a 2012 PricewaterhouseCoopers study, “A global study on master data management,” of those, on average, fewer than 20 data sources are getting into the data - Also, I talked with a project manager who used to work for Informatica (see 'Srinivas Varada' in Evernote). He said that the range was from 5 to 40 sources. The 40-source deal was with Pfizer and was $10M in services.
  • #5 For that we … Ingest different formats and organize contents. Remove errors. Fill in missing information. Transform units and formats. Map and align attributes. Remove duplicate records. Fix integrity constraint violations.
  • #6 The number and diversity of data sources (private and public) is exploding, and a subset of the green area is becoming semi-structured.
  • #7 Tamr combines machine learning algorithms with human insight to identify sources, understand relationships and curate the massive variety of siloed data.
  • #11 Clinical trial data is reported in a wide variety of formats / ontologies / standards. Duplicate data, underspecified attribute names, varying qualities of annotation. Unify attribute names to build a common clinical trial data model. Ability to cluster clinical trials based on drug, target, investigator. Easier to aggregate and report ongoing trial data via approved standards and formats. Simplified reporting for various agency ontologies.
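
Several of the curation steps listed in note #5 (map and align attributes, transform units, fill in missing information, remove duplicate records) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the field names, the default values, and the sample part records are all hypothetical, and error removal and integrity-constraint repair are left out.

```python
# Hypothetical field names, defaults, and records, invented for illustration.
ATTRIBUTE_MAP = {"PartNo": "part_id", "Wt (lb)": "weight_lb", "Vendor": "vendor"}
DEFAULTS = {"vendor": "UNKNOWN"}
LB_TO_KG = 0.45359237

def map_attributes(record):
    """Rename source attributes to unified names; unknown keys are kept as-is."""
    return {ATTRIBUTE_MAP.get(key, key): value for key, value in record.items()}

def transform_units(record):
    """Convert a pound weight to kilograms, rounded to two decimals."""
    out = dict(record)
    if "weight_lb" in out:
        out["weight_kg"] = round(out.pop("weight_lb") * LB_TO_KG, 2)
    return out

def fill_missing(record):
    """Fill absent or empty fields with a default value."""
    out = dict(record)
    for key, default in DEFAULTS.items():
        if out.get(key) in (None, ""):
            out[key] = default
    return out

def remove_duplicates(records, key="part_id"):
    """Keep the first record seen for each key value."""
    seen, unique = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

def curate(records):
    """Run the per-record steps, then deduplicate across all sources."""
    cleaned = [fill_missing(transform_units(map_attributes(r))) for r in records]
    return remove_duplicates(cleaned)

raw = [
    {"PartNo": "A-1", "Wt (lb)": 10, "Vendor": "Acme"},   # source A
    {"PartNo": "A-1", "Wt (lb)": 10, "Vendor": "Acme"},   # exact duplicate
    {"part_id": "B-7", "weight_kg": 2.5, "vendor": ""},   # source B, vendor missing
]
unified = curate(raw)
```

Here `unified` holds two records: part A-1 with its weight converted to kilograms, and part B-7 with the missing vendor filled by the default.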