Tackling Data Curation in
Three Generations
Mike Stonebraker
Silos everywhere….
The Current State of Affairs
By the Numbers
Number of data
stores in a typical
enterprise:
5,000
Number of data
stores in a LARGE
telco company:
10,000
• Enterprises are divided into business units, which are typically
independent
• With independent data stores
• One large money center bank had hundreds
• The last time I looked
Why so many data stores?
• Enterprises buy other enterprises
• With great regularity
• Such acquired silos are difficult to remove
• Customer contracts
• Different mechanisms for treating employees, retirees ….
Why so many data stores?
• CFO’s budget is on a spreadsheet on his PC
• Lots of Excel data
• And there is public data from the web with business value
• Weather, population, census tracts, ZIP codes …
• Data.gov
Not to Mention . . .
• Business units are independent
• Different customer ids, product ids, …
• Enterprises have tried to construct such models in the past…..
• Multi-year project
• Out-of-date on day 1 of the project, let alone on the proposed
completion date
• Standards are difficult
• Remember how difficult it is to stamp out multiple DBMSs in an
enterprise
• Let alone Macs…
And there is NO Global Data Model
• The sins of your predecessors
• Your CEO is not in IT
• May not have the COBOL source code
• Politics
• Data is power
Lots of Silos is a Fact of Life
• Cross selling
• Combining procurement orders
• To get better pricing
• Social networking
• People working on the same thing
• Rollups/better information
• How many employees do we have?
• Etc….
Why Integrate Silos?
• Biggest problem facing many
enterprises
Data Integration is a VERY Big Deal
• Ingest
• The data source
• Validate
• Have to get rid of (or correct) garbage
• Transform
• E.g., euros to dollars; airport code to city name
• Match Schemas
• Your salary is my wages
• Consolidate (dedup, i.e., entity resolution)
• E.g., Mike Stonebraker and Michael Stonebraker
Requirement: Data Curation
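The five curation steps above can be sketched end to end. This is a minimal illustration, not how any particular product works: the lookup tables (`AIRPORT_TO_CITY`, `SCHEMA_MAP`, `NICKNAMES`) and the sample records are invented for the example, and a real system would learn or curate them rather than hard-code them.

```python
# All tables below are hypothetical stand-ins for real reference data.
AIRPORT_TO_CITY = {"BOS": "Boston"}   # transform: airport code -> city name
SCHEMA_MAP = {"wages": "salary"}      # match schemas: "your salary is my wages"
NICKNAMES = {"mike": "michael"}       # consolidate: Mike == Michael

def validate(rec):
    """Reject garbage; here that just means a missing or blank name."""
    return bool(str(rec.get("name", "")).strip())

def match_schema(rec):
    """Rename source attributes to the shared attribute names."""
    return {SCHEMA_MAP.get(key, key): value for key, value in rec.items()}

def transform(rec):
    """Replace an airport code with a city name (None if the code is unknown)."""
    out = dict(rec)
    if "airport" in out:
        out["city"] = AIRPORT_TO_CITY.get(out.pop("airport"))
    return out

def entity_key(rec):
    """Normalize the first name so nickname variants share one key."""
    first, *rest = rec["name"].lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])

def curate(records):
    consolidated = {}
    for rec in records:                                # ingest
        if not validate(rec):                          # validate
            continue
        rec = transform(match_schema(rec))             # match schemas, transform
        consolidated.setdefault(entity_key(rec), rec)  # consolidate: first record wins
    return list(consolidated.values())

records = [
    {"name": "Mike Stonebraker", "salary": 100.0, "airport": "BOS"},
    {"name": "Michael Stonebraker", "wages": 80.0, "airport": "BOS"},
    {"name": "", "wages": 5.0},
]
curated = curate(records)
```

Keeping the first record per entity is the simplest possible merge policy; choosing which values survive a merge is itself a curation decision.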
• Gen 1 (1990s): Traditional ETL
• Gen 2 (2000s): ETL on steroids
• Gen 3 (appearing now): Scalable Data Curation
Three Generations of Data Curation Products
• Retail sector started integrating sales data into a data warehouse in the
mid-1990s
• To make better stock decisions
• Pet rocks are out, Barbie dolls are in
• Tie up the Barbie doll factory with a big order
• Send the pet rocks back or discount them up front
• Warehouse paid for itself within 6 months with smarter buying
decisions!
Gen 1 (Early Data Warehouses)
• Essentially all enterprises followed suit and built warehouses of
customer-facing data
• Serviced by so-called Extract-Transform-and-Load (ETL) tools
The Pile-On
• Average system was 2-3X over budget
• and 2-3X late
• Because of data integration headaches
The Dark Side . . .
• Bought $100K of widgets from IBM, Inc.
• Bought 800K Euros of m-widgets from IBM, SA
• Bought -9999 of *wids* from 500 Madison Ave., NY, NY 10022
• Insufficient/incomplete meta-data: May not know that 800K is in Euros
• Missing data: -9999 is a code for “I don’t know”
• Dirty data: *wids* means what?
Why is Data Integration Hard?
• Bought $100K of widgets from IBM, Inc.
• Bought 800K Euros of m-widgets from IBM, SA
• Bought -9999 of *wids* from 500 Madison Ave., NY, NY 10022
• Disparate fields: Have to translate currencies to a common form
• Entity resolution: Is IBM, SA the same as IBM, Inc.?
• Entity resolution: Are m-widgets the same as widgets?
Why is Data Integration Hard?
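A small sketch of the first two problems on the purchase records: missing-value codes and amounts whose currency is not recorded. The exchange rate and the structured `(amount, currency)` form of the records are assumptions made for the example; the point is that the code refuses to guess and reports an issue instead.

```python
MISSING_CODES = {-9999}                    # -9999 is a code for "I don't know"
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}   # assumed rates, for illustration only

def to_usd(amount, currency):
    """Return (amount_in_usd, issue); exactly one of the two is None."""
    if amount in MISSING_CODES:
        return None, "missing value"
    if currency not in RATES_TO_USD:
        # Incomplete metadata: converting anyway would silently corrupt data.
        return None, "unknown currency"
    return amount * RATES_TO_USD[currency], None

purchases = [
    (100_000, "USD"),   # $100K of widgets
    (800_000, "EUR"),   # 800K euros of m-widgets
    (-9999, None),      # missing amount
    (800_000, None),    # amount present, currency not recorded
]
results = [to_usd(amount, currency) for amount, currency in purchases]
```

Entity resolution (is IBM, SA the same as IBM, Inc.?) is not addressed here; it needs domain knowledge, not a lookup table.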
[Diagram: local data source(s), each with a local schema; ETL; data warehouse with a global schema]
ETL Architecture
• Human defines a global schema
• Up front
• Assign a programmer to each data source to
• Understand it
• Write local to global mapping (in a scripting language)
• Write cleaning routine
• Run the ETL
• Scales to (maybe) 25 data sources
• Twist my arm, and I will give you 50
Traditional ETL Wisdom
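What "a programmer per data source" looks like in practice can be sketched as one hand-written mapping function per source. The source names and field names here are hypothetical; the global schema is fixed up front, as in the traditional approach.

```python
# Global schema defined up front by a human.
GLOBAL_SCHEMA = ("customer_id", "product", "amount_usd")

def map_billing(row):
    """Local-to-global mapping for a billing system that stores dollars as text."""
    return {"customer_id": row["cust_no"], "product": row["item"],
            "amount_usd": float(row["usd"])}

def map_webshop(row):
    """Local-to-global mapping for a web shop that stores integer cents."""
    return {"customer_id": row["user"], "product": row["sku_name"],
            "amount_usd": row["cents"] / 100}

# One hand-written mapping per source: this is the part that does not scale.
MAPPINGS = {"billing": map_billing, "webshop": map_webshop}

def run_etl(sources):
    """Apply each source's mapping and load the rows into one warehouse list."""
    warehouse = []
    for source_name, rows in sources.items():
        to_global = MAPPINGS[source_name]
        warehouse.extend(to_global(row) for row in rows)
    return warehouse

warehouse = run_etl({
    "billing": [{"cust_no": "C1", "item": "widget", "usd": "100"}],
    "webshop": [{"user": "C2", "sku_name": "m-widget", "cents": 2500}],
})
```

Every new source means another function written by someone who first has to understand that source, which is why the approach tops out at a few dozen sources.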
• Bigger global schema upfront is really hard
• Too much manual heavy lifting
• By a trained programmer
• No automation
Why?
Gen 2 – Curation Tools Added to ETL
• Deduplication systems
– For addresses, names, …
• Outlier detection for data cleaning
• Standard domains for data cleaning
• …
• Augments the generation 1 architecture
– Still only scales to 25 data sources!
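As one example of a Gen 2 cleaning tool, outlier detection can be sketched with the modified z-score based on the median absolute deviation (MAD). This particular rule and its cutoff of 3.5 are a common textbook choice, not something the slides specify.

```python
from statistics import median

def mad_outliers(values, cutoff=3.5):
    """Return the values whose modified z-score exceeds the cutoff."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > cutoff]

order_sizes = [98, 102, 100, 101, 99, -9999]
suspects = mad_outliers(order_sizes)
```

Here the missing-value code -9999 is flagged. A median-based rule is used because a single extreme value barely moves the median, whereas it would distort a mean and standard deviation.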
• Enterprises want to integrate more and more data sources
– Milwaukee beer example
• Weather data
• Business analysts have an insatiable demand for “MORE”
Current Situation
• Enterprises want to integrate more and more data sources
– Big Pharma example
• Has a traditional data warehouse of bio assay data
• Has ~3,000 scientists doing “wet” biology and chemistry across multiple
types of experiments
• And writing results in an electronic lab notebook (think 27,000
spreadsheets)
• No standard vocabulary (Is an ICU-50 the same as an ICE-50?) – both are
biophysical parameters of drugs
• No standard units and units may not even be recorded
• No standard language (e.g., English)
• Variable encoding (some results are numeric, some are text, some are
numbers stored as text with text comments!)
Current Situation
• Enterprises want to integrate more and more data sources
– Web aggregator example
• Currently integrating 80,000 web URLs
• With “event” and “things to do” data
• All the standard headaches
– At scale 80,000
Current Situation
• Traditional ETL won’t scale to these kinds of numbers
– Too much manual effort
– I.e., traditional ETL is way too heavy-weight!!!
• Also a personnel mismatch
– Are widgets and m-widgets the same thing?
– Only a business expert knows the answer
– The ETL programmer certainly does not!!!!
Current Situation
Gen 3: Scalability
• Must pick the low-hanging fruit automatically
– Machine learning
– Statistics
• Rarely an upfront global schema
– Must build it “bottom up”
• Must involve human (non-programmer) experts to help with the
cleaning
Tamr is an example of this 3rd generation!
[Diagram: Ingest, Schema integration, Crowd sourcing, De-duplication, Vis/XForm, Cleaning, Tamr Console, RDBMS]
Tamr Architecture
• Starts integrating data sources
– Using synonyms, templates, and authoritative tables for help
– 1st couple of sources may require help from the human experts
– System learns over time and gets better and better
Tamr – Schema Integration
Tamr – Schema Integration
• Inner loop is a collection of “experts” (programs)
• T-test on the data
• Cosine similarity on attribute names
• Cosine similarity on the data
• Scores combined heuristically
• After modest training, gets 90+% of the matching attributes
automatically
• In several domains
• Cuts human cost dramatically!!!
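Two of the "experts" named above, cosine similarity on attribute names and on the data, can be sketched as follows. The choice of character trigrams, the equal weights, and all function names are assumptions for illustration; the slides only say that scores are combined heuristically, and the T-test expert is omitted.

```python
from collections import Counter
from math import sqrt

def ngrams(text, n=3):
    """Character n-gram counts of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def name_score(attr_a, attr_b):
    """Expert 1: similarity of the attribute names."""
    return cosine(ngrams(attr_a), ngrams(attr_b))

def data_score(values_a, values_b):
    """Expert 2: similarity of the values stored in the two columns."""
    return cosine(Counter(map(str, values_a)), Counter(map(str, values_b)))

def match_score(attr_a, values_a, attr_b, values_b, w_name=0.5, w_data=0.5):
    """Heuristic combination: a weighted sum of the two expert scores."""
    return w_name * name_score(attr_a, attr_b) + w_data * data_score(values_a, values_b)

score = match_score("city", ["Boston", "Chicago"], "town", ["Boston", "Denver"])
```

In this example the names "city" and "town" share no trigrams, so the match is carried entirely by the overlapping data values. This is why several experts are combined rather than relying on names alone.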
• Hierarchy of experts
• With specializations
• With algorithms to adjust the “expertness” of experts
• And a marketplace to perform load balancing
• Working well at scale!!!
• Biggest problem: getting the experts to participate.
Tamr – Expert Sourcing
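One way to picture "adjusting the expertness of experts" is a weighted vote whose weights move toward 1 when an expert is right and toward 0 when wrong. This is a generic sketch under that assumption, not Tamr's actual algorithm; the expert names, starting weights, and learning rate are invented.

```python
def weighted_answer(votes, expertness, default=0.5):
    """Pick the answer with the largest total expertness behind it."""
    totals = {}
    for expert, answer in votes.items():
        totals[answer] = totals.get(answer, 0.0) + expertness.get(expert, default)
    return max(totals, key=totals.get)

def update_expertness(expertness, votes, truth, rate=0.1, default=0.5):
    """Nudge each voter's weight toward 1 if correct, toward 0 if not."""
    updated = dict(expertness)
    for expert, answer in votes.items():
        target = 1.0 if answer == truth else 0.0
        old = updated.get(expert, default)
        updated[expert] = old + rate * (target - old)
    return updated

# Question put to the experts: are widgets and m-widgets the same thing?
expertness = {"alice": 0.9, "bob": 0.4, "carol": 0.4}
votes = {"alice": "same", "bob": "different", "carol": "different"}

answer = weighted_answer(votes, expertness)
expertness_after = update_expertness(expertness, votes, truth=answer)
```

Treating the weighted answer as the truth, as the last line does, is a simplification; a real system would more plausibly update weights only once an answer has been independently confirmed.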
• Can adjust the threshold for automatic acceptance
• Cost-accuracy tradeoff
• Even if a human checks everything (threshold is certainty), you still
save money -- Tamr organizes the information and makes humans
more productive
Tamr – Entity Consolidation
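The adjustable acceptance threshold can be sketched as a simple router: candidate matches scoring above the threshold are merged automatically, the rest go to a human. The scores and the function name are hypothetical, and scores are assumed to lie in [0, 1].

```python
def route_matches(scored_pairs, auto_accept=0.9):
    """Split candidate matches into auto-accepted and human-review lists."""
    accepted, review = [], []
    for pair, score in scored_pairs:
        # Strict comparison: with auto_accept=1.0 nothing is auto-accepted.
        (accepted if score > auto_accept else review).append(pair)
    return accepted, review

scored = [
    (("IBM, Inc.", "IBM, SA"), 0.95),
    (("widgets", "m-widgets"), 0.60),
]
accepted, review = route_matches(scored, auto_accept=0.9)
nothing_auto, all_review = route_matches(scored, auto_accept=1.0)
```

Lowering `auto_accept` cuts human cost at the price of more wrong merges; setting it to 1.0 corresponds to the case where a human checks everything but still works from an organized queue.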
• A major consolidator of financial data
• Entity consolidation and expert sourcing on a collection of internal
and external sources
• ROI relative to existing homebrew system
• A major manufacturing conglomerate
• Combine disparate ERP systems
• ROI is better procurement
Tamr Customer Success Stories
• A major bio-pharm company
• Combining inputs from 2000 medical-diagnostic pieces of
equipment by equipment type
• Decision support – how is stuff used?
• ROI is order-of-magnitude faster integration
• A major car company
• Customer data from multiple countries in Europe
• ROI is better marketing across a continent
• ROI is more effective sales engagement
Tamr Customer Success Stories
• Text sources
• Relationships
• More adaptors for different data sources and sinks
• Better algorithms
• User-defined operations
• For popular cleaning tools like Google Refine
• Web transformation tool
• Syntactic transformations (e.g., dates)
• Semantic transformations (e.g., airport codes)
Tamr Future
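A syntactic transformation such as date normalization can be sketched with the standard library. The list of candidate formats is an assumption; in particular, slash-separated dates are read as month/day/year here, which is a guess that real data may contradict.

```python
from datetime import datetime

# Candidate input formats, tried in order; US-style month/day is assumed.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")

def normalize_date(text):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

dates = [normalize_date(s) for s in ("2014-10-16", "10/16/2014", "16 Oct 2014", "soon")]
```

A semantic transformation such as airport code to city name differs in kind: it needs a reference table rather than a parsing rule.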
www.tamr.com
Thank you!

 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 

Tamr | Strata Hadoop 2014 | Michael Stonebraker

  • 1. Tackling Data Curation in Three Generations Mike Stonebraker
  • 3. By the Numbers • Number of data stores in a typical enterprise: 5,000 • Number of data stores in a LARGE telco company: 10,000
  • 4. Why so many data stores? • Enterprises are divided into business units, which are typically independent • With independent data stores • One large money center bank had hundreds • The last time I looked
  • 5. Why so many data stores? • Enterprises buy other enterprises • With great regularity • Such acquired silos are difficult to remove • Customer contracts • Different mechanisms for treating employees, retirees ….
  • 6. Not to Mention . . . • CFO’s budget is on a spreadsheet on his PC • Lots of Excel data • And there is public data from the web with business value • Weather, population, census tracts, ZIP codes … • Data.gov
  • 7. And there is NO Global Data Model • Business units are independent • Different customer ids, product ids, … • Enterprises have tried to construct such models in the past….. • Multi-year project • Out-of-date on day 1 of the project, let alone on the proposed completion date • Standards are difficult • Remember how difficult it is to stamp out multiple DBMSs in an enterprise • Let alone Macs…
  • 8. Lots of Silos is a Fact of Life • The sins of your predecessors • Your CEO is not in IT • May not have the COBOL source code • Politics • Data is power
  • 9. Why Integrate Silos? • Cross selling • Combining procurement orders • To get better pricing • Social networking • People working on the same thing • Rollups/better information • How many employees do we have? • Etc….
  • 10. Data Integration is a VERY Big Deal • Biggest problem facing many enterprises
  • 11. Requirement: Data Curation • Ingest • The data source • Validate • Have to get rid of (or correct) garbage • Transform • E.g., Euros to dollars; airport code to city name • Match Schemas • Your salary is my wages • Consolidate (dedup) (entity resolution) • E.g., Mike Stonebraker and Michael Stonebraker
  • 12. Three Generations of Data Curation Products • Gen 1 (1990s): Traditional ETL • Gen 2 (2000s): ETL on steroids • Gen 3 (appearing now): Scalable Data Curation
  • 13. Gen 1 (Early Data Warehouses) • Retail sector started integrating sales data into a data warehouse in the mid-1990s • To make better stock decisions • Pet rocks are out, Barbie dolls are in • Tie up the Barbie doll factory with a big order • Send the pet rocks back or discount them up front • Warehouse paid for itself within 6 months with smarter buying decisions!
  • 14. The Pile-On • Essentially all enterprises followed suit and built warehouses of customer-facing data • Serviced by so-called Extract-Transform-and-Load (ETL) tools
  • 15. The Dark Side . . . • Average system was 2-3X over budget • And 2-3X late • Because of data integration headaches
  • 16. Why is Data Integration Hard? • Bought $100K of widgets from IBM, Inc. • Bought 800K Euros of m-widgets from IBM, SA • Bought -9999 of *wids* from 500 Madison Ave., NY, NY 10022 • Insufficient/incomplete meta-data: May not know that 800K is in Euros • Missing data: -9999 is a code for “I don’t know” • Dirty data: *wids* means what?
  • 17. Why is Data Integration Hard? • Bought $100K of widgets from IBM, Inc. • Bought 800K Euros of m-widgets from IBM, SA • Bought -9999 of *wids* from 500 Madison Ave., NY, NY 10022 • Disparate fields: Have to translate currencies to a common form • Entity resolution: Is IBM, SA the same as IBM, Inc.? • Entity resolution: Are m-widgets the same as widgets?
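
The three purchase records on slides 16 and 17 can be turned into a small cleaning sketch. This is only an illustration of the validate and transform steps, not code from the talk: the `Purchase` type, the `to_usd` helper, and the fixed `EUR_TO_USD` rate are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MISSING_SENTINEL = -9999   # the slides' code for "I don't know"
EUR_TO_USD = 1.25          # hypothetical fixed rate, for illustration only


@dataclass
class Purchase:
    amount: float
    currency: Optional[str]  # None when the metadata does not say
    product: str
    supplier: str


def to_usd(p: Purchase) -> Optional[float]:
    """Return the amount in dollars, or None when it cannot be determined."""
    if p.amount == MISSING_SENTINEL:
        return None            # missing data: never treat -9999 as a real amount
    if p.currency == "USD":
        return p.amount
    if p.currency == "EUR":
        return p.amount * EUR_TO_USD
    return None                # incomplete metadata: currency unknown


records = [
    Purchase(100_000, "USD", "widgets", "IBM, Inc."),
    Purchase(800_000, "EUR", "m-widgets", "IBM, SA"),
    Purchase(-9999, None, "*wids*", "500 Madison Ave., NY, NY 10022"),
]
usd_amounts = [to_usd(r) for r in records]
```

The sketch handles only the mechanical problems (sentinel values and currency conversion). Whether IBM, SA is IBM, Inc., or whether m-widgets are widgets, is the entity-resolution question that no conversion table answers.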
  • 18. ETL Architecture • Diagram: local data source(s), each with a local schema, feed through ETL into a data warehouse with a global schema
  • 19. Traditional ETL Wisdom • Human defines a global schema • Up front • Assign a programmer to each data source to • Understand it • Write local to global mapping (in a scripting language) • Write cleaning routine • Run the ETL • Scales to (maybe) 25 data sources • Twist my arm, and I will give you 50
  • 20. Why? • Bigger global schema upfront is really hard • Too much manual heavy lifting • By a trained programmer • No automation
  • 21. Gen 2 – Curation Tools Added to ETL • Deduplication systems – For addresses, names, … • Outlier detection for data cleaning • Standard domains for data cleaning • … • Augments the generation 1 architecture – Still only scales to 25 data sources!
  • 22. Current Situation • Enterprises want to integrate more and more data sources – Milwaukee beer example • Weather data • Business analysts have an insatiable demand for “MORE”
  • 23. Current Situation • Enterprises want to integrate more and more data sources – Big Pharma example • Has a traditional data warehouse of bio assay data • Has ~3,000 scientists doing “wet” biology and chemistry across multiple types of experiments • And writing results in an electronic lab notebook (think 27,000 spreadsheets) • No standard vocabulary (Is an ICU-50 the same as an ICE-50?) – both are biophysical parameters of drugs • No standard units and units may not even be recorded • No standard language (e.g., English) • Variable encoding (some results are numeric, some are text, some are numbers stored as text with text comments!)
  • 24. Current Situation • Enterprises want to integrate more and more data sources – Web aggregator example • Currently integrating 80,000 web URLs • With “event” and “things to do” data • All the standard headaches – At scale 80,000
  • 25. Current Situation • Traditional ETL won’t scale to these kinds of numbers – Too much manual effort – I.e., traditional ETL way too heavy-weight!!! • Also a personnel mismatch – Are widgets and m-widgets the same thing? – Only a business expert knows the answer – The ETL programmer certainly does not!!!!
  • 26. Gen 3: Scalability • Must pick the low-hanging fruit automatically – Machine learning – Statistics • Rarely an upfront global schema – Must build it “bottom up” • Must involve human (non-programmer) experts to help with the cleaning • Tamr is an example of this 3rd generation!
  • 28. Tamr – Schema Integration • Starts integrating data sources – Using synonyms, templates, and authoritative tables for help – 1st couple of sources may require help from the human experts – System learns over time and gets better and better
  • 29. Tamr – Schema Integration • Inner loop is a collection of “experts” (programs) • T-test on the data • Cosine similarity on attribute names • Cosine similarity on the data • Scores combined heuristically • After modest training, gets 90+% of the matching attributes automatically • In several domains • Cuts human cost dramatically!!!
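
Slide 29 names cosine similarity on attribute names and on the data as two of the "experts" whose scores are combined heuristically. The sketch below shows one way such a scorer could look; it is not Tamr's implementation. The character-trigram representation, the equal weighting in `match_score`, and all function names are assumptions, and the t-test expert is left out.

```python
import math
from collections import Counter


def ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a padded, lower-cased string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def name_similarity(name_a: str, name_b: str) -> float:
    """Expert 1: compare the attribute names themselves."""
    return cosine(ngrams(name_a), ngrams(name_b))


def data_similarity(values_a: list[str], values_b: list[str]) -> float:
    """Expert 2: compare the bags of values stored under each attribute."""
    return cosine(Counter(v.lower() for v in values_a),
                  Counter(v.lower() for v in values_b))


def match_score(name_a, values_a, name_b, values_b, name_weight=0.5):
    """Heuristic combination: a weighted average (the weight is an assumption)."""
    return (name_weight * name_similarity(name_a, name_b)
            + (1 - name_weight) * data_similarity(values_a, values_b))


score = match_score("salary", ["50000", "62000", "71000"],
                    "wages", ["62000", "50000", "48000"])
```

In this toy case the names "salary" and "wages" share no trigrams, so the name expert scores zero and only the overlap in the data produces a non-zero match score. That is the reason for running several experts rather than relying on names alone.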
  • 30. Tamr – Expert Sourcing • Hierarchy of experts • With specializations • With algorithms to adjust the “expertness” of experts • And a marketplace to perform load balancing • Working well at scale!!! • Biggest problem: getting the experts to participate.
  • 31. Tamr – Entity Consolidation • Can adjust the threshold for automatic acceptance • Cost-accuracy tradeoff • Even if a human checks everything (threshold is certainty), you still save money -- Tamr organizes the information and makes humans more productive
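
The acceptance threshold on slide 31 can be pictured as a simple routing rule: candidate matches scoring at or above the threshold are accepted automatically, and the rest are queued for a human expert. The following is a minimal sketch under that reading; `route_matches` and the example scores are hypothetical, not taken from Tamr.

```python
import math


def route_matches(scored_pairs, accept_threshold):
    """Split candidate matches into auto-accepted pairs and pairs for human review."""
    auto_accepted, needs_review = [], []
    for pair, score in scored_pairs:
        if score >= accept_threshold:
            auto_accepted.append(pair)
        else:
            needs_review.append(pair)
    return auto_accepted, needs_review


# Hypothetical similarity scores for three candidate duplicate pairs.
candidates = [
    (("Mike Stonebraker", "Michael Stonebraker"), 0.93),
    (("IBM, Inc.", "IBM, SA"), 0.71),
    (("widgets", "m-widgets"), 0.64),
]

# A lower threshold costs less human time but risks more wrong merges.
auto, review = route_matches(candidates, accept_threshold=0.9)

# "Threshold is certainty": nothing is accepted without a human look.
auto_strict, review_strict = route_matches(candidates, accept_threshold=math.inf)
```

Even in the strict setting the system has already paired up and ranked the candidates, which is where the slide locates the savings when a human checks everything.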
  • 32. Tamr Customer Success Stories • A major consolidator of financial data • Entity consolidation and expert sourcing on a collection of internal and external sources • ROI relative to existing homebrew system • A major manufacturing conglomerate • Combine disparate ERP systems • ROI is better procurement
  • 33. Tamr Customer Success Stories • A major bio-pharm company • Combining inputs from 2000 medical-diagnostic pieces of equipment by equipment type • Decision support – how is stuff used? • ROI is order-of-magnitude faster integration • A major car company • Customer data from multiple countries in Europe • ROI is better marketing across a continent • ROI is more effective sales engagement
  • 34. Tamr Future • Text sources • Relationships • More adaptors for different data sources and sinks • Better algorithms • User-defined operations • For popular cleaning tools like Google Refine • Web transformation tool • Syntactic transformations (e.g., dates) • Semantic transformations (e.g., airport codes)