COMBINING HUMAN & MACHINE INTELLIGENCE TO
SUCCESSFULLY INTEGRATE BIOMEDICAL DATA
TIMOTHY DANFORD | TAMR, INC.
THE DATA INTEGRATION PROBLEM
● flat files: every file has its own columns
● bioinformatics: every tool has its own
file format
● graph data: RDF, OWL, “knowledge
graphs”
● proprietary / legacy formats: SAS,
DBF
● relational databases: inconsistent data
models
Biomedical Data Integration is a
Constantly Moving Target
THE DATA INTEGRATION PROBLEM
● One solution: hire or train data curators
who understand the subject area
● Benefits: accuracy
● Problems
o Low bandwidth
o Difficult to scale to larger problems
o Recording decisions
o Consistency between curators
Data Curation Teams Do Not Scale
THE DATA INTEGRATION PROBLEM
● Build an automated or rules-based
system to perform data integration
● Benefits: scale
● Problems
o Accuracy, edge-cases
o Programmers do not scale
o Out-of-band communication
o Expensive to maintain
o Brittle in the face of new data
Rule-based Integration Is Brittle
TAMR AUTOMATES DATA INTEGRATION
● Solution: combine learning rules with
asking experts
● Modern machine learning techniques
o semi-supervised learning
o active learning
● Benefits
o speed of an automated system
o accuracy of human experts
o auditability
o responds well to changing
requirements
Use Probabilistic Rules with Active
Learning
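A minimal sketch of the active learning idea above, assuming a model has already assigned each candidate match a probability. Uncertainty sampling sends experts the cases the model is least sure about. The function name, the pair ids, and the scores are hypothetical illustrations, not Tamr's implementation.

```python
def select_questions(scores, budget):
    """Return the `budget` candidate ids whose match probability is closest to 0.5.

    `scores` maps a candidate id to the model's estimated probability that
    the candidate is a true match (an assumed input for this sketch).
    """
    ranked = sorted(scores, key=lambda cid: abs(scores[cid] - 0.5))
    return ranked[:budget]


# Hypothetical model outputs for four candidate pairs.
scores = {"pair_a": 0.97, "pair_b": 0.52, "pair_c": 0.08, "pair_d": 0.41}

# The two most uncertain pairs go to a subject matter expert; confident
# predictions (pair_a, pair_c) are handled automatically.
questions = select_questions(scores, budget=2)
print(questions)
```

Expert answers to these questions would then be added to the training data, so each round of feedback is spent where it changes the model most.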
TAMR AUTOMATES DATA INTEGRATION
● Build a unified schema and link it to
source attributes
● Engage subject matter experts to
answer questions
● Automate data transformation
● Eliminate redundant records with de-duplication
Tamr Combines Machine Learning
and Expert Feedback
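The de-duplication step can be illustrated with a small sketch, assuming records are free-text strings and using string similarity from Python's standard `difflib` in place of a learned model. The threshold of 0.85 and the sample records are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(value.lower().split())


def similarity(a, b):
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical records from two sources describing the same compound.
records = [
    "Acetylsalicylic Acid 100 mg",
    "acetylsalicylic  acid 100mg",
    "Ibuprofen 200 mg",
]
duplicates = find_duplicates(records)
print(duplicates)
```

Comparing every pair is quadratic in the number of records, so a sketch like this only suits small inputs; pairs scoring near the threshold are natural candidates for expert review.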
CASE STUDY: CLINICAL STUDY DATA
● Clinical study data integration is motivated
by a single schema: CDISC
o mandated by FDA for data submission
o common schema for clinical data
warehouses
● Mostly performed by SAS scripting today
● Tamr learns attribute mapping and
transformations using human feedback
An Example: Clinical Study Data Integration
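As a rough sketch of attribute mapping with human feedback, the example below checks expert-confirmed mappings first and falls back to name similarity otherwise. The target names USUBJID, SEX, and AGE are CDISC SDTM demographics variables used here only as an illustrative target list; the function, the 0.6 cutoff, and the source attribute names are assumptions, not the method described in the talk.

```python
from difflib import SequenceMatcher

# Illustrative subset of target attributes (assumed for this sketch).
TARGET_ATTRIBUTES = ["USUBJID", "SEX", "AGE"]


def suggest_mapping(source_attr, confirmed, min_score=0.6):
    """Return (target, score, origin) for one source attribute.

    `confirmed` holds mappings an expert has already approved; these always
    win over similarity-based suggestions.
    """
    if source_attr in confirmed:
        return confirmed[source_attr], 1.0, "expert"
    best, best_score = None, 0.0
    for target in TARGET_ATTRIBUTES:
        score = SequenceMatcher(None, source_attr.lower(), target.lower()).ratio()
        if score > best_score:
            best, best_score = target, score
    if best_score < min_score:
        # Too uncertain to map automatically: route to an expert.
        return None, best_score, "needs_review"
    return best, best_score, "suggested"


# Hypothetical feedback collected so far.
confirmed = {"patient_id": "USUBJID"}

print(suggest_mapping("patient_id", confirmed))  # taken from expert feedback
print(suggest_mapping("Age", confirmed))         # close name match
print(suggest_mapping("gender", confirmed))      # no close name, needs review
```

The "gender" case shows why feedback matters: its intended target (SEX) shares almost no characters with it, so name similarity alone cannot find the mapping and an expert answer has to supply it.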
Thank You
THE BIOMEDICAL DATA INTEGRATION PROBLEM
● Fundamentally, many scientific analyses are tabular
o rows are ‘entities’
o columns are ‘attributes’
● graphs (paths) and hierarchies (part/whole) are other shapes
● tables emphasize independence of entities and attributes
Tabular Datasets are a Core Data Shape
THE BIOMEDICAL DATA INTEGRATION PROBLEM
● Column-oriented: Find the matching attributes
● Row-oriented: Discover duplicate entities
Data Integration Proceeds In Two Directions
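The column-oriented direction can be sketched by comparing the values two columns contain rather than their names. The example assumes each source is a dict from column name to a list of values and scores column pairs by Jaccard overlap of their distinct values; the sources and column names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of the distinct values in two columns."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_columns(left, right):
    """For each column in `left`, return the best-overlapping column in `right`."""
    matches = {}
    for lname, lvalues in left.items():
        best = max(right, key=lambda rname: jaccard(lvalues, right[rname]))
        matches[lname] = (best, jaccard(lvalues, right[best]))
    return matches


# Two hypothetical sources that name the same attributes differently.
lab_a = {"sex": ["M", "F", "F"], "site": ["BOS", "NYC", "BOS"]}
lab_b = {"gender": ["F", "M", "M"], "location": ["NYC", "BOS", "SFO"]}

matches = match_columns(lab_a, lab_b)
print(matches)
```

Value overlap finds the sex/gender match that name comparison would miss. The row-oriented direction applies the same kind of similarity scoring to pairs of records instead of pairs of columns.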
● 80% of clinical data today goes unused
● Clinical Data Warehouses capture legacy data
● Improved analytics = better trials, lower cost
Advanced Analytics, Better Clinical Trials
TAMR BUILDS LASTING VALUE
SAS
Faster Regulatory Filings
Better Clinical Analytics
Data Mining for New Indications
Dynamic, Integrated View of 15k Existing and New
Sources: Biopharma
Result
• Replaced 10+ man-years of human curation effort with Tamr
• Engaged 600 scientists in data quality ownership
Challenges
• $2B in research and silos of experimental results
• 15,000 sources of experimental results
• Hundreds of decentralized labs
• 1M+ rows with >100k attribute names
• Non-standardized attribute names & measurement units
• Manual curation prohibitively time- and cost-intensive
Solution
• Integrate data to find similar experiments
• Scale data curation to incorporate all sources at
reasonable cost
• Engage owners of data sources in improving quality of data
15k sources integrated into one view
Tamr Output
TACKLING THE ENTERPRISE DATA SILO PROBLEM
All are necessary but not sufficient to truly address next-gen challenges
● Democratized visualization and modeling - radical consumption heterogeneity
● Semantic Web / Linked Data - radical source heterogeneity
● Provenance for data to improve reliability
● Rapid iteration/change requires reproducibility from source
● Desire for longitudinal data across many entities
● Need for automated data quality / assurance
Traditional approaches...
● Standardization - worth trying
● Aggregation - yes - but actually makes the problem worse
● Top-down modeling (MDM/ETL) - ok for app-specific or well-defined data

Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data

  • 1. COMBINING HUMAN & MACHINE INTELLIGENCE TO SUCCESSFULLY INTEGRATE BIOMEDICAL DATA TIMOTHY DANFORD | TAMR, INC.
  • 2. THE DATA INTEGRATION PROBLEM ● flat files: every file has its own columns ● bioinformatics: every tool has its own file format ● graph data: RDF, OWL, “knowledge graphs” ● proprietary / legacy formats: SAS, DBF ● relational databases: inconsistent data models Biomedical Data Integration is a Constantly Moving Target
  • 3. THE DATA INTEGRATION PROBLEM ● One solution: hire or train data curators who understand the subject area ● Benefits: accuracy ● Problems o Low bandwidth o Difficult to scale to larger problems o Recording decisions o Consistency between curators Data Curation Teams Do Not Scale
  • 4. THE DATA INTEGRATION PROBLEM ● Build an automated or rules-based system to perform data integration ● Benefits: scale ● Problems o Accuracy, edge-cases o Programmers do not scale o Out-of-band communication o Expensive to maintain o Brittle in the face of new data Rule-based Integration Is Brittle
  • 5. TAMR AUTOMATES DATA INTEGRATION ● Solution: combine learning rules with asking experts ● Modern machine learning techniques o semi-supervised learning o active learning ● Benefits o speed of an automated system o accuracy of human experts o auditability o responds well to changing requirements Use Probabilistic Rules with Active Learning
  • 6. TAMR AUTOMATES DATA INTEGRATION ● Build a unified schema and link it to source attributes ● Engage subject matter experts to answer questions ● Automate data transformation ● Eliminate redundant records with de-duplication Tamr Combines Machine Learning and Expert Feedback
  • 7. CASE STUDY: CLINICAL STUDY DATA ● Clinical study data integration is motivated by a single schema: CDISC o mandated by FDA for data submission o common schema for clinical data warehouses ● Mostly performed by SAS scripting today ● Tamr learns attribute mapping and transformations using human feedback An Example: Clinical Study Data Integration
  • 9. THE BIOMEDICAL DATA INTEGRATION PROBLEM ● Fundamentally, many scientific analyses are tabular ● rows are ‘entities’ ● columns are ‘attributes’ ● graphs (paths) and hierarchies (part/whole) are other shapes ● tables emphasize independence of entities and attributes Tabular Datasets are a Core Data Shape
  • 10. THE BIOMEDICAL DATA INTEGRATION PROBLEM ● Column-oriented: Find the matching attributes ● Row-oriented: Discover duplicate entities Data Integration Proceeds In Two Directions
  • 11.
  • 12. ● 80% of clinical data today goes unused ● Clinical Data Warehouses capture legacy data ● Improved analytics = better trials, less $$ Advanced Analytics, Better Clinical Trials TAMR BUILDS LASTING VALUE SAS Faster Regulatory Filings Better Clinical Analytics Data Mining for New Indications
  • 13. Dynamic, Integrated View of 15k Existing and New Sources: Biopharma Result • Replaced 10+ man years of human curation effort with Tamr • Engage 600 Scientists in data quality ownership Challenges • $2B in research and silos of experimental results • 15,000 sources of experimental results • Hundreds of decentralized labs • 1M+ rows with >100k attribute names • Non-standardized attribute names & measurement units • Manual curation prohibitively time & cost intensive Solution • Integrate data to find similar experiments • Scaling data curation to incorporate all sources at reasonable cost • Engage owners of data sources in improving quality of data 15k sources integrated into one view Tamr Output
  • 14. TACKLING THE ENTERPRISE DATA SILO PROBLEM All are necessary but not sufficient to truly address next-gen challenges ● Democratized visualization and modeling - radical consumption heterogeneity ● SemanticWeb/LinkedData - radical source heterogeneity ● Provenance for data to improve reliability ● Rapid iteration/change requires reproducibility from source ● Desire for longitudinal data across many entities ● Need for automated data quality / assurance Traditional approaches... ● Standardization - worth trying ● Aggregation - yes - but actually makes the problem worse ● Top-down modeling (MDM/ETL) - ok for app-specific or well-defined data
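
The two directions named on slide 10 (column-oriented attribute matching and row-oriented duplicate discovery) can be illustrated with a minimal sketch. This is not Tamr's method: it assumes plain string similarity from Python's standard `difflib` in place of a learned model, and the function names, thresholds, and sample data are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] after light normalization (an assumed stand-in for a learned score)."""
    def norm(s: str) -> str:
        return s.lower().replace("_", " ").strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def match_attributes(source_cols, unified_cols, threshold=0.7):
    """Column-oriented: map each source column to its most similar unified attribute."""
    mapping = {}
    for col in source_cols:
        best = max(unified_cols, key=lambda u: similarity(col, u))
        # Columns with no sufficiently similar attribute stay unmapped.
        if similarity(col, best) >= threshold:
            mapping[col] = best
    return mapping


def find_duplicates(rows, key_fields, threshold=0.9):
    """Row-oriented: return index pairs of rows whose key fields all nearly match."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            # The weakest field decides, so every key field must agree.
            score = min(
                similarity(str(rows[i][f]), str(rows[j][f])) for f in key_fields
            )
            if score >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical example data.
columns = match_attributes(
    ["Subject_ID", "birth date", "zzz"], ["subject id", "birth date", "sex"]
)
compounds = [{"name": "Aspirin 100mg"}, {"name": "aspirin 100 mg"}, {"name": "Ibuprofen"}]
duplicates = find_duplicates(compounds, ["name"])
```

The pairwise loop in `find_duplicates` is quadratic in the number of rows, so it only suits small tables. In the approach the slides describe, pairs with borderline scores would presumably be the ones routed to subject matter experts rather than decided by a fixed threshold.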

Editor's Notes

  1. Key Messages: Today I’ll be speaking about how data variety, the natural, siloed nature of data as it’s created, is creating a bottleneck to analytics, and how deterministic data unification approaches alone aren’t sufficient to scale to the variety of hundreds or thousands of data silos found within the enterprise.
  2. What we won’t worry about today: incremental updates, data velocity scale
  3. What we won’t worry about today: incremental updates, data velocity scale
  4. What we won’t worry about today: incremental updates, data velocity scale
  5. What we won’t worry about today: incremental updates, data velocity scale
  6. Graph data: rows are nodes, columns are nodes or edges. Genomics: rows are genes, variants, ‘features’, and columns are position; or rows are people and columns are variants; or rows are people and columns are phenotypes; or rows are phenotypes and columns are variants (sort of a pivot version). Clinical study data: rows are people, or visits, or measurements, and columns are dates, observation codes, categories, names. Sometimes the data just *is* in spreadsheets! (At a large Swiss pharmaceutical company, every screening experiment was captured in a separate spreadsheet. “Which experiments were even run?”) A single insight that crosses data silos. Discovery that doesn’t “double count” evidence. Matching for causal inference.
  7. No single method can solve this problem! We need an iterative approach, that automates integration but is guided and corrected by human feedback.
  8. Looking to get an integrated view. Previously done with manual effort, which cannot be redone; need an automated system that works with humans to create a catalogue. Mapping to 80% accuracy. Opened discussion up across departments.
  9. This slide has animation. You need to click once. Traditional approaches, while necessary, are not alone sufficient to truly address next-gen data challenges. Democratized visualization and modeling - radical consumption heterogeneity: New visualization and modeling tools have helped democratize analytics, changing the ways in which business users across the enterprise want to consume data. Today, more users require access to high-quality data for varying analytics projects. How do rule-based approaches scale with more users consuming data in different ways? SemanticWeb/LinkedData - radical source heterogeneity: Extensions for structuring and understanding data on the web have introduced a radical new source of heterogeneous data, presenting challenges to traditional top-down data-integration approaches. If we already struggle with the scale of our own internal enterprise data, how do you leverage a source with the scale and variety of the web? Provenance for data to improve reliability: To be able to reproduce results and ensure data quality, you need to be able to understand how the data has been used and transformed over time. Understanding the inputs, entities, systems, and processes that influence data of interest in an automated, programmatic way can improve reliability. Rapid iteration/change requires reproducibility from source: Can you reproduce the same analysis and transformations from the source data, over time? Desire for longitudinal data across many entities: For many organizations, it’s important to understand how the relationships between a given set of entities have changed over time. For instance, understanding the relationships between a part, supplier, and product can lead to buying the highest quality part at the cheapest price, from the most reliable manufacturer.