COMBINING HUMAN & MACHINE INTELLIGENCE TO SUCCESSFULLY INTEGRATE BIOMEDICAL DATA
TIMOTHY DANFORD | TAMR, INC.
THE DATA INTEGRATION PROBLEM
● flat files: every file has its own columns
● bioinformatics: every tool has its own file format
● graph data: RDF, OWL, “knowledge graphs”
● proprietary / legacy formats: SAS, DBF
● relational databases: inconsistent data models
Biomedical Data Integration is a Constantly Moving Target
THE DATA INTEGRATION PROBLEM
● One solution: hire or train data curators who understand the subject area
● Benefits: accuracy
● Problems
o Low bandwidth
o Difficult to scale to larger problems
o Recording decisions
o Consistency between curators
Data Curation Teams Do Not Scale
THE DATA INTEGRATION PROBLEM
● Build an automated or rules-based system to perform data integration
● Benefits: scale
● Problems
o Accuracy, edge-cases
o Programmers do not scale
o Out-of-band communication
o Expensive to maintain
o Brittle in the face of new data
Rule-based Integration Is Brittle
TAMR AUTOMATES DATA INTEGRATION
● Solution: combine learning rules with asking experts
● Modern machine learning techniques
o semi-supervised learning
o active learning
● Benefits
o speed of an automated system
o accuracy of human experts
o auditability
o responds well to changing requirements
Use Probabilistic Rules with Active Learning
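One way to picture the combination of learned rules and expert questions is as a triage over match probabilities. The sketch below is illustrative only and is not Tamr's implementation: the function name `triage`, the thresholds 0.9 and 0.1, and the sample scores are all assumptions. Confident predictions are handled automatically, and the remaining candidates are queued for experts, most uncertain first (uncertainty sampling, a common active learning strategy).

```python
def triage(scores, accept=0.9, reject=0.1):
    """Split candidate matches by model confidence.

    `scores` maps a candidate id to a predicted match probability.
    The thresholds are hypothetical; a real system would tune them.
    """
    auto_accept = [c for c, p in scores.items() if p >= accept]
    auto_reject = [c for c, p in scores.items() if p <= reject]
    # Everything in between goes to a human expert, ordered so that
    # the candidates closest to 0.5 (most uncertain) are asked first.
    ask = [c for c, p in scores.items() if reject < p < accept]
    ask.sort(key=lambda c: abs(scores[c] - 0.5))
    return auto_accept, auto_reject, ask


# Made-up probabilities for five candidate matches.
scores = {"a": 0.97, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.80}
accepted, rejected, questions = triage(scores)
# accepted == ['a'], rejected == ['c'], questions == ['b', 'd', 'e']
```

Expert answers to the queued questions would then be fed back as labeled examples, so the number of questions can shrink as the model improves.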
TAMR AUTOMATES DATA INTEGRATION
● Build a unified schema and link it to source attributes
● Engage subject matter experts to answer questions
● Automate data transformation
● Eliminate redundant records with de-duplication
Tamr Combines Machine Learning and Expert Feedback
CASE STUDY: CLINICAL STUDY DATA
● Clinical study data integration is motivated by a single schema: CDISC
o mandated by FDA for data submission
o common schema for clinical data warehouses
● Mostly performed by SAS scripting today
● Tamr learns attribute mapping and transformations using human feedback
An Example: Clinical Study Data Integration
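As a rough illustration of attribute mapping with human feedback, the sketch below suggests a target attribute for each source column name and lets expert-confirmed answers override the suggestion. It is a simplification under stated assumptions, not how Tamr works: plain string similarity from Python's `difflib` stands in for a learned model, and the `TARGET_LABELS` table, the helper names, and the sample column names are hypothetical (the four target names are chosen to resemble CDISC-style demographics variables).

```python
from difflib import SequenceMatcher

# Hypothetical target schema: attribute name -> descriptive label.
TARGET_LABELS = {
    "USUBJID": "unique subject identifier",
    "SEX": "sex",
    "AGE": "age",
    "BRTHDTC": "date of birth",
}


def normalize(name):
    """Lower-case a column name and turn punctuation into single spaces."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    return " ".join(cleaned.split())


def suggest_mapping(source_name, confirmed):
    """Return (target attribute, confidence) for one source column.

    `confirmed` holds expert answers keyed by normalized source name;
    these always win over the similarity-based guess.
    """
    key = normalize(source_name)
    if key in confirmed:
        return confirmed[key], 1.0

    def similarity(target):
        return SequenceMatcher(None, key, TARGET_LABELS[target]).ratio()

    best = max(TARGET_LABELS, key=similarity)
    return best, similarity(best)


# An expert has already answered one question: "gender" means SEX.
expert_feedback = {"gender": "SEX"}
by_similarity = suggest_mapping("Date_of_Birth", expert_feedback)
by_feedback = suggest_mapping("Gender", expert_feedback)
```

Low-confidence suggestions are the natural candidates to send to a subject matter expert, whose answers then go into `expert_feedback`.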
Thank You
THE BIOMEDICAL DATA INTEGRATION PROBLEM
● Fundamentally, many scientific analyses are tabular
o rows are ‘entities’
o columns are ‘attributes’
● graphs (paths) and hierarchies (part/whole) are other shapes
● tables emphasize independence of entities and attributes
Tabular Datasets are a Core Data Shape
THE BIOMEDICAL DATA INTEGRATION PROBLEM
● Column-oriented: Find the matching attributes
● Row-oriented: Discover duplicate entities
Data Integration Proceeds In Two Directions
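The row-oriented direction can be sketched with a deliberately simple stand-in: group records whose normalized names collide. This is an assumption-laden toy, not the learned matching the talk describes; the record layout, the `entity_key` and `find_duplicates` helpers, and the sample rows are hypothetical, and exact-key grouping will miss misspellings that a probabilistic matcher could catch.

```python
from collections import defaultdict


def entity_key(record):
    """Build an order-insensitive key from the record's name field."""
    name = record["name"]
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    # Sorting tokens makes "Smith, John" and "john smith" collide.
    return " ".join(sorted(cleaned.split()))


def find_duplicates(records):
    """Return lists of record ids that share the same entity key."""
    groups = defaultdict(list)
    for record in records:
        groups[entity_key(record)].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up rows from two hypothetical sources.
records = [
    {"id": 1, "name": "Smith, John"},
    {"id": 2, "name": "john smith"},
    {"id": 3, "name": "Jane Doe"},
]
duplicates = find_duplicates(records)
# duplicates == [[1, 2]]
```

In practice a key like this is more useful for narrowing down candidate pairs than for making the final call, which is where model scores and expert review come in.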
TAMR BUILDS LASTING VALUE
● 80% of clinical data today goes unused
● Clinical Data Warehouses capture legacy data
● Improved analytics = better trials, lower cost
Advanced Analytics, Better Clinical Trials
[Diagram labels: SAS; Faster Regulatory Filings; Better Clinical Analytics; Data Mining for New Indications]
Dynamic, Integrated View of 15k Existing and New Sources: Biopharma
Result
• Replaced 10+ man-years of human curation effort with Tamr
• Engaged 600 scientists in data quality ownership
Challenges
• $2B in research and silos of experimental results
• 15,000 sources of experimental results
• Hundreds of decentralized labs
• 1M+ rows with >100k attribute names
• Non-standardized attribute names & measurement units
• Manual curation prohibitively time- and cost-intensive
Solution
• Integrate data to find similar experiments
• Scale data curation to incorporate all sources at reasonable cost
• Engage owners of data sources in improving quality of data
15k sources integrated into one view
Tamr Output
TACKLING THE ENTERPRISE DATA SILO PROBLEM
All are necessary but not sufficient to truly address next-gen challenges
● Democratized visualization and modeling - radical consumption heterogeneity
● Semantic Web / Linked Data - radical source heterogeneity
● Provenance for data to improve reliability
● Rapid iteration/change requires reproducibility from source
● Desire for longitudinal data across many entities
● Need for automated data quality / assurance
Traditional approaches...
● Standardization - worth trying
● Aggregation - yes - but actually makes the problem worse
● Top-down modeling (MDM/ETL) - ok for app-specific or well-defined data


Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data

  • 1. COMBINING HUMAN & MACHINE INTELLIGENCE TO SUCCESSFULLY INTEGRATE BIOMEDICAL DATA TIMOTHY DANFORD | TAMR, INC.
  • 2. THE DATA INTEGRATION PROBLEM ● flat files: every file has its own columns ● bioinformatics: every tool has its own file format ● graph data: RDF, OWL, “knowledge graphs” ● proprietary / legacy formats: SAS, DBF ● relational databases: inconsistent data models Biomedical Data Integration is a Constantly Moving Target
  • 3. THE DATA INTEGRATION PROBLEM ● One solution: hire or train data curators who understand the subject area ● Benefits: accuracy ● Problems o Low bandwidth o Difficult to scale to larger problems o Recording decisions o Consistency between curators Data Curation Teams Do Not Scale
  • 4. THE DATA INTEGRATION PROBLEM ● Build an automated or rules-based system to perform data integration ● Benefits: scale ● Problems o Accuracy, edge-cases o Programmers do not scale o Out-of-band communication o Expensive to maintain o Brittle in the face of new data Rule-based Integration Is Brittle
  • 5. TAMR AUTOMATES DATA INTEGRATION ● Solution: combine learning rules with asking experts ● Modern machine learning techniques o semi-supervised learning o active learning ● Benefits o speed of an automated system o accuracy of human experts o auditability o responds well to changing requirements Use Probabilistic Rules with Active Learning
  • 6. TAMR AUTOMATES DATA INTEGRATION ● Build a unified schema and link it to source attributes ● Engage subject matter experts to answer questions ● Automate data transformation ● Eliminate redundant records with de- duplication Tamr Combines Machine Learning and Expert Feedback
  • 7. CASE STUDY: CLINICAL STUDY DATA ● Clinical study data integration is motivated by a single schema: CDISC o mandated by FDA for data submission o common schema for clinical data warehouses ● Mostly performed by SAS scripting today ● Tamr learns attribute mapping and transformations using human feedback An Example: Clinical Study Data Integration
  • 9. THE BIOMEDICAL DATA INTEGRATION PROBLEM Fundamentally, many scientific analyses are tabular rows are ‘entities’ columns are ‘attributes’ graphs (paths) and hierarchies (part/whole) are other shapes tables emphasize independence of entities and attributes Tabular Datasets are a Core Data Shape
  • 10. THE BIOMEDICAL DATA INTEGRATION PROBLEM ● Column-oriented: Find the matching attributes ● Row-oriented: Discover duplicate entities Data Integration Proceeds In Two Directions
  • 11.
  • 12. ● 80% of clinical data today goes unused ● Clinical Data Warehouses capture legacy data ● Improved analytics = better trials, less $$ Advanced Analytics, Better Clinical Trials TAMR BUILDS LASTING VALUE SAS Faster Regulatory Filings Better Clinical Analytics Data Mining for New Indications
  • 13. Dynamic, Integrated View of 15k Existing and New Sources: Biopharma Result • Replaced 10+ man years of human curation effort with Tamr • Engage 600 Scientists in data quality ownership Challenges • $2B in research and silos of experimental results • 15,000 sources of experimental results • Hundreds of decentralized labs • 1M+ rows with >100k attribute names • Non-standardized attribute names & measurement units • Manual curation prohibitively time & cost intensive Solution • Integrate data to find similar experiments • Scaling data curation to incorporate all sources at reasonable cost • Engage owners of data sources in improving quality of data 15k sources integrated into one view Tamr Output
  • 14. TACKLING THE ENTERPRISE DATA SILO PROBLEM Traditional approaches... ● Standardization - worth trying ● Aggregation - yes - but actually makes the problem worse ● Top-down modeling (MDM/ETL) - ok for app-specific or well-defined data All are necessary but not sufficient to truly address next-gen challenges ● Democratized visualization and modeling - radical consumption heterogeneity ● SemanticWeb/LinkedData - radical source heterogeneity ● Provenance for data to improve reliability ● Rapid iteration/change requires reproducibility from source ● Desire for longitudinal data across many entities ● Need for automated data quality / assurance
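
Slide 10 names two directions of integration: column-oriented (find the matching attributes) and row-oriented (discover duplicate entities). The sketch below illustrates both with plain string similarity from Python's standard library. It is not Tamr's method, which the slides describe only as machine learning combined with expert feedback. The function names, thresholds, and sample column names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_attributes(source_cols, target_cols, threshold=0.6):
    """Column-oriented step: map each source column to its closest target attribute."""
    mapping = {}
    for col in source_cols:
        best = max(target_cols, key=lambda t: similarity(col, t))
        # Leave the column unmapped when even the best candidate is a weak match.
        if similarity(col, best) >= threshold:
            mapping[col] = best
    return mapping


def find_duplicates(rows, key_fields, threshold=0.85):
    """Row-oriented step: return index pairs of rows whose key fields look alike."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            a = " ".join(str(rows[i][f]) for f in key_fields)
            b = " ".join(str(rows[j][f]) for f in key_fields)
            if similarity(a, b) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical source columns mapped onto a hypothetical unified schema.
mapping = match_attributes(
    ["Subject_ID", "visit date", "xyz"],
    ["subject_id", "visit_date", "dose"],
)

# Hypothetical records; the first two differ only in case and spacing.
rows = [
    {"name": "Aspirin 100mg"},
    {"name": "aspirin 100 mg"},
    {"name": "Ibuprofen"},
]
duplicates = find_duplicates(rows, ["name"])
```

The pairwise comparison in `find_duplicates` is quadratic in the number of rows, so it only suits small tables; at the scale on slide 13 (1M+ rows) some form of blocking or indexing would be needed first.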

Editor's Notes

  1. Key Messages: Today I’ll be speaking about how data variety, the natural, siloed nature of data as it’s created, is creating a bottleneck to analytics, and how deterministic data unification approaches are not sufficient on their own to scale to the variety of hundreds or thousands of data silos found within the enterprise.
  2. What we won’t worry about today: incremental updates, data velocity scale
  3. What we won’t worry about today: incremental updates, data velocity scale
  4. What we won’t worry about today: incremental updates, data velocity scale
  5. What we won’t worry about today: incremental updates, data velocity scale
  6. Graph data: rows are nodes; columns are nodes or edges. Genomics: rows are genes, variants, or ‘features’ and columns are positions; or rows are people and columns are variants; or rows are people and columns are phenotypes; or rows are phenotypes and columns are variants (sort of a pivot version). Clinical study data: rows are people, or visits, or measurements, and columns are dates, observation codes, categories, names. Sometimes the data just *is* in spreadsheets! (At a large Swiss pharmaceutical company, every screening experiment was captured in a separate spreadsheet. “Which experiments were even run?”) A single insight that crosses data silos. Discovery that doesn’t “double count” evidence. Matching for causal inference.
  7. No single method can solve this problem! We need an iterative approach that automates integration but is guided and corrected by human feedback.
  8. Looking to get an integrated view; previously done with manual effort that cannot be redone. Need an automated system that works with humans to create a catalogue. Mapping to 80% accuracy. Opened discussion up across departments.
  9. This slide has animation. You need to click once. Traditional approaches, while necessary, are not sufficient on their own to truly address next-gen data challenges. Democratized visualization and modeling (radical consumption heterogeneity): New visualization and modeling tools have helped democratize analytics, changing the ways in which business users across the enterprise want to consume data. Today, more users require access to high-quality data for varying analytics projects. How do rule-based approaches scale with more users consuming data in different ways? SemanticWeb/LinkedData (radical source heterogeneity): Extensions for structuring and understanding data on the web have introduced a radical new source of heterogeneous data, presenting challenges to traditional top-down data integration approaches. If we already struggle with the scale of our own internal enterprise data, how do we leverage a source with the scale and variety of the web? Provenance for data to improve reliability: To be able to reproduce results and ensure data quality, you need to be able to understand how the data has been used and transformed over time. Understanding the inputs, entities, systems, and processes that influence data of interest in an automated, programmatic way can improve reliability. Rapid iteration/change requires reproducibility from source: Can you reproduce the same analysis and transformations from the source data, over time? Desire for longitudinal data across many entities: For many organizations, it’s important to understand how the relationships between a given set of entities have changed over time. For instance, understanding the relationships between a part, supplier, and product can lead to buying the highest quality part at the cheapest price, from the most reliable manufacturer.
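
Note 7 calls for an iterative approach that automates integration but is guided and corrected by human feedback. One minimal way to picture a single round of that loop is confidence-based triage: suggestions the model is sure about are applied automatically, and the uncertain ones are queued for experts, most uncertain first. This is a simplified sketch in the spirit of the active learning mentioned on the slides, not Tamr's implementation. The thresholds, the `triage` function, and the sample scores are assumptions.

```python
def triage(suggestions, accept_at=0.9, reject_at=0.3):
    """Split scored suggestions into auto-accepted, auto-rejected, and expert queues.

    `suggestions` is a list of (item, score) pairs, where score is the model's
    estimated probability that the suggested match is correct.
    """
    accepted, rejected, uncertain = [], [], []
    for item, score in suggestions:
        if score >= accept_at:
            accepted.append(item)
        elif score <= reject_at:
            rejected.append(item)
        else:
            uncertain.append((item, score))
    # Ask about the least certain suggestions first (scores closest to 0.5).
    uncertain.sort(key=lambda pair: abs(pair[1] - 0.5))
    ask_expert = [item for item, _ in uncertain]
    return accepted, rejected, ask_expert


# Hypothetical (source column, target attribute) suggestions with model scores.
suggestions = [
    (("SUBJID", "subject_id"), 0.97),
    (("DT", "visit_date"), 0.55),
    (("VAL", "dose"), 0.40),
    (("X1", "sex"), 0.10),
]
accepted, rejected, ask_expert = triage(suggestions)
```

In a full loop, the experts' answers would be stored as labeled examples and used to retrain the model before the next round, which is also what makes the decisions auditable.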