1st edition
November 4-5, 2018
Machine Learning School in Doha
QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 2
Cutting-edge Research in Data Curation
Data Discovery, Data Integration, and Data Cleaning
Mourad Ouzzani
Principal Scientist, QCRI, HBKU
Data Discovery
The Data Discovery Problem
[Figure: data scattered across Finance, Sales, and Tech in databases, logs, reports…]
• How do I find relevant data?
• Am I missing important data?
Declare What You Want!
Employee Id   Name    Gender   Department
1001          John    Male     Finance
1002          Mary    Female   Tech
1003          Susan   Female   Finance

$> find_schema_with("department", "gender", "employee")
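The declarative query above can be illustrated with a minimal sketch. The slide only shows the call, so everything else here is an assumption for illustration: the in-memory CATALOG, its table names, and the substring-matching rule are hypothetical and are not the actual discovery system.

```python
from typing import Dict, List

# Hypothetical catalog (table name -> column names); not from the slides.
CATALOG: Dict[str, List[str]] = {
    "hr.employees": ["Employee Id", "Name", "Gender", "Department"],
    "sales.orders": ["Order Id", "Customer", "Amount"],
    "tech.assets": ["Asset Id", "Department", "Owner"],
}


def find_schema_with(*keywords: str, catalog: Dict[str, List[str]] = CATALOG) -> List[str]:
    """Return the tables in which every keyword matches at least one column name."""
    matches = []
    for table, columns in catalog.items():
        lowered = [column.lower() for column in columns]
        # A keyword matches when it is a substring of some column name.
        if all(any(k.lower() in column for column in lowered) for k in keywords):
            matches.append(table)
    return matches


result = find_schema_with("department", "gender", "employee")
print(result)
```

A real system would match on profiled content and semantics rather than on column-name substrings alone.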
Data Discovery System
• Profile the data and build the Enterprise Knowledge Graph (EKG)
• Enrich the EKG by exposing semantic relations using reference data
• APIs to query the EKG
  o similarTables(t: table) = schemaSim(t) AND contentSim(t)
  o joinPath(src: table, tgt: table) = paths_between(src, tgt, Relation.PKFK)
[Figure: example EKG linking tables and columns from DrugCentral and Chembl_22 (Drug, Drug Indication, Target dictionary, Variant Sequences, Protein, Chemical Compound) to the Experimental Factor Ontology and an internal ontology, with relations such as interacts_with]
https://www.csail.mit.edu/research/aurum-large-scale-data-discovery
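As a rough sketch of how a joinPath-style query could work, the EKG can be treated as a graph whose edges are typed by relation, with a search restricted to PK/FK edges. The toy graph, the extra relation names, and the choice to return a single shortest path via breadth-first search are assumptions made for illustration; the slide does not specify how paths_between is implemented.

```python
from collections import deque
from enum import Enum


class Relation(Enum):
    PKFK = "pkfk"
    SCHEMA_SIM = "schema_sim"   # hypothetical extra relation type
    CONTENT_SIM = "content_sim"  # hypothetical extra relation type


# Hypothetical EKG: table -> list of (neighbor table, relation) edges.
EKG = {
    "drug": [("drug_indication", Relation.PKFK), ("compound", Relation.SCHEMA_SIM)],
    "drug_indication": [("drug", Relation.PKFK), ("target", Relation.PKFK)],
    "target": [("drug_indication", Relation.PKFK)],
    "compound": [("drug", Relation.SCHEMA_SIM)],
}


def paths_between(src, tgt, relation, graph=EKG):
    """Return one shortest path from src to tgt using only edges of `relation`, or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == tgt:
            return path
        for neighbor, rel in graph.get(node, []):
            if rel is relation and neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def join_path(src, tgt):
    """joinPath(src, tgt) as written on the slide: paths restricted to PK/FK edges."""
    return paths_between(src, tgt, Relation.PKFK)


print(join_path("drug", "target"))
```

Here "compound" is reachable from "drug" only through a schema-similarity edge, so join_path("drug", "compound") finds no join path.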
Coherent Groups
• Semantic signatures built out of many words
• A coherent group roughly indicates a concept: its words fall in the same semantic space
• The all-pairs similarity of the words in the group is above a threshold
• Word embeddings are used to capture "semantic" similarity
[Figure: a pair of related schema elements vs. a pair of unrelated schema elements]
• Coherency factor of a set of vectors X: the average of the all-pairs similarities of the elements in X
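The coherency factor can be sketched in a few lines. This is a minimal illustration that assumes cosine similarity as the pairwise measure and uses hand-made 2-d vectors in place of real word embeddings; the threshold of 0.5 is likewise an arbitrary placeholder.

```python
import itertools

import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def coherency_factor(vectors):
    """Average of the all-pairs similarities of the given vectors."""
    pairs = list(itertools.combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def is_coherent(vectors, threshold=0.5):
    """A group is treated as coherent when its coherency factor exceeds the threshold."""
    return coherency_factor(vectors) > threshold


# Toy vectors (assumptions): the first group points in nearly the same direction.
related = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]])
unrelated = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

print(coherency_factor(related), is_coherent(related))
print(coherency_factor(unrelated), is_coherent(unrelated))
```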
Discovering Disguised Missing Values
The Problem
Disguised missing values are values that stand in for missing values
FAHES Architecture
https://github.com/daqcri/fahes
Entity Resolution
The Entity Resolution Problem
Table R
Name | Address | Email | Nation | Gender
Catherine Zeta-Jones | 9601 Wilshire Blvd., Beverly Hills, CA 90210-5213 | c.jones@gmail.com | Wales | F
C. Zeta Jones | 3rd Floor, Beverly Hills, CA 90210 | c.jones@gmail.com | US | F
Michael Jordan | 676 North Michigan Avenue, Suite 293, Chicago | | US | M
Bob Dylan | 1230 Avenue of the Americas, NY 10020 | | US | M

Table S
Name | Apt | Email | Country | Sex
Catherine Zeta-Jones | 9601 Wilshire, 3rd Floor, Beverly Hills, CA 90210 | c.jones@gmail.com | Wales | F
B. Dylan | 1230 Avenue of the Americas, NY 10020 | bob.dylan@gmail.com | US | M
Michael Jordan | 427 Evans Hall #3860, Berkeley, CA 94720 | jordan@cs.berkeley.edu | US | M
Typical Entity Resolution Pipeline
DeepER Solution
• Feature Engineering
  – Automatic feature engineering that can handle syntactic and semantic similarities
• Blocking
  – Automated and customizable blocking method with a holistic view of all attributes
• Labeling Effort
  – Much less labeled data is needed by considering prior knowledge
➢ Key Idea: Use distributed representations (of tuples), a fundamental concept in deep learning (DL)
Distributed Representations of Words
• DRs (aka word embeddings) are learned from the data
• Semantically related words are often close to each other
  – Their geometric relationship encodes a semantic relationship
• Map each word into a high-dimensional vector with a fixed dimension d, e.g., 300 for GloVe
• Each word → a distribution of weights (+/-) across d dimensions
king – man + woman ≈ queen
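The analogy can be reproduced on a toy scale. The 3-d vectors below are hand-made for illustration (their axes loosely stand for royalty, male, and female) and are not real GloVe or word2vec embeddings; with learned embeddings the result is only approximately "queen".

```python
import numpy as np

# Toy 3-d word vectors (assumptions, not learned embeddings).
TOY_DR = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.8, 1.0, 0.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest(query, exclude):
    """Return the vocabulary word closest to `query`, skipping the words in `exclude`."""
    best_word, best_sim = None, -2.0
    for word, vector in TOY_DR.items():
        if word in exclude:
            continue
        sim = cosine(query, vector)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


query = TOY_DR["king"] - TOY_DR["man"] + TOY_DR["woman"]
answer = nearest(query, exclude={"king", "man", "woman"})
print(answer)
```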
From DRs of Words to DRs of Tuples

     Name            City
t1   Bill Gates      Seattle
t2   William Gates   Seattle

Word      DR for words
Bill      [0.4, 0.8, 0.9]
William   [0.3, 0.9, 0.7]
Gates     [0.5, 0.8, 0.8]
Seattle   [0.1, 0.1, 0.2]

Tuple   DR for tuples
t1      [?, ?, ?, ?]
t2      [?, ?, ?, ?]

1. Simple Approach – Averaging
   – Ignores word order
   – DR(bill gates) = 0.5 * (DR(bill) + DR(gates))
   – Simple to train
2. Compositional Approach – RNN with LSTM
   – Takes word and attribute order into account
   – Uses a NN to semantically compose the word vectors into an attribute-level vector
https://github.com/daqcri/deeper-lite
https://github.com/daqcri/DeepER
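The averaging approach can be worked through with the word vectors from the slide. This is a minimal sketch, not the DeepER code: the helper names are hypothetical, and comparing two tuples by per-attribute cosine similarity is an assumption about how the attribute-level vectors would be used.

```python
import numpy as np

# Word DRs taken from the slide.
WORD_DR = {
    "bill": np.array([0.4, 0.8, 0.9]),
    "william": np.array([0.3, 0.9, 0.7]),
    "gates": np.array([0.5, 0.8, 0.8]),
    "seattle": np.array([0.1, 0.1, 0.2]),
}
DIM = 3


def attribute_dr(text):
    """Average the DRs of the known words in one attribute value (word order is ignored)."""
    vectors = [WORD_DR[w] for w in text.lower().split() if w in WORD_DR]
    if not vectors:
        return np.zeros(DIM)
    return np.mean(vectors, axis=0)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def similarity_vector(t1, t2):
    """One cosine similarity per attribute shared by the two tuples (an assumed use)."""
    return {a: cosine(attribute_dr(t1[a]), attribute_dr(t2[a])) for a in t1 if a in t2}


t1 = {"Name": "Bill Gates", "City": "Seattle"}
t2 = {"Name": "William Gates", "City": "Seattle"}

print(attribute_dr("Bill Gates"))   # 0.5 * (DR(bill) + DR(gates))
print(similarity_vector(t1, t2))
```

Because "Bill" and "William" have nearby vectors, the Name similarity stays high even though the strings differ.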
Transfer Learning for ER
The TL Problem
Given a target dataset D_T on which we need to do ER, where D_T has limited or no training data, is it possible to train a good ML classifier for D_T by reusing and adapting training data from a related dataset D_S?
[Figure: the feature space of D_S and the feature space of D_T, related by feature truncation or feature standardization]
Feature Standardization based on DRs
Advantages:
1. Reuse of ML classifiers
2. Encodes semantic similarity and yields a fine-grained similarity computed holistically
3. Pools training data from multiple source datasets
4. Minimizes domain expert effort in identifying appropriate features, similarity functions …
5. Reuses popular DRs such as Word2vec, GloVe, and FastText
• Feature space truncation – use only the common attributes
• Feature space standardization – tuples from each relation are all encoded into a standard feature space of fixed dimension
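The two options can be contrasted in a short sketch that reuses the schemas of relations R and S from the entity resolution example. The toy_embedding function is a deterministic stand-in for a pretrained DR such as GloVe, and the dimension 8 and the helper names are assumptions for illustration only.

```python
import zlib

import numpy as np

DIM = 8  # assumed size of the standard feature space


def truncate(source_schema, target_schema):
    """Feature space truncation: keep only the attributes common to both schemas."""
    target = set(target_schema)
    return [attribute for attribute in source_schema if attribute in target]


def toy_embedding(word, dim=DIM):
    """Stand-in for a pretrained word vector: a pseudo-random vector seeded by the word."""
    rng = np.random.default_rng(zlib.crc32(word.encode("utf-8")))
    return rng.standard_normal(dim)


def standardize(record, dim=DIM):
    """Feature space standardization: encode any tuple, whatever its schema, into dim numbers."""
    words = [w for value in record.values() for w in str(value).lower().split()]
    if not words:
        return np.zeros(dim)
    return np.mean([toy_embedding(w, dim) for w in words], axis=0)


schema_r = ["Name", "Address", "Email", "Nation", "Gender"]
schema_s = ["Name", "Apt", "Email", "Country", "Sex"]
common = truncate(schema_r, schema_s)
print(common)

r_tuple = {"Name": "Bob Dylan", "Address": "1230 Avenue of the Americas, NY 10020", "Nation": "US"}
s_tuple = {"Name": "B. Dylan", "Apt": "1230 Avenue of the Americas, NY 10020", "Country": "US"}
print(standardize(r_tuple).shape, standardize(s_tuple).shape)
```

Truncation discards Address/Apt, Nation/Country, and Gender/Sex because their names differ, while standardization keeps their content by mapping both tuples into the same vector space.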
Our Solution …

Training data (Source, Target) | Method | Description
Adequate, Nothing | Unsupervised domain adaptation | Use weighted paradigm where different similarity vectors have different weights based on their fidelity to D_T
Adequate, Limited | Feature augmentation | Learn parameters jointly when appropriate and learn individually otherwise
Limited, Limited | Semi-supervised domain adaptation | Use both unlabeled and labeled data
Adequate, Adequate | Easy – any of the above algorithms

Algorithms that
1. successfully address ER-specific challenges such as imbalanced data, diverse schemata, and varying vocabulary,
2. are capable of leveraging key ER properties such as similarity vectors as features and monotonicity of precision,
3. work on classifiers that are widely used in ER,
4. are dataset and domain agnostic, and
5. allow seamless transfer from multiple source datasets.
Scenario (Adequate, Nothing)
Scenario (Adequate, Limited)
Feature Augmentation – A similarity vector x of dimension d is transformed into a similarity vector ɸ of dimension 3×d by duplicating each feature in x in a manner that is different for D_S and D_T
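One common way to realize such a 3×d mapping is to give every feature a shared copy, a source-only copy, and a target-only copy. The slide does not spell out the exact layout, so the ordering below (shared, source-specific, target-specific) is an assumption borrowed from the standard feature-augmentation scheme for domain adaptation.

```python
import numpy as np


def augment(x, domain):
    """Map a d-dimensional similarity vector to 3*d dimensions.

    Assumed layout: [shared | source-specific | target-specific].
    """
    x = np.asarray(x, dtype=float)
    zeros = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, zeros])
    if domain == "target":
        return np.concatenate([x, zeros, x])
    raise ValueError("domain must be 'source' or 'target'")


x = [0.9, 0.2]  # toy similarity vector with d = 2
phi_source = augment(x, "source")
phi_target = augment(x, "target")
print(phi_source)
print(phi_target)
```

A single classifier trained on the augmented vectors can then learn shared weights from both datasets and dataset-specific weights from each one separately.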
Scenario (Limited, Limited)
Feature Augmentation – As for (Adequate, Limited)
Data Augmentation – Create 2 copies of each x in D_T^U, the unlabeled part of D_T (one labeled duplicate, one labeled non-duplicate).
Ensure that the weights learned for the transformed dataset also agree on the unlabeled data.
Protein Structure Prediction
Will my protein crystallize or not, given the sequence of the protein?
● Answering this question is important for understanding protein function and designing drugs
● Current Approach
  ● Protein structure determination using X-ray crystallography
  ● High attrition rate; trial-and-error settings increase production cost
● New ML Methods
  ● Use sequence, biochemical, and structure features, mostly with SVM or RF classifiers
➢ DeepCrystal, a CNN-based deep learning framework, exploits frequent k-mers (subsequences of k amino acid residues) and groups of k-mers using the raw protein sequences only
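To make the notion of a k-mer concrete, the sketch below extracts and counts k-mers from raw sequences. It only illustrates what a frequent k-mer is; it is not the DeepCrystal model, and the two short sequences are made-up examples rather than real proteins.

```python
from collections import Counter


def kmers(sequence, k):
    """All overlapping subsequences of k residues, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def frequent_kmers(sequences, k, top_n):
    """The top_n most frequent k-mers across a collection of sequences."""
    counts = Counter(kmer for seq in sequences for kmer in kmers(seq, k))
    return counts.most_common(top_n)


# Made-up toy sequences over the amino acid alphabet (assumptions).
sequences = ["MKTAY", "AKTAG"]
print(kmers("MKTAY", 3))
print(frequent_kmers(sequences, 3, 1))
```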
Architecture for the DeepCrystal model
https://deeplearning-protein.qcri.org/index.html
References
● Data Discovery
● http://da.qcri.org/ntang/pubs/icde2018semantic.pdf
● Entity Resolution
● http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
● Transfer Learning
● https://arxiv.org/abs/1809.11084
● Disguised Missing Values
● http://da.qcri.org/ntang/pubs/kdd18.pdf
● Protein Structure Prediction
● https://deeplearning-protein.qcri.org/index.html
MLSD18. Machine Learning Research at QCRI

More Related Content

What's hot

What's hot (20)

MLSD18. Supervised Summary
MLSD18. Supervised SummaryMLSD18. Supervised Summary
MLSD18. Supervised Summary
 
MLSD18. Summary of Morning Sessions
MLSD18. Summary of Morning SessionsMLSD18. Summary of Morning Sessions
MLSD18. Summary of Morning Sessions
 
MLSD18. Supervised Workshop
MLSD18. Supervised WorkshopMLSD18. Supervised Workshop
MLSD18. Supervised Workshop
 
MLSD18. Basic Transformations - BigML
MLSD18. Basic Transformations - BigMLMLSD18. Basic Transformations - BigML
MLSD18. Basic Transformations - BigML
 
MLSD18. Basic Transformations - QCRI
MLSD18. Basic Transformations - QCRIMLSD18. Basic Transformations - QCRI
MLSD18. Basic Transformations - QCRI
 
MLSD18. Feature Engineering
MLSD18. Feature EngineeringMLSD18. Feature Engineering
MLSD18. Feature Engineering
 
MLSD18. Automating Machine Learning Workflows
MLSD18. Automating Machine Learning WorkflowsMLSD18. Automating Machine Learning Workflows
MLSD18. Automating Machine Learning Workflows
 
BigML Summer 2017 Release
BigML Summer 2017 ReleaseBigML Summer 2017 Release
BigML Summer 2017 Release
 
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 DatasetGraph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
 
TigerGraph.js
TigerGraph.jsTigerGraph.js
TigerGraph.js
 
FROM DATAFRAMES TO GRAPH Data Science with pyTigerGraph
FROM DATAFRAMES TO GRAPH Data Science with pyTigerGraphFROM DATAFRAMES TO GRAPH Data Science with pyTigerGraph
FROM DATAFRAMES TO GRAPH Data Science with pyTigerGraph
 
VSSML17 Review. Summary Day 2 Sessions
VSSML17 Review. Summary Day 2 SessionsVSSML17 Review. Summary Day 2 Sessions
VSSML17 Review. Summary Day 2 Sessions
 
BSSML17 - API and WhizzML
BSSML17 - API and WhizzMLBSSML17 - API and WhizzML
BSSML17 - API and WhizzML
 
VSSML18. Feature Engineering
VSSML18. Feature EngineeringVSSML18. Feature Engineering
VSSML18. Feature Engineering
 
BigML Release: PCA
BigML Release: PCABigML Release: PCA
BigML Release: PCA
 
Connected datalondon metadata-driven apps
Connected datalondon metadata-driven appsConnected datalondon metadata-driven apps
Connected datalondon metadata-driven apps
 
SHACL-based data life cycle management
SHACL-based data life cycle managementSHACL-based data life cycle management
SHACL-based data life cycle management
 
Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
 
Building, and communicating, a knowledge graph in Zalando
Building, and communicating, a knowledge graph in ZalandoBuilding, and communicating, a knowledge graph in Zalando
Building, and communicating, a knowledge graph in Zalando
 

Similar to MLSD18. Machine Learning Research at QCRI

The five graphs of telecommunications may 22 2013 webinar final
The five graphs of telecommunications may 22 2013 webinar finalThe five graphs of telecommunications may 22 2013 webinar final
The five graphs of telecommunications may 22 2013 webinar final
Neo4j
 
The five graphs of telecommunications may 22 2013 webinar final
The five graphs of telecommunications may 22 2013 webinar finalThe five graphs of telecommunications may 22 2013 webinar final
The five graphs of telecommunications may 22 2013 webinar final
Neo4j
 

Similar to MLSD18. Machine Learning Research at QCRI (20)

Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
MLSEV. Automating Decision Making
MLSEV. Automating Decision MakingMLSEV. Automating Decision Making
MLSEV. Automating Decision Making
 
Main principles of Data Science and Machine Learning
Main principles of Data Science and Machine LearningMain principles of Data Science and Machine Learning
Main principles of Data Science and Machine Learning
 
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with TableauWebinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
 
Docker Summit MongoDB - Data Democratization
Docker Summit MongoDB - Data Democratization Docker Summit MongoDB - Data Democratization
Docker Summit MongoDB - Data Democratization
 
ADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture MaturityADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture Maturity
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Chengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big dataChengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big data
 
MLSEV. Use Case: The Data-Driven Factory
MLSEV. Use Case: The Data-Driven FactoryMLSEV. Use Case: The Data-Driven Factory
MLSEV. Use Case: The Data-Driven Factory
 
[DSC Europe 23] Djordje Grozdic - Transforming Business Process Automation wi...
[DSC Europe 23] Djordje Grozdic - Transforming Business Process Automation wi...[DSC Europe 23] Djordje Grozdic - Transforming Business Process Automation wi...
[DSC Europe 23] Djordje Grozdic - Transforming Business Process Automation wi...
 
Building a Winning Data Engineering Culture
Building a Winning Data Engineering CultureBuilding a Winning Data Engineering Culture
Building a Winning Data Engineering Culture
 
Integrating Web and Business Data
Integrating Web and Business DataIntegrating Web and Business Data
Integrating Web and Business Data
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
 
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDeploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
 
Webinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDBWebinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDB
 
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningRisk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
 
The five graphs of telecommunications may 22 2013 webinar final
The five graphs of telecommunications may 22 2013 webinar finalThe five graphs of telecommunications may 22 2013 webinar final
The five graphs of telecommunications may 22 2013 webinar final
 
The five graphs of telecommunications may 22 2013 webinar final
The five graphs of telecommunications may 22 2013 webinar finalThe five graphs of telecommunications may 22 2013 webinar final
The five graphs of telecommunications may 22 2013 webinar final
 
Integrating Web and Business Data
Integrating Web and Business DataIntegrating Web and Business Data
Integrating Web and Business Data
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQL
 

More from BigML, Inc

More from BigML, Inc (20)

Digital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in ManufacturingDigital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in Manufacturing
 
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationDutchMLSchool 2022 - Automation
DutchMLSchool 2022 - Automation
 
DutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML ComplianceDutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML Compliance
 
DutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective AnomaliesDutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective Anomalies
 
DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector
 
DutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly DetectionDutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly Detection
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
 
DutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End MLDutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End ML
 
DutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven CompanyDutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven Company
 
DutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal SectorDutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal Sector
 
DutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe StadiumsDutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe Stadiums
 
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing PlantsDutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
 
DutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at ScaleDutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at Scale
 
DutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AIDutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AI
 
Democratizing Object Detection
Democratizing Object DetectionDemocratizing Object Detection
Democratizing Object Detection
 
BigML Release: Image Processing
BigML Release: Image ProcessingBigML Release: Image Processing
BigML Release: Image Processing
 
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your FutureMachine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
 
Machine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail SectorMachine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail Sector
 
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
 
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
 

Recently uploaded

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
ptikerjasaptiker
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 

Recently uploaded (20)

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 

MLSD18. Machine Learning Research at QCRI

  • 1. 1st edition November 4-5, 2018 Machine Learning School in Doha
  • 2. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 2 Cutting-edge Research in the Data curation Data Discovery, Data Integration, and Data Cleaning Mourad Ouzzani Principal Scientist, QCRI, HBKU
  • 3. · @bigmlcom · @QatarComputing · #MLSD18 ·QCRI 3 Data Discovery
  • 4. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 4 The Data Discovery Problem Finance Sales Databases, logs, reports… Tech • How do I find relevant data? • Am I missing important data?
  • 5. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 5 Declare What You Want! Sales Databases, logs, reports…Tech Employee Id Name Gender Department 1001 John Male Finance 1002 Mary Female Tech 1003 Susan Female Finance $> find_schema_with(“department”, “gender”, “employee”)
  • 6. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 6 Data Discovery System • Profile the data and build the Enterprise Knowledge Graph (EKG) • Enrich the EKG by exposing
 semantic relations using
 reference data • APIs to query the EKG o similarTables(t: table) = schemaSim(t) AND contentSim(t) o joinPath(src: table, tgt: table) = paths_between(src, tgt, Relation.PKFK) APIs SINGLE3727 protein_typeid polymerase 4 name 1 isoform Q2HRB6100 accessionvariant_id M197L mutation Chemical Compound Protein drug DrugCentral Chembl_22 Target dictionary Variant Sequences Experimental Factor Ontology Internal Ontology record_id cd_id name 32 molregnodrug_id Drug Indication Drug Target interacts_with https://www.csail.mit.edu/research/aurum-large-scale-data-discovery
  • 7. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 7 Coherent Groups • Coherent groups – Semantic signatures out of many words – A coherent group indicates roughly a concept – words fall in the same semantic space – All-pairs similarity of words in the group is beyond a threshold – Use word embeddings to capture “semantic” similarity Pair of schema elements are related Pair of schema elements are unrelated Coherency Factor of a set of vectors X Average of all-pairs similarities of elements in X
  • 8. · @bigmlcom · @QatarComputing · #MLSD18 ·QCRI 8 Discovering Disguised Missing Values
  • 9. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 9 The Problem Values that replace missing values
  • 10. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 10 FAHES Architecture https://github.com/daqcri/fahes
  • 11. · @bigmlcom · @QatarComputing · #MLSD18 ·QCRI 11 Entity Resolution
  • 12. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 12 The Entity Resolution Problem Name Address Email Nation Gender Catherine Zeta-Jones 9601 Wilshire Blvd., Beverly Hills, CA 90210-5213 c.jones@gmail.com Wales F C. Zeta Jones 3rd Floor, Beverly Hills, CA 90210 c.jones@gmail.com US F Michael Jordan 676 North Michigan Avenue, Suite 293, Chicago US M Bob Dylan 1230 Avenue of the Americas, NY 10020 US M Name Apt Email Country Sex Catherine Zeta-Jones 9601 Wilshire, 3rd Floor, Beverly Hills, CA 90210 c.jones@gmail.com Wales F B. Dylan 1230 Avenue of the Americas, NY 10020 bob.dylan@gmail.com US M Michael Jordan 427 Evans Hall #3860, Berkeley, CA 94720 jordan@cs.berkeley.edu US M R S
  • 13. Typical Entity Resolution Pipeline
  • 14. DeepER Solution
    • Feature Engineering
      – Automatic feature engineering that can handle syntactic and semantic similarities
    • Blocking
      – Automated and customizable blocking method with a holistic view of all attributes
    • Labeling Effort
      – Much less labeled data is needed when prior knowledge is taken into account
    ➢ Key Idea: Use distributed representations (of tuples), a fundamental concept in deep learning (DL)
  • 15. Distributed Representations of Words
    • DRs (aka word embeddings) are learned from the data
    • Semantically related words are often close to each other
      – Their geometric relationship encodes a semantic relationship
    • Map each word to a high-dimensional vector with a fixed dimension d, e.g., 300 for GloVe
    • Each word → a distribution of weights (+/-) across d dimensions
    • Example: king – man + woman ≈ queen
  • 16. From DRs of Words to DRs of Tuples
    Tuples:
    | | Name | City |
    | t1 | Bill Gates | Seattle |
    | t2 | William Gates | Seattle |
    DRs for words:
    | Word | DR |
    | Bill | [0.4, 0.8, 0.9] |
    | William | [0.3, 0.9, 0.7] |
    | Gates | [0.5, 0.8, 0.8] |
    | Seattle | [0.1, 0.1, 0.2] |
    DRs for tuples (to be computed):
    | Tuple | DR |
    | t1 | [?, ?, ?, ?] |
    | t2 | [?, ?, ?, ?] |
    1. Simple Approach: Averaging
      – Ignores word order
      – DR(bill gates) = 0.5 * (DR(bill) + DR(gates))
      – Simple to train
    2. Compositional Approach: RNN with LSTM
      – Takes word and attribute order into account
      – Uses a NN to semantically compose the word vectors into an attribute-level vector
    https://github.com/daqcri/deeper-lite
    https://github.com/daqcri/DeepER
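The averaging approach can be worked through with the toy word vectors from this slide. The following is a minimal sketch. It assumes each attribute value is tokenized on whitespace and lowercased, and that two tuples are then compared attribute by attribute with cosine similarity. The function names are illustrative and are not taken from the DeepER repositories.

```python
import math

# Toy word vectors copied from the slide (3 dimensions for readability).
WORD_DR = {
    "bill": [0.4, 0.8, 0.9],
    "william": [0.3, 0.9, 0.7],
    "gates": [0.5, 0.8, 0.8],
    "seattle": [0.1, 0.1, 0.2],
}


def attribute_dr(value):
    """Average the word vectors of an attribute value (word order is ignored)."""
    vectors = [WORD_DR[word] for word in value.lower().split()]
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_vector(t1, t2):
    """One cosine similarity per attribute, usable as classifier features."""
    return [cosine(attribute_dr(a), attribute_dr(b)) for a, b in zip(t1, t2)]


t1 = ("Bill Gates", "Seattle")
t2 = ("William Gates", "Seattle")
name_dr = attribute_dr(t1[0])        # 0.5 * (DR(bill) + DR(gates))
features = similarity_vector(t1, t2)  # [name similarity, city similarity]
```

Here `name_dr` works out to approximately [0.45, 0.8, 0.85]. Because averaging ignores word order, "Bill Gates" and "Gates Bill" receive the same vector, which is the limitation the compositional approach addresses.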
  • 17. Transfer Learning for ER
  • 18. The TL Problem: Given a target dataset D_T on which we need to do ER, where D_T has limited or no training data, is it possible to train a good ML classifier for D_T by reusing and adapting training data from a related dataset D_S?
  • 19. [Figure: feature space of D_S and feature space of D_T, related by Feature Truncation or Feature Standardization]
    • Feature space truncation
      – Use only the common attributes
    • Feature space standardization
      – Tuples from each relation are all encoded into a standard feature space of fixed dimension
    • Advantages of feature standardization based on DRs:
      1. Reuse of ML classifiers
      2. Encodes semantic similarity and provides a fine-grained similarity computed holistically
      3. Pool training data from multiple source datasets
      4. Minimize domain expert effort in identifying appropriate features, similarity functions, …
      5. Reuse popular DRs such as Word2vec, GloVe, and FastText
  • 20. Our Solution …
    | Training data: Source | Training data: Target | Method | Description |
    | Adequate | Nothing | Unsupervised domain adaptation | Use a weighted paradigm where different similarity vectors have different weights based on their fidelity to D_T |
    | Adequate | Limited | Feature augmentation | Learn parameters jointly when appropriate and individually otherwise |
    | Limited | Limited | Semi-supervised domain adaptation | Use both unlabeled and labeled data |
    | Adequate | Adequate | Easy: any of the above algorithms | |
    Algorithms that
      1. successfully address ER-specific challenges such as imbalanced data, diverse schemata, and varying vocabulary,
      2. are capable of leveraging key ER properties such as similarity vectors as features and monotonicity of precision,
      3. work on classifiers that are widely used in ER,
      4. are dataset and domain agnostic, and
      5. allow seamless transfer from multiple source datasets.
  • 21. Scenario (Adequate, Nothing)
  • 22. Scenario (Adequate, Limited)
    • Feature Augmentation: a similarity vector x of dimension d is transformed into a similarity vector ɸ of dimension 3 × d by duplicating each feature in x in a manner that is different for D_S and D_T
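The slide does not spell out how the features are duplicated. The following is a minimal sketch, assuming the layout from Daumé III's "frustratingly easy" domain adaptation: a shared copy of x, a source-only copy, and a target-only copy. Whether the talk uses exactly this block order is an assumption, and the function name `augment` is illustrative.

```python
def augment(x, domain):
    """Map a d-dimensional similarity vector to 3 * d dimensions.

    Layout (assumed): [shared block, source-only block, target-only block].
    A source vector fills the shared and source-only blocks.
    A target vector fills the shared and target-only blocks.
    """
    x = list(x)
    zeros = [0.0] * len(x)
    if domain == "source":
        return x + x + zeros
    if domain == "target":
        return x + zeros + x
    raise ValueError("domain must be 'source' or 'target'")


source_phi = augment([0.9, 0.2], "source")  # [0.9, 0.2, 0.9, 0.2, 0.0, 0.0]
target_phi = augment([0.9, 0.2], "target")  # [0.9, 0.2, 0.0, 0.0, 0.9, 0.2]
```

A single classifier trained on the augmented vectors from both datasets can put weight on the shared block for features that behave the same in D_S and D_T, and on the domain-specific blocks otherwise. This matches the table's description of learning parameters jointly when appropriate and individually otherwise.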
  • 23. Scenario (Limited, Limited)
    • Feature Augmentation: as for (Adequate, Limited)
    • Data Augmentation: create 2 copies of each x in D_T^U, the unlabeled part of D_T (labels: duplicate / non-duplicate). Ensure that the weights learned for the transformed dataset also agree on the unlabeled data.
  • 24. Protein Structure Prediction
  • 25. Will my protein crystallize or not, given the sequence of the protein?
    ● Answering this question is important for understanding protein function and designing drugs
    ● Current approach
      ● Protein structure determination using X-ray crystallography
      ● High attrition rate; trial-and-error settings increase production cost
    ● New ML methods
      ● Use sequence, biochemical, and structural features, mostly with SVM or RF classifiers
    ➢ DeepCrystal, a CNN-based deep learning framework, exploits frequent k-mers (runs of k consecutive amino acid residues) and groups of k-mers using the raw protein sequences only
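To make the notion of a k-mer concrete, the following minimal sketch slides a window of length k over a raw sequence and counts the most frequent k-mers across a set of sequences. It illustrates the k-mer idea only and is not the DeepCrystal model, which is a CNN. The example sequences and function names are made up for illustration.

```python
from collections import Counter


def kmers(sequence, k):
    """All overlapping substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def frequent_kmers(sequences, k, top_n):
    """The `top_n` most common k-mers across all sequences, with counts."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    return counts.most_common(top_n)


# Hypothetical amino acid sequences (one-letter residue codes).
toy_sequences = ["MKVMKV", "AMKVL"]
top = frequent_kmers(toy_sequences, k=3, top_n=1)  # [("MKV", 3)]
```

A sequence of length n yields n - k + 1 overlapping k-mers, so short windows produce many repeated patterns while long windows produce mostly unique ones.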
  • 26. Architecture for the DeepCrystal model: https://deeplearning-protein.qcri.org/index.html
  • 27. References
  • 28. References
    ● Data Discovery: http://da.qcri.org/ntang/pubs/icde2018semantic.pdf
    ● Entity Resolution: http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
    ● Transfer Learning: https://arxiv.org/abs/1809.11084
    ● Disguised Missing Values: http://da.qcri.org/ntang/pubs/kdd18.pdf
    ● Protein Structure Prediction: https://deeplearning-protein.qcri.org/index.html