Cutting-edge in the Machine Learning Field
MLSD18: 1st edition of the Machine Learning School in Doha.
Author: Dr. Mourad Ouzzani, Principal Scientist at the Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University (HBKU).
2. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 2
Cutting-edge Research in Data Curation:
Data Discovery, Data Integration, and Data Cleaning
Mourad Ouzzani
Principal Scientist, QCRI, HBKU
The Data Discovery Problem
[Figure: departments such as Finance, Sales, and Tech, each with its own databases, logs, reports…]
• How do I find relevant data?
• Am I missing important data?
Declare What You Want!
[Figure: the Sales and Tech departments' databases, logs, reports…]
Employee Id | Name  | Gender | Department
1001        | John  | Male   | Finance
1002        | Mary  | Female | Tech
1003        | Susan | Female | Finance
$> find_schema_with("department", "gender", "employee")
Data Discovery System
• Profile the data and build the Enterprise Knowledge Graph (EKG)
• Enrich the EKG by exposing semantic relations using reference data
• APIs to query the EKG
  o similarTables(t: table) = schemaSim(t) AND contentSim(t)
  o joinPath(src: table, tgt: table) = paths_between(src, tgt, Relation.PKFK)
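A minimal sketch of how such composable discovery APIs could look over an EKG. All names, thresholds, and data structures here are illustrative assumptions, not the actual Aurum interface:

```python
from collections import deque

# Hypothetical EKG: a set of tables plus precomputed pairwise similarity
# scores and PK-FK edges produced by the profiling step.

def schema_sim(ekg, table, threshold=0.8):
    """Tables whose schemas (column names) are similar to `table`'s."""
    return {t for t in ekg["tables"]
            if t != table and ekg["schema_sim"].get((table, t), 0) >= threshold}

def content_sim(ekg, table, threshold=0.8):
    """Tables whose column contents overlap with `table`'s."""
    return {t for t in ekg["tables"]
            if t != table and ekg["content_sim"].get((table, t), 0) >= threshold}

def similar_tables(ekg, table):
    # similarTables(t) = schemaSim(t) AND contentSim(t)
    return schema_sim(ekg, table) & content_sim(ekg, table)

def join_path(ekg, src, tgt):
    """joinPath(src, tgt): shortest path over PK-FK edges, via BFS."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        for nxt in ekg["pkfk"].get(path[-1], ()):
            if nxt == tgt:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no PK-FK join path exists
```

The key design point is composability: higher-level discovery queries are boolean/graph combinations of the primitive similarity and path predicates.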
[Figure: an example Enterprise Knowledge Graph linking biomedical datasets — DrugCentral, Chembl_22, Target Dictionary, Variant Sequences, Experimental Factor Ontology, Internal Ontology — through entities such as Protein, Drug, Chemical Compound, Drug Target, and Drug Indication, with relations like interacts_with.]
https://www.csail.mit.edu/research/aurum-large-scale-data-discovery
Coherent Groups
• Coherent groups
  – Semantic signatures built out of many words
  – A coherent group roughly indicates a concept: its words fall in the same semantic space
  – The all-pairs similarity of the words in the group exceeds a threshold
  – Use word embeddings to capture "semantic" similarity

[Figure: a pair of schema elements that are related vs. a pair that are unrelated]

Coherency Factor of a set of vectors X: the average of the all-pairs similarities of the elements of X
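The coherency factor can be computed directly from the embeddings. A minimal sketch with toy 3-dimensional vectors (real systems would use learned embeddings such as 300-d GloVe; the words and vectors below are made up):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def coherency_factor(X):
    """Average of the all-pairs cosine similarities of the vectors in X."""
    pairs = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))]
    return sum(cosine(X[i], X[j]) for i, j in pairs) / len(pairs)

# Toy embeddings: "salary" and "wage" point in nearly the same direction,
# so together they form a coherent group; "protein" is orthogonal to both.
emb = {
    "salary":  np.array([0.9, 0.1, 0.0]),
    "wage":    np.array([0.8, 0.2, 0.0]),
    "protein": np.array([0.0, 0.0, 1.0]),
}
coherent   = coherency_factor([emb["salary"], emb["wage"]])
incoherent = coherency_factor([emb["salary"], emb["wage"], emb["protein"]])
```

A group is accepted as coherent when its coherency factor is above the chosen threshold; adding an off-topic word pulls the average down.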
The Entity Resolution Problem
R:
Name                 | Address                                            | Email             | Nation | Gender
Catherine Zeta-Jones | 9601 Wilshire Blvd., Beverly Hills, CA 90210-5213  | c.jones@gmail.com | Wales  | F
C. Zeta Jones        | 3rd Floor, Beverly Hills, CA 90210                 | c.jones@gmail.com | US     | F
Michael Jordan       | 676 North Michigan Avenue, Suite 293, Chicago      |                   | US     | M
Bob Dylan            | 1230 Avenue of the Americas, NY 10020              |                   | US     | M

S:
Name                 | Apt                                                | Email                  | Country | Sex
Catherine Zeta-Jones | 9601 Wilshire, 3rd Floor, Beverly Hills, CA 90210  | c.jones@gmail.com      | Wales   | F
B. Dylan             | 1230 Avenue of the Americas, NY 10020              | bob.dylan@gmail.com    | US      | M
Michael Jordan       | 427 Evans Hall #3860, Berkeley, CA 94720           | jordan@cs.berkeley.edu | US      | M
DeepER Solution
• Feature Engineering
  • Automatic feature engineering that can handle syntactic/semantic similarities
• Blocking
  • Automated and customizable blocking method with a holistic view of all attributes
• Labeling Effort
  • Much less labeled data needed, by considering prior knowledge
➢ Key Idea: Use distributed representations (of tuples) — a fundamental concept in deep learning (DL)
Distributed Representations of Words
• DRs (aka word embeddings) are learned from the data
• Semantically related words are often close to each other
  • Their geometric relationship encodes a semantic relationship
• Map each word into a high-dimensional vector with a fixed dimension d, e.g., 300 for GloVe
• Each word → a distribution of weights (+/-) across the d dimensions

king – man + woman = queen
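The analogy above reduces to vector arithmetic plus a nearest-neighbor lookup. A toy sketch with hand-picked 3-d vectors (real embeddings such as 300-d GloVe are learned from text; these values are invented purely to make the geometry visible):

```python
import numpy as np

# Toy "embeddings": one axis for royalty, one for male, one for female.
emb = {
    "king":  np.array([1.0, 1.0, 0.0]),  # royalty + male
    "man":   np.array([0.0, 1.0, 0.0]),  # male
    "woman": np.array([0.0, 0.0, 1.0]),  # female
    "queen": np.array([1.0, 0.0, 1.0]),  # royalty + female
}

def nearest(vec, vocab):
    """Word whose embedding has the highest cosine similarity to `vec`."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(vocab, key=lambda w: cos(vec, vocab[w]))

# king - man + woman lands on the royalty+female direction, i.e., queen.
target = emb["king"] - emb["man"] + emb["woman"]
```

With learned embeddings the result is approximate (the nearest neighbor of the target vector), which is why the relation is usually written with ≈ rather than =.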
From DRs of Words to DRs of Tuples

     Name          | City
t1   Bill Gates    | Seattle
t2   William Gates | Seattle

Word    | DR for words
Bill    | [0.4, 0.8, 0.9]
William | [0.3, 0.9, 0.7]
Gates   | [0.5, 0.8, 0.8]
Seattle | [0.1, 0.1, 0.2]

     DR for tuples
t1   [?, ?, ?]
t2   [?, ?, ?]

1. Simple Approach – Averaging
   – Ignores word order
   – DR(bill gates) = 0.5 * (DR(bill) + DR(gates))
   – Simple to train
2. Compositional Approach – RNN with LSTM
   – Takes word and attribute order into account
   – Uses a NN to semantically compose the word vectors into an attribute-level vector

https://github.com/daqcri/deeper-lite
https://github.com/daqcri/DeepER
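The averaging approach can be sketched directly with the toy word vectors from the table above (a minimal sketch of approach 1; DeepER's actual pipeline adds blocking and a classifier on top):

```python
import numpy as np

# Word DRs from the example table
dr = {
    "bill":    np.array([0.4, 0.8, 0.9]),
    "william": np.array([0.3, 0.9, 0.7]),
    "gates":   np.array([0.5, 0.8, 0.8]),
    "seattle": np.array([0.1, 0.1, 0.2]),
}

def tuple_dr(words):
    """Averaging approach: the tuple's DR is the mean of its word DRs."""
    return np.mean([dr[w] for w in words], axis=0)

t1 = tuple_dr(["bill", "gates", "seattle"])     # DR of (Bill Gates, Seattle)
t2 = tuple_dr(["william", "gates", "seattle"])  # DR of (William Gates, Seattle)
```

Because "Bill" and "William" have nearby embeddings, t1 and t2 end up close in the tuple space, which is exactly the signal an ER classifier needs despite the different surface strings.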
Transfer Learning for ER
The TL Problem
Given a target dataset DT on which we need to do ER, such that DT has limited or no training data, is it possible to train a good ML classifier for DT by reusing and adapting training data from a related dataset DS?
[Figure: mapping between the feature space of DS and the feature space of DT, via feature truncation or feature standardization]

• Feature space truncation – use only the common attributes
• Feature space standardization – tuples from each relation are all encoded into a standard feature space of a fixed dimension

Advantages of feature standardization based on DRs:
1. Reuse of ML classifiers
2. Encodes semantic similarity and has a fine-grained similarity computed holistically
3. Pool training data from multiple source datasets
4. Minimize domain expert effort in identifying appropriate features, similarity functions …
5. Reuse popular DRs such as Word2vec, GloVe, and FastText
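Feature space truncation is simple to sketch using the R/S schemas from the entity resolution example earlier in the deck:

```python
def truncate_features(schema_s, schema_t):
    """Feature space truncation: keep only the attributes common to both schemas."""
    return [a for a in schema_s if a in schema_t]

# Schemas of R (source) and S (target) from the ER example
schema_s = ["Name", "Address", "Email", "Nation", "Gender"]
schema_t = ["Name", "Apt", "Email", "Country", "Sex"]
common = truncate_features(schema_s, schema_t)
```

Note what truncation loses: Nation/Country and Gender/Sex carry the same information under different names, so they are dropped. DR-based standardization is designed to recover exactly such semantic matches.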
Training Data (Source | Target) | Method | Description
Adequate | Nothing  | Unsupervised domain adaptation    | Use a weighted paradigm where different similarity vectors have different weights based on their fidelity to DT
Adequate | Limited  | Feature augmentation              | Learn parameters jointly when appropriate and learn individually otherwise
Limited  | Limited  | Semi-supervised domain adaptation | Use both unlabeled and labeled data
Adequate | Adequate | Easy – any of the above algorithms |
Our Solution: algorithms that
1. successfully address ER-specific challenges such as imbalanced data, diverse schemata, and varying vocabulary,
2. are capable of leveraging key ER properties such as similarity vectors as features and monotonicity of precision,
3. work on classifiers that are widely used in ER,
4. are dataset and domain agnostic, and
5. allow seamless transfer from multiple source datasets.
Scenario (Adequate, Limited)
Feature Augmentation – a similarity vector x of dimension d is transformed into a similarity vector ɸ of dimension 3d by duplicating each feature in x, in a manner that differs between DS and DT.
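One standard way to realize this d → 3d duplication is the feature-augmentation convention of splitting the space into (common, source-only, target-only) blocks; a sketch under that assumption (the paper's exact transform may differ):

```python
import numpy as np

def augment(x, domain):
    """Map a d-dim similarity vector to 3d dims: <common, source, target>.

    Source vectors copy x into blocks 1 and 2; target vectors copy x into
    blocks 1 and 3. The classifier can then learn shared behavior in the
    common block and domain-specific corrections in the other two.
    """
    zero = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, zero])
    return np.concatenate([x, zero, x])

x = np.array([0.9, 0.2])        # a d = 2 similarity vector
phi_s = augment(x, "source")    # dimension 3 * 2 = 6
phi_t = augment(x, "target")
```

Because the two domains occupy different blocks, a single linear classifier trained on the augmented vectors effectively learns one shared weight vector plus a per-domain correction.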
Scenario (Limited, Limited)
Feature Augmentation – as for (Adequate, Limited).
Data Augmentation – create two copies of each x in DT_U (the unlabeled target data), one labeled duplicate and one labeled non-duplicate. Ensure that the weights learned for the transformed dataset also agree on the unlabeled data.
Will my protein crystallize or not, given the protein's sequence?
● Answering this question is important for understanding protein function and designing drugs
● Current Approach
  ● Protein structure determination using X-ray crystallography
  ● High attrition rates and trial-and-error settings increase production costs
● New ML Methods
  ● Use sequence, bio-chemical, and structure features, mostly with SVM or RF classifiers
➢ DeepCrystal, a CNN-based deep learning framework, exploits frequent k-mers (amino acid residues of length k) and groups of k-mers, using the raw protein sequences only
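Extracting the k-mers that such a model consumes is straightforward; a minimal sketch (the toy sequence below is invented, and DeepCrystal's actual input encoding may differ):

```python
from collections import Counter

def kmers(sequence, k):
    """All overlapping substrings of length k (amino-acid k-mers)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

seq = "MKTAYIAKQR"                # toy amino-acid sequence
counts = Counter(kmers(seq, 3))   # frequency of each 3-mer
```

A sequence of length n yields n - k + 1 overlapping k-mers; a CNN over the raw sequence learns which k-mers (and groups of k-mers) are predictive of crystallization.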
Architecture for the DeepCrystal model
https://deeplearning-protein.qcri.org/index.html