Cutting-edge in the Machine Learning Field
MLSD18: 1st edition of the Machine Learning School in Doha.
Author: Dr. Mourad Ouzzani, Principal Scientist at the Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University (HBKU).
2. QCRI · @bigmlcom · @QatarComputing · #MLSD18 · 2
Cutting-edge Research in Data Curation:
Data Discovery, Data Integration, and Data Cleaning
Mourad Ouzzani
Principal Scientist, QCRI, HBKU
The Data Discovery Problem
[Figure: departments such as Finance, Sales, and Tech, each with its own databases, logs, reports…]
• How do I find relevant data?
• Am I missing important data?
Declare What You Want!
[Figure: the Sales and Tech departments' databases, logs, reports…]
Employee Id | Name  | Gender | Department
1001        | John  | Male   | Finance
1002        | Mary  | Female | Tech
1003        | Susan | Female | Finance
$> find_schema_with("department", "gender", "employee")
Data Discovery System
• Profile the data and build the Enterprise Knowledge Graph (EKG)
• Enrich the EKG by exposing semantic relations using reference data
• APIs to query the EKG
  o similarTables(t: table) = schemaSim(t) AND contentSim(t)
  o joinPath(src: table, tgt: table) = paths_between(src, tgt, Relation.PKFK)
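A minimal sketch of how such composable discovery APIs could look over an EKG. All names, thresholds, and data structures here are illustrative assumptions, not the actual Aurum interface:

```python
from collections import deque

# Hypothetical EKG: a set of tables plus precomputed pairwise similarity
# scores and PK-FK edges produced by the profiling step.

def schema_sim(ekg, table, threshold=0.8):
    """Tables whose schemas (column names) are similar to `table`'s."""
    return {t for t in ekg["tables"]
            if t != table and ekg["schema_sim"].get((table, t), 0) >= threshold}

def content_sim(ekg, table, threshold=0.8):
    """Tables whose column contents overlap with `table`'s."""
    return {t for t in ekg["tables"]
            if t != table and ekg["content_sim"].get((table, t), 0) >= threshold}

def similar_tables(ekg, table):
    # similarTables(t) = schemaSim(t) AND contentSim(t)
    return schema_sim(ekg, table) & content_sim(ekg, table)

def join_path(ekg, src, tgt):
    """joinPath(src, tgt): shortest path over PK-FK edges, via BFS."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        for nxt in ekg["pkfk"].get(path[-1], ()):
            if nxt == tgt:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no PK-FK join path exists
```

The key design point is composability: higher-level discovery queries are boolean/graph combinations of the primitive similarity and path predicates.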
[Figure: an example Enterprise Knowledge Graph linking biomedical datasets — DrugCentral, Chembl_22, Target Dictionary, Variant Sequences, Experimental Factor Ontology, Internal Ontology — through entities such as Protein, Drug, Chemical Compound, Drug Target, and Drug Indication, with relations like interacts_with.]
https://www.csail.mit.edu/research/aurum-large-scale-data-discovery
Coherent Groups
• Coherent groups
  – Semantic signatures built out of many words
  – A coherent group roughly indicates a concept: its words fall in the same semantic space
  – The all-pairs similarity of the words in the group exceeds a threshold
  – Use word embeddings to capture "semantic" similarity

[Figure: a pair of schema elements that are related vs. a pair that are unrelated]

Coherency Factor of a set of vectors X: the average of the all-pairs similarities of the elements of X
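The coherency factor can be computed directly from the embeddings. A minimal sketch with toy 3-dimensional vectors (real systems would use learned embeddings such as 300-d GloVe; the words and vectors below are made up):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def coherency_factor(X):
    """Average of the all-pairs cosine similarities of the vectors in X."""
    pairs = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))]
    return sum(cosine(X[i], X[j]) for i, j in pairs) / len(pairs)

# Toy embeddings: "salary" and "wage" point in nearly the same direction,
# so together they form a coherent group; "protein" is orthogonal to both.
emb = {
    "salary":  np.array([0.9, 0.1, 0.0]),
    "wage":    np.array([0.8, 0.2, 0.0]),
    "protein": np.array([0.0, 0.0, 1.0]),
}
coherent   = coherency_factor([emb["salary"], emb["wage"]])
incoherent = coherency_factor([emb["salary"], emb["wage"], emb["protein"]])
```

A group is accepted as coherent when its coherency factor is above the chosen threshold; adding an off-topic word pulls the average down.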
The Entity Resolution Problem
R:
Name                 | Address                                            | Email             | Nation | Gender
Catherine Zeta-Jones | 9601 Wilshire Blvd., Beverly Hills, CA 90210-5213  | c.jones@gmail.com | Wales  | F
C. Zeta Jones        | 3rd Floor, Beverly Hills, CA 90210                 | c.jones@gmail.com | US     | F
Michael Jordan       | 676 North Michigan Avenue, Suite 293, Chicago      |                   | US     | M
Bob Dylan            | 1230 Avenue of the Americas, NY 10020              |                   | US     | M

S:
Name                 | Apt                                                | Email                  | Country | Sex
Catherine Zeta-Jones | 9601 Wilshire, 3rd Floor, Beverly Hills, CA 90210  | c.jones@gmail.com      | Wales   | F
B. Dylan             | 1230 Avenue of the Americas, NY 10020              | bob.dylan@gmail.com    | US      | M
Michael Jordan       | 427 Evans Hall #3860, Berkeley, CA 94720           | jordan@cs.berkeley.edu | US      | M
DeepER Solution
• Feature Engineering
  • Automatic feature engineering that can handle syntactic/semantic similarities
• Blocking
  • Automated and customizable blocking method with a holistic view of all attributes
• Labeling Effort
  • Much less labeled data needed, by considering prior knowledge
➢ Key Idea: Use distributed representations (of tuples) — a fundamental concept in deep learning (DL)
Distributed Representations of Words
• DRs (aka word embeddings) are learned from the data
• Semantically related words are often close to each other
  • Their geometric relationship encodes a semantic relationship
• Map each word into a high-dimensional vector with a fixed dimension d, e.g., 300 for GloVe
• Each word → a distribution of weights (+/-) across the d dimensions

king – man + woman = queen
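The analogy above reduces to vector arithmetic plus a nearest-neighbor lookup. A toy sketch with hand-picked 3-d vectors (real embeddings such as 300-d GloVe are learned from text; these values are invented purely to make the geometry visible):

```python
import numpy as np

# Toy "embeddings": one axis for royalty, one for male, one for female.
emb = {
    "king":  np.array([1.0, 1.0, 0.0]),  # royalty + male
    "man":   np.array([0.0, 1.0, 0.0]),  # male
    "woman": np.array([0.0, 0.0, 1.0]),  # female
    "queen": np.array([1.0, 0.0, 1.0]),  # royalty + female
}

def nearest(vec, vocab):
    """Word whose embedding has the highest cosine similarity to `vec`."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(vocab, key=lambda w: cos(vec, vocab[w]))

# king - man + woman lands on the royalty+female direction, i.e., queen.
target = emb["king"] - emb["man"] + emb["woman"]
```

With learned embeddings the result is approximate (the nearest neighbor of the target vector), which is why the relation is usually written with ≈ rather than =.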
From DRs of Words to DRs of Tuples

     Name          | City
t1   Bill Gates    | Seattle
t2   William Gates | Seattle

Word    | DR for words
Bill    | [0.4, 0.8, 0.9]
William | [0.3, 0.9, 0.7]
Gates   | [0.5, 0.8, 0.8]
Seattle | [0.1, 0.1, 0.2]

     DR for tuples
t1   [?, ?, ?]
t2   [?, ?, ?]

1. Simple Approach – Averaging
   – Ignores word order
   – DR(bill gates) = 0.5 * (DR(bill) + DR(gates))
   – Simple to train
2. Compositional Approach – RNN with LSTM
   – Takes word and attribute order into account
   – Uses a NN to semantically compose the word vectors into an attribute-level vector

https://github.com/daqcri/deeper-lite
https://github.com/daqcri/DeepER
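The averaging approach can be sketched directly with the toy word vectors from the table above (a minimal sketch of approach 1; DeepER's actual pipeline adds blocking and a classifier on top):

```python
import numpy as np

# Word DRs from the example table
dr = {
    "bill":    np.array([0.4, 0.8, 0.9]),
    "william": np.array([0.3, 0.9, 0.7]),
    "gates":   np.array([0.5, 0.8, 0.8]),
    "seattle": np.array([0.1, 0.1, 0.2]),
}

def tuple_dr(words):
    """Averaging approach: the tuple's DR is the mean of its word DRs."""
    return np.mean([dr[w] for w in words], axis=0)

t1 = tuple_dr(["bill", "gates", "seattle"])     # DR of (Bill Gates, Seattle)
t2 = tuple_dr(["william", "gates", "seattle"])  # DR of (William Gates, Seattle)
```

Because "Bill" and "William" have nearby embeddings, t1 and t2 end up close in the tuple space, which is exactly the signal an ER classifier needs despite the different surface strings.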
Transfer Learning for ER
The TL Problem
Given a target dataset DT on which we need to do ER, such that DT has limited or no training data, is it possible to train a good ML classifier for DT by reusing and adapting training data from a related dataset DS?
[Figure: mapping between the feature space of DS and the feature space of DT, via feature truncation or feature standardization]

• Feature space truncation – use only the common attributes
• Feature space standardization – tuples from each relation are all encoded into a standard feature space of a fixed dimension

Advantages of feature standardization based on DRs:
1. Reuse of ML classifiers
2. Encodes semantic similarity and has a fine-grained similarity computed holistically
3. Pool training data from multiple source datasets
4. Minimize domain expert effort in identifying appropriate features, similarity functions …
5. Reuse popular DRs such as Word2vec, GloVe, and FastText
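Feature space truncation is simple to sketch using the R/S schemas from the entity resolution example earlier in the deck:

```python
def truncate_features(schema_s, schema_t):
    """Feature space truncation: keep only the attributes common to both schemas."""
    return [a for a in schema_s if a in schema_t]

# Schemas of R (source) and S (target) from the ER example
schema_s = ["Name", "Address", "Email", "Nation", "Gender"]
schema_t = ["Name", "Apt", "Email", "Country", "Sex"]
common = truncate_features(schema_s, schema_t)
```

Note what truncation loses: Nation/Country and Gender/Sex carry the same information under different names, so they are dropped. DR-based standardization is designed to recover exactly such semantic matches.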
Training Data (Source | Target) | Method | Description
Adequate | Nothing  | Unsupervised domain adaptation    | Use a weighted paradigm where different similarity vectors have different weights based on their fidelity to DT
Adequate | Limited  | Feature augmentation              | Learn parameters jointly when appropriate and learn individually otherwise
Limited  | Limited  | Semi-supervised domain adaptation | Use both unlabeled and labeled data
Adequate | Adequate | Easy – any of the above algorithms |
Our Solution: algorithms that
1. successfully address ER-specific challenges such as imbalanced data, diverse schemata, and varying vocabulary,
2. are capable of leveraging key ER properties such as similarity vectors as features and monotonicity of precision,
3. work on classifiers that are widely used in ER,
4. are dataset and domain agnostic, and
5. allow seamless transfer from multiple source datasets.
Scenario (Adequate, Limited)
Feature Augmentation – a similarity vector x of dimension d is transformed into a similarity vector ɸ of dimension 3d by duplicating each feature in x, in a manner that differs between DS and DT.
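One standard way to realize this d → 3d duplication is the feature-augmentation convention of splitting the space into (common, source-only, target-only) blocks; a sketch under that assumption (the paper's exact transform may differ):

```python
import numpy as np

def augment(x, domain):
    """Map a d-dim similarity vector to 3d dims: <common, source, target>.

    Source vectors copy x into blocks 1 and 2; target vectors copy x into
    blocks 1 and 3. The classifier can then learn shared behavior in the
    common block and domain-specific corrections in the other two.
    """
    zero = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, zero])
    return np.concatenate([x, zero, x])

x = np.array([0.9, 0.2])        # a d = 2 similarity vector
phi_s = augment(x, "source")    # dimension 3 * 2 = 6
phi_t = augment(x, "target")
```

Because the two domains occupy different blocks, a single linear classifier trained on the augmented vectors effectively learns one shared weight vector plus a per-domain correction.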
Scenario (Limited, Limited)
Feature Augmentation – as for (Adequate, Limited).
Data Augmentation – create two copies of each x in DT_U (the unlabeled target data), one labeled duplicate and one labeled non-duplicate. Ensure that the weights learned for the transformed dataset also agree on the unlabeled data.
Will my protein crystallize or not, given the protein's sequence?
● Answering this question is important for understanding protein function and designing drugs
● Current Approach
  ● Protein structure determination using X-ray crystallography
  ● High attrition rates and trial-and-error settings increase production costs
● New ML Methods
  ● Use sequence, bio-chemical, and structure features, mostly with SVM or RF classifiers
➢ DeepCrystal, a CNN-based deep learning framework, exploits frequent k-mers (amino acid residues of length k) and groups of k-mers, using the raw protein sequences only
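Extracting the k-mers that such a model consumes is straightforward; a minimal sketch (the toy sequence below is invented, and DeepCrystal's actual input encoding may differ):

```python
from collections import Counter

def kmers(sequence, k):
    """All overlapping substrings of length k (amino-acid k-mers)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

seq = "MKTAYIAKQR"                # toy amino-acid sequence
counts = Counter(kmers(seq, 3))   # frequency of each 3-mer
```

A sequence of length n yields n - k + 1 overlapping k-mers; a CNN over the raw sequence learns which k-mers (and groups of k-mers) are predictive of crystallization.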
Architecture for the DeepCrystal model
https://deeplearning-protein.qcri.org/index.html