Scaling Big Data Cleansing

PHD DEFENSE OF: ZUHAIR KHAYYAT
MAY, 2017

What is Data Cleansing?
• Data cleansing is the process of:
  A. detecting errors in record sets, tables, or databases (violation detection)
  B. and fixing them (violation repair)
• Example errors in data:
  • Typos • Duplicates • Values inconsistent with business rules
  • Outliers • Outdated values • Missing values
May 16, 2017 2/73

Why is Data Cleansing Important?
• 25% of the world's critical data is dirty
• 60% - 98% of a data scientist's time is spent on data cleansing
• “duplicate and dirty data costs the healthcare industry over $300 billion every year” -- Joe Fusaro (RingLead)
• “inaccurate data has a direct impact ... the average company losing 12% of its revenue” -- Ben Davis (Econsultancy)

Example of a Dirty Dataset
A company employee database:
• Rule 1: Any two employees in the same Zipcode must be in the same City
• Rule 2: An employee who earns a higher salary must pay more taxes than others

     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28

The Process of Data Cleansing
[Figure: the data cleansing cycle. Inputs: Detection Rules and dirty Input Data. Steps: 1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data. Output: Clean Data.]

Why is Dirty Data Still a Problem?
• Data is growing at a 40% compound annual rate
• Source: Oracle, 2012, https://goo.gl/uHd4uR
[Figure: data growth chart, labeled ≈ 15 Zettabytes]

Problems of Big Data Cleansing
[Figure: the data cleansing cycle (1st: Detect, 2nd: Analyze, 3rd: Repair), annotated with "≈ 90% Runtime" and "Most of Research"]
[Chart: Time (seconds) vs. violation percentage (1%, 5%, 10%, 50%); series: Violation detection, Data repair]

1. Violation detection becomes too expensive with big data:
a. Enumerating all tuples is not possible
b. Not feasible to implement a parallel version of each detection rule
c. Serial repair algorithms cannot handle big errors

2. Complex error discovery rules based on inequality conditions are too expensive:
Rule 2: An employee who earns a higher salary must pay more taxes than others
→ (ti.Salary < tj.Salary) AND (ti.Rate > tj.Rate)

3. Error graph (violation graph) is random, big and unpredictable:
• Irregular structures
• Skewed distributions
• Unpredictable workload of algorithm

Problems & Solutions of Big Data Cleansing
Problem 1: Violation detection becomes too expensive with big data
  Solution: Develop a general-purpose scalable data cleansing system (BigDansing)
Problem 2: Complex error discovery rules based on inequality conditions are too expensive
  Solution: Introduce a new join algorithm to enhance inequality joins (IEJoin)
Problem 3: Error graph (violation graph) is random, big and unpredictable
  Solution: Build a general graph system that adapts to various graph structures and algorithms (Mizan)

BigDansing: A System for Big Data Cleansing

Related work?
NADEEF*
[Figure: NADEEF architecture, with labels DBMS, UDF, Declarative Rules]
✓ Easy-to-use
✓ Extensible
✓ Efficient
✗ Scalable (Single Machine)
* M. Dallachiesa, et al., “NADEEF: A Commodity Data Cleaning System,” in SIGMOD 2013

What does Big Data Cleansing require?
1. Scale Detection
§ Declarative rules
Ø Functional dependencies (FDs, CFDs)
Ø Denial constraints (DCs)
§ User defined functions
2. Scale Repairs
§ Handle serial repair algorithms

BigDansing – Scaling Violation Detection
Supported rule types: Functional dependencies, Denial constraints, Entity resolution, Inclusion dependencies
Domain Specific Language operators: Scope, Block, Iterate, Detect, GenFix

BigDansing – Input
[Figure: input is either UDFs (Scope, Block, Iterate, Detect, GenFix) or Declarative Rules passed through the Rule Parser; both yield the Violation Detection Plan (Logical Plan)]

BigDansing – Plan Conversion and Optimization
Logical Plan → Physical Plan → Execution Plan

Rule 1 – Logical Plan
§ Any two employees in same Zipcode must be in same City
§ FD: Zipcode → City
Logical operators: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix

Rule 1 – Physical Plan
Physical operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct
Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix

Rule 1 – Execution Plan
Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-PBlock → Spark-PIterate → Spark-PDetect → Spark-PGenFix

Rule 1 – Execution Example
1) Scope: keep only Zipcode and City
     Zipcode  City
t1   10001    NY
t2   90210    LA
t3   60601    CH
t4   90210    SF
t5   60827    CH
t6   90210    LA
2) Block: group tuples by Zipcode
   10001: t1
   90210: t2, t4, t6
   60601: t3
   60827: t5
3) Iterate: enumerate candidate pairs within each block
   (t2, t4), (t2, t6), (t4, t6)
4) Detect: keep the pairs whose City values differ
   (t2, t4), (t4, t6)
5) GenFix: generate possible fixes
   t2[City] = t4[City]
   t4[City] = t6[City]
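The five steps above can be sketched in a few lines of Python. This is only an illustration of the operator chain on the six example tuples; the function names (scope, block, iterate, detect, gen_fix) mirror the logical operators but are assumptions of this sketch, not BigDansing's actual API.

```python
from collections import defaultdict
from itertools import combinations

# Example table from the slides: id -> (Name, Zipcode, City, State, Salary, Rate)
EMPLOYEES = {
    "t1": ("Annie", "10001", "NY", "NY", 24000, 15),
    "t2": ("Laure", "90210", "LA", "CA", 25000, 10),
    "t3": ("John", "60601", "CH", "IL", 40000, 25),
    "t4": ("Mark", "90210", "SF", "CA", 88000, 28),
    "t5": ("Robert", "60827", "CH", "IL", 15000, 15),
    "t6": ("Mary", "90210", "LA", "CA", 81000, 28),
}


def scope(rows):
    """Scope: keep only the attributes Rule 1 needs (Zipcode, City)."""
    return {tid: (r[1], r[2]) for tid, r in rows.items()}


def block(scoped):
    """Block: group tuple ids by Zipcode."""
    blocks = defaultdict(list)
    for tid, (zipcode, _city) in scoped.items():
        blocks[zipcode].append(tid)
    return blocks


def iterate(blocks):
    """Iterate: enumerate candidate pairs inside each block."""
    pairs = []
    for tids in blocks.values():
        pairs.extend(combinations(tids, 2))
    return pairs


def detect(pairs, scoped):
    """Detect: a pair violates Zipcode -> City if the cities differ."""
    return [(a, b) for a, b in pairs if scoped[a][1] != scoped[b][1]]


def gen_fix(violations):
    """GenFix: propose making the two City values equal."""
    return [f"{a}[City] = {b}[City]" for a, b in violations]


scoped = scope(EMPLOYEES)
candidates = iterate(block(scoped))
violations = detect(candidates, scoped)
fixes = gen_fix(violations)
```

Because Block restricts Iterate to tuples that share a Zipcode, only three candidate pairs are compared here instead of all fifteen pairs of the table.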

§ Rule 2: An employee who earns a higher salary must pay more taxes than others
§ DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
• For Annie, compare Salary with Laure, John, Mark, Robert, and Mary
• For each employee with a higher salary, compare Rate
• Report a violation when Annie's Rate is higher (here: Laure)

Rule 2 – Logical Plan
Logical operators: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix

Rule 2 – Physical Plan
Physical operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct
Physical plan: PScope → UCrossProduct → PDetect → PGenFix

Rule 2 – Execution Plan
Physical plan: PScope → UCrossProduct → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-UCrossProduct → Spark-PDetect → Spark-PGenFix

Plan Optimizations – OCJoin
Steps: Range Partitioning → Sorting → Pruning → Joining
[Figure: the data is range-partitioned into Partition 1 … Partition n based on Salary and based on Rate; after sorting and pruning, only the overlapping partitions are joined (in the figure: Partition 2 ⨝ Partition 3 and Partition 5 ⨝ Partition 6)]
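The partition, prune, join idea can be illustrated with a small Python sketch. It is a simplified reading of the figure, not the OCJoin implementation: it range-partitions on Salary only, keeps per-partition minimum and maximum values, and skips every partition pair that cannot contain a violation of Rule 2. The names range_partition and ocjoin_sketch are hypothetical.

```python
def range_partition(rows, n_parts):
    """Sort rows (id, salary, rate) by salary and cut them into contiguous ranges."""
    if not rows:
        return []
    ordered = sorted(rows, key=lambda r: r[1])
    size = -(-len(ordered) // n_parts)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def ocjoin_sketch(rows, n_parts=3):
    """Return (pairs, joined) for a.salary < b.salary AND a.rate > b.rate.

    `joined` counts the partition pairs that survived pruning.
    """
    parts = range_partition(rows, n_parts)
    # Per-partition statistics: (min salary, max salary, min rate, max rate)
    stats = [
        (min(r[1] for r in p), max(r[1] for r in p),
         min(r[2] for r in p), max(r[2] for r in p))
        for p in parts
    ]
    pairs, joined = [], 0
    for i, left in enumerate(parts):
        for j, right in enumerate(parts):
            min_sal_l, _, _, max_rate_l = stats[i]
            _, max_sal_r, min_rate_r, _ = stats[j]
            # Pruning: no tuple of `left` can pair with any tuple of `right`.
            if not (min_sal_l < max_sal_r and max_rate_l > min_rate_r):
                continue
            joined += 1
            pairs.extend(
                (a[0], b[0]) for a in left for b in right
                if a[1] < b[1] and a[2] > b[2]
            )
    return pairs, joined


ROWS = [("t1", 24000, 15), ("t2", 25000, 10), ("t3", 40000, 25),
        ("t4", 88000, 28), ("t5", 15000, 15), ("t6", 81000, 28)]
pairs, joined = ocjoin_sketch(ROWS)
```

On the example table only two of the nine partition pairs survive pruning, and the two violating pairs found are the ones involving t2.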

§ Rule 2: An employee who earns a higher salary must pay more taxes than others
Physical plan: PScope → OCJoin(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-OCJoin → Spark-PDetect → Spark-PGenFix

What does Big Data Cleansing require?
1. Scale Detection
§ Declarative rules
Ø Functional dependencies (FDs, CFDs)
Ø Denial constraints (DCs)
§ User defined functions
2. Scale Repairs
§ Handle serial repair algorithms

BigDansing – Structure of the Violation Graph
Rule 1: Zipcode → City
Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
• Rule 1: t2[City] = t4[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Rate] < t2[Rate]

BigDansing – Structure of the Violation Graph
Rule 1: Zipcode → City
Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
[Figure: violation graph over t1, t2, t4, t5, t6 with edges labeled R1: City (t2 and t4, t4 and t6) and R2: Salary, Rate (t1, t5, and t2)]
• Rule 1: t2[City] = t4[City]
• Rule 1: t4[City] = t6[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Rate] < t2[Rate]
• Rule 2: t5[Salary] > t2[Salary] OR t5[Rate] < t2[Rate]

BigDansing – Data Repair as a Black box
[Figure: Graph Analysis divides the violation graph into sub-graphs (t1, t5, t2 with R2: Salary, Tax; t2, t4, t6 with R1: City; further tx, ty pairs with R1: City); each sub-graph is handed to its own instance of the Serial Repair Algorithm]
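One plausible reading of the "Graph Analysis" box is a split of the violation graph into independent pieces that separate repair instances can process. The sketch below uses plain connected components for that split; this choice, the function name connected_components, and the extra (t7, t8) violation in the usage lines are assumptions for illustration and are not taken from the slides.

```python
from collections import defaultdict

# Violations detected on the example table: (rule, (tuple, tuple))
VIOLATIONS = [
    ("R1", ("t2", "t4")),
    ("R1", ("t4", "t6")),
    ("R2", ("t1", "t2")),
    ("R2", ("t5", "t2")),
]


def connected_components(violations):
    """Group tuples that are linked, directly or indirectly, by violations."""
    adjacency = defaultdict(set)
    for _rule, (a, b) in violations:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            vertex = stack.pop()
            if vertex in component:
                continue
            component.add(vertex)
            stack.extend(adjacency[vertex] - component)
        seen |= component
        components.append(sorted(component))
    return components


components = connected_components(VIOLATIONS)
# A hypothetical extra violation between two unrelated tuples forms its own
# component, which a second repair instance could handle in parallel.
with_extra = connected_components(VIOLATIONS + [("R1", ("t7", "t8"))])
```

Each component can then be passed unchanged to an existing serial repair algorithm, which is what lets the repair step stay a black box.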

BigDansing – Apache Spark Stack

Performance of a Single Machine
[Two charts, panels labeled Rule 1 and Rule 2; series in both: BigDansing, NADEEF, PostgreSQL, Spark SQL, Shark]
[Chart: Runtime (seconds) vs. dataset size (100,000; 1,000,000; 10,000,000 rows). Data labels as extracted: 5, 18, 86, 55, 368, 0.264, 37, 3183, 4, 8, 80, 2, 47, 4153]
[Chart: Runtime (seconds) vs. dataset size (100,000; 200,000; 300,000 rows). Data labels as extracted: 10, 30, 62, 833, 4529, 9336, 2133, 8780, 3731, 7982]

Performance on a 16-machine cluster
[Two charts, panels labeled Rule 1 and Rule 2]
[Chart: Time (seconds) vs. dataset size (1M, 2M, 3M rows); series: BigDansing-Spark, Spark SQL, Shark. Data labels as extracted: 1240, 5319, 7730]
[Chart: Time (seconds) vs. dataset size (10M, 20M, 40M rows); series: BigDansing-Spark, BigDansing-Hadoop, Spark SQL, Shark. Data labels as extracted: 121, 150, 337, 503, 865, 2302, 159, 313, 662, 3739, 14113, 126822]

Performance on a 16-machine cluster
[Chart: Runtime (seconds) vs. number of workers (1, 2, 4, 8, 16); series: BigDansing, Spark SQL]
[Chart: Time (seconds) vs. dataset size (647M, 959M, 1271M, 1583M, 1907M rows); series: BigDansing-Spark, BigDansing-Hadoop, Spark SQL]
Data labels as extracted: 712, 2307, 5113, 8670, 11880, 24803, 52886, 92236, 138932, 196133, 9263, 17872, 30195, 46907, 65115

Detecting Violations on RDF
[Figure: Scope feeds three parallel branches (Block 1 → Iterate 1, Block 2 → Iterate 2, Block 3 → Iterate 3), followed by Detect → GenFix]

Detecting Violations on RDF
[Chart: Runtime (seconds) vs. number of RDF triples (21M, 42M, 85M, 170M), comparing BigDansing and S2RDF*; each bar is split into Pre-processing and Violation Detection]
* Alexander Schätzle, et al., “S2RDF: RDF Querying with SPARQL on Spark”, in PVLDB 2016

BigDansing: A System for Big Data Cleansing
✓ Easy-to-use
✓ Efficient
✓ Extensible
✓ Scalable
* Zuhair Khayyat, et al., “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.

IEJoin: Fast and Scalable Inequality Joins

OCJoin in BigDansing
[Chart: Runtime (seconds) vs. data size (100,000; 200,000; 300,000 rows); series: OCJoin, UCrossProduct, Cross product. Data labels as extracted: 97, 103, 126, 4279, 22912, 61772, 4953, 27078, 82524]
[Chart: Time (seconds) vs. dataset size (1M, 2M, 3M rows); series: BigDansing-Spark, Spark SQL, Shark. Data labels as extracted: 1240, 5319, 7730]

What is the Problem?
• Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
§ SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
• Processed as a Cartesian product: O(n²)
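To make the quadratic cost concrete, here is a minimal nested-loop evaluation of the Rule 2 join on the example table. It is a baseline sketch of what a Cartesian-product plan does, written for illustration; the function name naive_inequality_join is hypothetical.

```python
# Example table reduced to the two attributes Rule 2 needs: id -> (Salary, Rate)
EMPLOYEES = {
    "t1": (24000, 15), "t2": (25000, 10), "t3": (40000, 25),
    "t4": (88000, 28), "t5": (15000, 15), "t6": (81000, 28),
}


def naive_inequality_join(rows):
    """Compare every tuple with every tuple: n * n predicate evaluations."""
    comparisons = 0
    result = []
    for a_id, (a_salary, a_rate) in rows.items():
        for b_id, (b_salary, b_rate) in rows.items():
            comparisons += 1
            if a_salary < b_salary and a_rate > b_rate:
                result.append((a_id, b_id))
    return result, comparisons


result, comparisons = naive_inequality_join(EMPLOYEES)
```

Six tuples already need 36 comparisons to find two violating pairs; the comparison count grows with the square of the input size.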

Related Work
• Band join:
§ Based on a point within a range: R.A − c1 ≤ S.B AND S.B ≤ R.A + c2
• Interval join in temporal and spatial data: not general
• Spatial indexing:
§ Large memory footprint
§ Expensive preprocessing

IEJoin – a New Join Algorithm
• In data cleansing:
§ Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary > t2.Salary AND t1.Rate < t2.Rate
• Interval intersection:
§ Q2: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≥ s.start
• Joining tables with (≠):
§ Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start

Algorithm Discovery
Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10
Sort descending on Salary: t3(150) t4(120) t1(100) t2(90)
Salary partial answer: (t2, t1), (t2, t4), (t2, t3), …, (t4, t3)
Sort descending on Rate: t3(15) t4(10) t2(9) t1(5)
Rate partial answer: (t3, t4), (t3, t2), (t3, t1), …, (t2, t1)

Algorithm Discovery
     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10
Salary partial answer (first tuple has the lower Salary):
{(t2, t1), (t2, t4), (t2, t3),
 (t1, t4), (t1, t3),
 (t4, t3)}
Rate partial answer (first tuple has the higher Rate):
{(t3, t4), (t3, t2), (t3, t1),
 (t4, t2), (t4, t1),
 (t2, t1)}
The expected result is the pair that appears in both partial answers: (t2, t1)

IEJoin – the Algorithm
     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10
• Sort descending on Salary: t3(150) t4(120) t1(100) t2(90), at positions 0 1 2 3
• Sort descending on Rate: t3(15) t4(10) t2(9) t1(5)
Permutation Array (position of each Rate-ordered tuple in the Salary order): 0 1 3 2
Bit-Array over t3 t4 t2 t1: starts as 0 0 0 0 and ends as 1 1 1 1
[Figure labels: Sequential scan, Random access]
Result = (t2, t1)
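The sort, permutation array, and bit-array combination can be sketched for the self-join Q1 as follows. This is a simplified illustration under stated assumptions, not the published IEJoin code: it handles only the predicate pair (Salary <, Rate >), treats equal values as non-matching, and scans the bit-array without the bitmap-index optimization. The function name ie_self_join is hypothetical.

```python
from bisect import bisect_right
from itertools import groupby


def ie_self_join(rows):
    """Return (permutation, pairs) for a.salary < b.salary AND a.rate > b.rate.

    rows maps a tuple id to (salary, rate).
    """
    # Sort ids by Salary descending and remember each id's position.
    by_salary = sorted(rows, key=lambda t: -rows[t][0])
    pos = {t: i for i, t in enumerate(by_salary)}
    neg_salary = [-rows[t][0] for t in by_salary]  # ascending, usable with bisect

    # Sort ids by Rate descending; the permutation array maps that order
    # to positions in the Salary order.
    by_rate = sorted(rows, key=lambda t: -rows[t][1])
    permutation = [pos[t] for t in by_rate]

    bits = [False] * len(rows)  # bits[p] is set once by_salary[p] was visited
    pairs = []
    # Visit tuples from highest to lowest Rate. Tuples with equal Rate form
    # one group, so ties never satisfy the strict '>' condition.
    for _, group in groupby(by_rate, key=lambda t: rows[t][1]):
        group = list(group)
        for b in group:
            # First position whose Salary is strictly lower than b's Salary.
            start = bisect_right(neg_salary, -rows[b][0])
            for p in range(start, len(bits)):
                if bits[p]:  # visited earlier, hence a strictly higher Rate
                    pairs.append((by_salary[p], b))
        for b in group:
            bits[pos[b]] = True
    return permutation, pairs


SLIDE_DATA = {"t1": (100, 5), "t2": (90, 9), "t3": (150, 15), "t4": (120, 10)}
permutation, pairs = ie_self_join(SLIDE_DATA)
```

On the four-tuple example this produces the permutation array 0 1 3 2 and the single result pair (t2, t1), matching the slide. The work is dominated by two sorts plus the bit-array scans, rather than by comparing every pair of tuples.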

Sorting Orders
• For self joins:
§ Salary: ascending order if OP1 is either > or ≥, otherwise descending order
§ Rate: descending order if OP2 is either > or ≥, otherwise ascending order
• For non-self joins:
§ Salary: descending order if OP1 is either > or ≥, otherwise ascending order
§ Rate: ascending order if OP2 is either > or ≥, otherwise descending order
OP1 is the operator of the Salary condition; OP2 is the operator of the Rate condition.

Optimizations – Bitmap Index
Bit-array B: 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
Bitmap index over chunks C1 C2 C3 C4: 0 1 0 0
[Figure labels: (i) pos 6, (ii) pos 9, max]
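A bitmap index of this kind can be sketched as one summary bit per fixed-size chunk of the bit-array, so a scan skips chunks that contain no set bit. The class below is a hedged illustration; the name ChunkedBitArray, the chunk size of 4, and the method names are assumptions chosen to reproduce the 16-bit example on the slide.

```python
class ChunkedBitArray:
    """Bit-array with one summary bit per chunk of `chunk` positions."""

    def __init__(self, size, chunk=4):
        self.bits = [0] * size
        self.chunk = chunk
        self.summary = [0] * ((size + chunk - 1) // chunk)

    def set(self, position):
        """Set a bit and mark its chunk as non-empty."""
        self.bits[position] = 1
        self.summary[position // self.chunk] = 1

    def scan_from(self, start):
        """Return set positions >= start, skipping chunks whose summary bit is 0."""
        found = []
        for c in range(start // self.chunk, len(self.summary)):
            if not self.summary[c]:
                continue  # the whole chunk is empty, no need to look inside
            low = max(start, c * self.chunk)
            high = min(len(self.bits), (c + 1) * self.chunk)
            found.extend(p for p in range(low, high) if self.bits[p])
        return found


index = ChunkedBitArray(16)
index.set(5)
index.set(7)
```

With bits 5 and 7 set (0-based), the summary is 0 1 0 0, so a scan from position 0 inspects only the second chunk.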

Optimizations – Not Equal Operator
• Convert each (≠) into one (>) and one (<), joined with the UNION ALL operator
Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start
Q'k: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end < s.start
     UNION ALL
     SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end > s.start

Optimizations – Selectivity Estimation
• A query with three attributes: r.Salary < s.Salary AND r.Rate > s.Rate AND r.Age > s.Age
• Use sampling to estimate the maximum output size: Est(Salary, Rate), Est(Salary, Age), Est(Rate, Age)
[Figure: Range Partitioning (based on OP1 and based on OP2) → Sorting → Pruning → Calculate Max Output, shown over Partition 1 … Partition n]
Estimated output = number of overlapping partitions = 2

IEJoin and BigDansing

Serial IEJoin vs. Naïve Baseline
[Two charts, panels labeled Salary-Rate and Interval Intersection: Runtime (seconds, log scale from 0.01 to 10000) vs. input size (10K, 50K, 100K); series: PG-IEJoin, PG-Original, MonetDB, DBMS-X]

Serial IEJoin vs. Postgres with Index – 50M Rows
[Chart: Runtime (seconds) for PG-IEJoin, PG-GiST, and PG-BTree on Q1 and Q2; each bar is split into Indexing and Querying. Data labels as extracted: X, 146, 3928, X, 310, 6287]
GiST: Generalized Search Tree

Parallel and Distributed IEJoin – 100M Rows
[Two charts, panels labeled Salary-Rate and Interval Intersection: Runtime (seconds) for Parallel-IEJoin, Distributed-IEJoin, DPG-GiST, DPG-BTree, SparkSQL-SM, and SparkSQL; each bar is split into Indexing and Querying]
Data labels as extracted, first chart: X, X, X, X, 4302, 1313
Data labels as extracted, second chart: X, X, X, 4965, 1376

IEJoin
• A new join algorithm
• Based on conditions: (<, ≤, >, ≥, ≠)
• Extremely fast and highly scalable
• Utilizes sorting and efficient data structures
• Easy to implement in traditional DBMS and distributed systems
* Zuhair Khayyat, et al., “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015
* Zuhair Khayyat, et al., “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

BigDansing’s implementations
[Figure: the data cleansing cycle (1st: Detect, 2nd: Analyze, 3rd: Repair) next to the software stack: BigDansing on top of Apache Hadoop with Giraph and Apache Spark with GraphX, all on HDFS]

Pregel*/Giraph Abstraction
• Based on vertex-centric computation
• Abstraction:
§ compute(), combine() & aggregate()
• In-memory bulk synchronous parallel (BSP) execution
[Figure: Workers 1 to 3 run Superstep 1, Superstep 2, and Superstep 3, separated by BSP barriers]
* G. Malewicz, et al., “Pregel: A System for Large-Scale Graph Processing,” in SIGMOD 2010
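The vertex-centric BSP model can be sketched with a single-process loop in which every vertex runs a compute function once per superstep and messages only become visible after the barrier. This is an illustration of the abstraction, not Pregel's or Giraph's API: the names run_bsp and max_value_compute are hypothetical, vertices are not distributed over workers, and combine() and aggregate() are left out.

```python
def max_value_compute(value, messages, superstep):
    """Vertex program: adopt the largest value seen, forward it if it changed."""
    new_value = max([value, *messages])
    changed = superstep == 0 or new_value != value
    return new_value, (new_value if changed else None)


def run_bsp(graph, initial_values, compute, max_supersteps=50):
    """graph maps a vertex to its neighbours; returns the final vertex values."""
    values = dict(initial_values)
    inbox = {v: [] for v in graph}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in graph}
        any_message = False
        for v in graph:
            # After superstep 0, a vertex without incoming messages stays idle.
            if superstep > 0 and not inbox[v]:
                continue
            values[v], message = compute(values[v], inbox[v], superstep)
            if message is not None:
                any_message = True
                for neighbour in graph[v]:
                    outbox[neighbour].append(message)
        # BSP barrier: messages sent now are delivered in the next superstep.
        inbox = outbox
        if not any_message:
            break
    return values


GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final = run_bsp(GRAPH, {"a": 3, "b": 6, "c": 2, "d": 1}, max_value_compute)
```

In a distributed setting each worker would own a subset of the vertices, and the barrier means that a superstep ends only when the slowest worker has finished.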

Problems of Giraph
The Light Side
§ Algorithm: Predictable
§ Structure: Fixed
The Dark Side
§ Algorithm: Unforeseen
§ Structure: Variable
Error graph (violation graph) is random, big and unpredictable

How Giraph Optimizes Computations
1. Faster graph loading
§ Simple graph partitioning
§ Hash, Range
2. Optimized for graph structure
§ Sophisticated and expensive partitioning techniques
§ Min-cuts
[Chart: Run time (min) for the LiveJournal, kgraph4m68m, and arabic-2005 graphs; series: Hash, Range, Min-cuts]
A single iteration is only as fast as the slowest worker

Behaviors of Different Graph Algorithms
PageRank vs. Distributed Minimal Spanning Tree
[Chart: In messages (millions, log scale from 0.001 to 1000) vs. supersteps (0 to 60); series: PageRank - Total, PageRank - Max/W, DMST - Total, DMST - Max/W]

Source of Imbalance in Giraph
1. High vertex response time
2. Long time to receive incoming messages
3. Long time to send outgoing messages
[Figure: three timelines of Superstep 1 across Workers 1 to 3 up to the BSP barrier, one per source of imbalance: high vertex response time, long time to receive incoming messages, long time to send outgoing messages]

Mizan – Solving the Workload Imbalance
• Move vertices between workers during runtime
• Plan and perform vertex migrations within the BSP barrier to maintain computation consistency
[Figure: Supersteps 1 to 3 across Workers 1 to 3, with labels BSP Barrier, Migration Barrier, and Migration Planner]
[Figure: Mizan worker architecture: Vertex Compute(), BSP Graph Processor, Load Balancer: Migration Planner, Communicator - DHT, Storage Manager, IO, HDFS/Local Disks]

Mizan’s Migration Planning Steps
1. Identify the source of workload imbalance across workers
§ Remote outgoing messages
§ All incoming messages
§ Response time
[Figure: vertices V1 to V6 on Worker 1 and Worker 2, with Mizan monitoring Remote Incoming Messages, Local Incoming Messages, Remote Outgoing Messages, and Vertex Response Time]

2. Select the migration objective through a statistical analysis
§ Optimize for outgoing messages, or
§ Optimize for incoming messages, or
§ Optimize for response time

3. Pair over-utilized workers with under-utilized ones
[Figure: workers in an ordered list at positions 0 to 8: W7 W2 W1 W5 W8 W4 W0 W6 W3; W9 is shown separately]
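One simple way to pair workers, given an ordered list like the one in the figure, is to match the most loaded worker with the least loaded one, the second most loaded with the second least loaded, and so on. The sketch below assumes exactly that policy; the function name pair_workers and the load numbers are hypothetical and only illustrate the idea.

```python
def pair_workers(load_by_worker):
    """Pair over-utilized workers with under-utilized ones.

    Workers are sorted by load (highest first); the i-th most loaded worker
    is paired with the i-th least loaded one. With an odd number of workers
    the middle one stays unpaired.
    """
    ordered = sorted(load_by_worker, key=load_by_worker.get, reverse=True)
    pairs = []
    high, low = 0, len(ordered) - 1
    while high < low:
        pairs.append((ordered[high], ordered[low]))
        high += 1
        low -= 1
    return pairs


# Hypothetical per-worker load for the chosen migration objective.
LOADS = {"W0": 10, "W1": 40, "W2": 25, "W3": 5}
pairs = pair_workers(LOADS)
```

Each pair then defines a migration direction: vertices move from the first (over-utilized) worker to the second (under-utilized) one.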

4. Select vertices to migrate
§ Select the least number of vertices that has the highest impact
§ Vertex ownership: distributed hash table (DHT)
§ Delayed migration: reduce migration cost

Performance of Mizan on PageRank
[Chart: Runtime (min) for Static, WS, and Mizan under Hash, Range, and Metis partitioning]

Performance of Mizan with Metis
[Chart: Runtime (min) for the Advertisement and DMST algorithms; series: Static, Work Stealing, Mizan]

Mizan – a General Graph Processing System
• A Pregel-clone
§ Supports very large graphs
§ Runs on very large clusters
• Dynamic fine-grained vertex migrations to balance computation and communication
• Optimized for predictable and non-predictable graph algorithms and structures
[Figure: software stack: BigDansing on top of Apache Spark with GraphX, Mizan, and Giraph, all on HDFS]
* Zuhair Khayyat, et al., “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013

Summary
• BigDansing
§ A general system for big data cleansing
§ Performance up to 2 orders of magnitude faster
§ SIGMOD 2015
• IEJoin
§ A novel algorithm for fast inequality joins
§ Performance at least 2 orders of magnitude faster
§ PVLDB 2015 & VLDBJ 2017
• Mizan
§ A general system for distributed graph processing
§ Performance improvements up to 84%
§ EuroSys 2013

Publications
• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015.
• Divy Agrawal, Lamine Ba, Laure Berti-Equille, Sanjay Chawla, Ahmed Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Mohammed J. Zaki, “Rheem: Enabling Multi-Platform Task Execution”, in SIGMOD 2016.
• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015.
• Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin, “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.
• Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, Panos Kalnis, “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013.
