Scaling Big Data Cleansing
PHD DEFENSE	OF: ZUHAIR KHAYYAT
MAY,	2017
What	is	Data	Cleansing?
• Data cleansing is the process of:
  A. detecting errors in record sets, tables, or databases (violation detection)
  B. and fixing them (violation repair)
• Example errors in data:
  • Typos • Duplicates • Values inconsistent with business rules
  • Outliers • Outdated values • Missing values
May	16,	2017 2/73
Why is Data Cleansing Important?
• 25% of the world's critical data is dirty
• 60% - 98% of a data scientist's time is lost in the process of data cleansing
• “duplicate and dirty data costs the healthcare industry over $300 billion every year” -- Joe Fusaro (RingLead)
• “inaccurate data has a direct impact ... the average company losing 12% of its revenue” -- Ben Davis (Econsultancy)
Example of a Dirty Dataset
A company employee database:
• Rule 1: Any two employees in the same Zipcode must be in the same City
• Rule 2: An employee who earns a higher salary must pay more taxes compared to others
ID  Name    Zipcode  City  State  Salary  Rate
t1  Annie   10001    NY    NY     24000   15
t2  Laure   90210    LA    CA     25000   10
t3  John    60601    CH    IL     40000   25
t4  Mark    90210    SF    CA     88000   28
t5  Robert  60827    CH    IL     15000   15
t6  Mary    90210    LA    CA     81000   28
The Process of Data Cleansing
Input: Dirty Input Data + Detection Rules
1st: Detect
2nd: Analyze
3rd: Repair
4th: Update Input Data
Output: Clean Data
[Diagram: the employee table is shown at every stage of the loop]
Why is Dirty Data Still a Problem?
• Data is growing at a 40% compound annual rate
• Source: Oracle, 2012, https://goo.gl/uHd4uR
[Figure annotation: ≈ 15 Zettabytes]
Problems of Big Data Cleansing
[Diagram: the data cleansing process (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data), annotated with "≈ 90% Runtime" and "Most of Research"]
[Figure: Time (Seconds) vs. violation percentage (1%, 5%, 10%, 50%); series: Violation detection, Data repair]
1. Violation	detection	becomes	too	expensive	with	big	data:
a. Enumerating	all	tuples	is	not	possible
b. Not	feasible	to	implement	a	parallel	version	of	each	detection	rule
c. Serial	repair	algorithms	cannot	handle	big	errors
2. Complex error discovery rules based on inequality conditions are too expensive:
Rule 2: An employee who earns a higher salary must pay more taxes compared to others
⇒ (ti.salary < tj.salary) AND (ti.tax > tj.tax)
3. Error	graph	(violation	graph)	is	random,	big	and	unpredictable:
• Irregular	structures
• Skewed	distributions
• Unpredictable	workload	of	algorithm
Problems & Solutions of Big Data Cleansing
Problem 1: Violation detection becomes too expensive with big data
  Solution: Develop a general-purpose scalable data cleansing system (BigDansing)
Problem 2: Complex error discovery rules based on inequality conditions are too expensive
  Solution: Introduce a new join algorithm to enhance inequality joins (IEJoin)
Problem 3: Error graph (violation graph) is random, big and unpredictable
  Solution: Build a general graph system that adapts to various graph structures and algorithms (Mizan)
BigDansing: A System for Big Data Cleansing
Related work?
NADEEF*
[Diagram labels: Declarative Rules, UDF, DBMS]
✓ Easy-to-use
✓ Extensible
✓ Efficient
✗ Scalable (Single Machine)
* M. Dallachiesa, et al., “NADEEF: A Commodity Data Cleaning System,” in SIGMOD 2013
What	does	Big	Data	Cleansing	require?
1. Scale Detection
  ▪ Declarative rules
    ➢ Functional dependencies (FDs, CFDs)
    ➢ Denial constraints (DCs)
  ▪ User defined functions
2. Scale Repairs
  ▪ Handle serial repair algorithms
BigDansing – Scaling Violation Detection
Supported rules: Functional dependencies, Denial constraints, Entity resolution, Inclusion dependencies
Domain Specific Language
Operators: Scope, Block, Iterate, Detect, GenFix
BigDansing – Input
[Diagram: the input is either a UDF (Scope, Block, Iterate, Detect, GenFix) or Declarative Rules passed through the Rule Parser; both yield the Violation Detection Plan (Logical Plan)]
BigDansing – Plan Conversion and Optimization
Logical Plan → Physical Plan → Execution Plan
Rule 1 – Logical Plan
▪ Any two employees in the same Zipcode must be in the same City
▪ FD: Zipcode → City
Logical operators: Scope, Block, Iterate, Detect, GenFix
Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Rule 1 – Physical Plan
Physical operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct
Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix
Rule 1 – Execution Plan
Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-PBlock → Spark-PIterate → Spark-PDetect → Spark-PGenFix
Rule 1 – Execution Example
Plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Input: the employee table (t1 to t6)
1) Scope: keep only Zipcode and City
   t1 10001 NY
   t2 90210 LA
   t3 60601 CH
   t4 90210 SF
   t5 60827 CH
   t6 90210 LA
2) Block: group tuples by Zipcode
   10001: t1 (NY)
   90210: t2 (LA), t4 (SF), t6 (LA)
   60601: t3 (CH)
   60827: t5 (CH)
3) Iterate: (t2, t4), (t2, t6), (t4, t6)
4) Detect: (t2, t4), (t4, t6)
5) GenFix: t2[City] = t4[City], t4[City] = t6[City]
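The five steps of the Rule 1 example can be sketched in plain Python. This is only an illustrative single-machine sketch, not BigDansing's API: the function names (`scope`, `block`, `iterate`, `detect`, `gen_fix`) are hypothetical stand-ins for the logical operators, and the table is the employee table from the slides.

```python
from collections import defaultdict
from itertools import combinations

# Employee table from the slides: id -> (Name, Zipcode, City, State, Salary, Rate)
EMPLOYEES = {
    "t1": ("Annie", "10001", "NY", "NY", 24000, 15),
    "t2": ("Laure", "90210", "LA", "CA", 25000, 10),
    "t3": ("John", "60601", "CH", "IL", 40000, 25),
    "t4": ("Mark", "90210", "SF", "CA", 88000, 28),
    "t5": ("Robert", "60827", "CH", "IL", 15000, 15),
    "t6": ("Mary", "90210", "LA", "CA", 81000, 28),
}


def scope(table):
    """Scope: keep only the attributes the rule needs (Zipcode, City)."""
    return {tid: (row[1], row[2]) for tid, row in table.items()}


def block(scoped):
    """Block: group tuple ids that share the same Zipcode."""
    blocks = defaultdict(list)
    for tid, (zipcode, _city) in scoped.items():
        blocks[zipcode].append(tid)
    return blocks


def iterate(blocks):
    """Iterate: enumerate candidate pairs inside each block."""
    for members in blocks.values():
        yield from combinations(members, 2)


def detect(scoped, pair):
    """Detect: a pair violates the FD when the two cities differ."""
    a, b = pair
    return scoped[a][1] != scoped[b][1]


def gen_fix(pair):
    """GenFix: propose making the two City cells equal."""
    a, b = pair
    return f"{a}[City] = {b}[City]"


scoped = scope(EMPLOYEES)
candidates = list(iterate(block(scoped)))
violations = [pair for pair in candidates if detect(scoped, pair)]
fixes = [gen_fix(pair) for pair in violations]
print(candidates)
print(violations)
print(fixes)
```

Blocking on Zipcode is what keeps the candidate set small: only the three tuples sharing 90210 are ever paired.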
Rule 2 – Logical Plan
▪ An employee who earns a higher salary must pay more taxes compared to others
▪ DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
• For Annie, compare Salary with Laure, John, Mark, Robert, and Mary
• Where the Salary condition holds, compare Rate
• Where both conditions hold: Report a Violation!
Logical operators: Scope, Block, Iterate, Detect, GenFix
Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix
Rule 2 – Physical Plan
Physical operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct
Physical plan: PScope → UCrossProduct → PDetect → PGenFix
Rule 2 – Execution Plan
Execution plan: Spark-PScope → Spark-UCrossProduct → Spark-PDetect → Spark-PGenFix
Plan Optimizations – OCJoin
▪ DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
Steps: Range Partitioning → Sorting → Pruning → Joining
[Diagram: the data is range-partitioned into Partition 1 to Partition n, the partitions are sorted based on Salary and based on Rate, partitions that cannot match are pruned, and the remaining partitions are joined (shown: Partition 2 ⨝ Partition 3 and Partition 5 ⨝ Partition 6)]
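The partition-then-prune idea can be sketched as follows. This is a simplified single-machine illustration under my own assumptions, not BigDansing's OCJoin implementation: the function name `ocjoin_sketch`, the fixed number of partitions, and the use of per-partition min/max values as the pruning test are all hypothetical choices.

```python
# Employee table reduced to the two attributes of Rule 2: id -> (Salary, Rate)
ROWS = {
    "t1": (24000, 15), "t2": (25000, 10), "t3": (40000, 25),
    "t4": (88000, 28), "t5": (15000, 15), "t6": (81000, 28),
}


def ocjoin_sketch(rows, num_partitions=3):
    """Return (pairs, joined) where pairs are all (a, b) with
    a.Salary < b.Salary and a.Rate > b.Rate, and joined counts the
    partition pairs that survived pruning."""
    if not rows:
        return [], 0
    # Range partitioning: sort by Salary and cut into contiguous chunks.
    ordered = sorted(rows, key=lambda t: rows[t][0])
    size = -(-len(ordered) // num_partitions)  # ceiling division
    parts = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    # Per-partition (min Salary, max Salary, min Rate, max Rate).
    stats = []
    for part in parts:
        salaries = [rows[t][0] for t in part]
        rates = [rows[t][1] for t in part]
        stats.append((min(salaries), max(salaries), min(rates), max(rates)))

    pairs, joined = [], 0
    for p, (p_min_s, _, _, p_max_r) in zip(parts, stats):
        for q, (_, q_max_s, q_min_r, _) in zip(parts, stats):
            # Pruning: a match needs some a in p and b in q with
            # a.Salary < b.Salary and a.Rate > b.Rate.
            if not (p_min_s < q_max_s and p_max_r > q_min_r):
                continue
            joined += 1
            pairs.extend(
                (a, b) for a in p for b in q
                if rows[a][0] < rows[b][0] and rows[a][1] > rows[b][1]
            )
    return pairs, joined


pairs, joined = ocjoin_sketch(ROWS)
print(sorted(pairs), joined)
```

On the six-row table only 2 of the 9 partition pairs survive pruning, yet the result is the same as a full pairwise comparison.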
Rule 2 – Execution Plan
▪ Rule 2: An employee who earns a higher salary must pay more taxes compared to others
▪ DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix
Physical plan: PScope → OCJoin(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-OCJoin → Spark-PDetect → Spark-PGenFix
BigDansing – Structure of the Violation Graph
Rule 1: Zipcode → City
Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
[Diagram: violation graph over t1, t2, t4, t5, t6 with edges labeled "R1: City" and "R2: Salary, Tax"]
• Rule 1: t2[City] = t4[City]
• Rule 1: t4[City] = t6[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Tax] < t2[Tax]
• Rule 2: t5[Salary] > t2[Salary] OR t5[Tax] < t2[Tax]
BigDansing – Data Repair as a Black box
[Diagram: Graph Analysis splits the violation graph into subgraphs (one with the "R2: Salary, Tax" edges over t1, t5, t2 and one with the "R1: City" edges over t2, t4, t6); each subgraph is handed to its own instance of the Serial Repair Algorithm]
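One way to read the split shown in the diagram is as connected components over the cells touched by each violation, so that independent components can go to separate repair instances. The sketch below assumes exactly that reading; treating nodes as (tuple, attribute) cells and the function name `connected_components` are my assumptions, not something the slides state.

```python
def connected_components(violations):
    """violations: list of non-empty sets of cells, a cell being a
    (tuple_id, attribute) pair. Cells that share a violation end up in
    the same component. Returns a sorted list of sorted components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for cells in violations:
        cells = sorted(cells)
        root = find(cells[0])
        for other in cells[1:]:
            parent[find(other)] = root  # union with the first cell's set

    groups = {}
    for cell in list(parent):
        groups.setdefault(find(cell), set()).add(cell)
    return sorted(sorted(group) for group in groups.values())


# The four violations listed on the slide, expressed as the cells they involve.
VIOLATIONS = [
    {("t2", "City"), ("t4", "City")},
    {("t4", "City"), ("t6", "City")},
    {("t1", "Salary"), ("t1", "Rate"), ("t2", "Salary"), ("t2", "Rate")},
    {("t5", "Salary"), ("t5", "Rate"), ("t2", "Salary"), ("t2", "Rate")},
]

components = connected_components(VIOLATIONS)
for component in components:
    print(component)
```

Each component could then be passed unchanged to an existing serial repair algorithm, which is what makes the repair step a black box.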
BigDansing – Apache	Spark	Stack	
Performance of a Single Machine
[Figure, panels Rule 1 and Rule 2: Runtime (Seconds) vs. dataset size in rows (100,000 / 1,000,000 / 10,000,000 in one panel; 100,000 / 200,000 / 300,000 in the other); series: BigDansing, NADEEF, PostgreSQL, Spark SQL, Shark]
Performance on a 16-machine cluster
[Figure, panels Rule 1 and Rule 2: Time (Seconds) vs. dataset size in rows (10M / 20M / 40M in one panel, with BigDansing-Spark, BigDansing-Hadoop, Spark SQL, Shark; 1M / 2M / 3M in the other, with BigDansing-Spark, Spark SQL, Shark)]
Performance on a 16-machine cluster
[Figure: Runtime (Seconds) vs. #-workers (1, 2, 4, 8, 16); series: BigDansing, Spark SQL]
[Figure: Time (Seconds) vs. dataset size in rows (647M, 959M, 1271M, 1583M, 1907M); series: BigDansing-Spark, BigDansing-Hadoop, Spark SQL]
Detecting Violations on RDF
[Diagram: Scope, then Block 1 / Iterate 1, Block 2 / Iterate 2, Block 3 / Iterate 3, then Detect and GenFix]
Detecting Violations on RDF
[Figure: Runtime (Seconds) vs. number of RDF triples (21M, 42M, 85M, 170M) for BigDansing and S2RDF*; each bar is split into Pre-processing and Violation Detection]
* Alexander Schätzle, et al., “S2RDF: RDF Querying with SPARQL on Spark”, in PVLDB 2016
BigDansing:	A	System	for	Big	Data	Cleansing
✓ Easy-to-use
✓ Efficient
✓ Extensible
✓ Scalable
* Zuhair Khayyat, et al., “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.
IEJoin: Fast and Scalable Inequality Joins
OCJoin in BigDansing
[Figure: Runtime (Seconds) vs. data size in rows (100,000 / 200,000 / 300,000); series: OCJoin, UCrossProduct, Cross product]
[Figure: Time (Seconds) vs. dataset size in rows (1M, 2M, 3M); series: BigDansing-Spark, Spark SQL, Shark]
What is the Problem?
• Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
  ▪ SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
• Processed as a Cartesian product: O(n²)
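The quadratic cost is easy to see in a naive evaluation of the join, sketched here on the employee table from the earlier slides (reduced to Salary and Rate). The function name `naive_violations` is hypothetical; the comparison counter is only there to make the n × n behaviour visible.

```python
# Employee table reduced to the two attributes of Rule 2: id -> (Salary, Rate)
ROWS = {
    "t1": (24000, 15), "t2": (25000, 10), "t3": (40000, 25),
    "t4": (88000, 28), "t5": (15000, 15), "t6": (81000, 28),
}


def naive_violations(rows):
    """Compare every tuple with every tuple (a Cartesian product)."""
    comparisons = 0
    violations = []
    for a, (salary_a, rate_a) in rows.items():
        for b, (salary_b, rate_b) in rows.items():
            comparisons += 1
            if salary_a < salary_b and rate_a > rate_b:
                violations.append((a, b))
    return violations, comparisons


violations, comparisons = naive_violations(ROWS)
print(violations, comparisons)
```

Six rows already cost 36 comparisons to find two violating pairs; the count grows with the square of the table size.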
Related	Work
• Band Join:
  ▪ Based on a point within a range: R.A − c1 ≤ S.B & S.B ≤ R.A + c2
• Interval join in temporal and spatial data: not general
• Spatial indexing:
  ▪ Large memory footprint
  ▪ Expensive preprocessing
IEJoin – a	New	Join	Algorithm
• In data cleansing:
  ▪ Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
• Interval intersection:
  ▪ Q2: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≥ s.start
• Joining tables with (≠):
  ▪ Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start
Algorithm Discovery
Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
ID  Salary  Rate
t1  100     5
t2  90      9
t3  150     15
t4  120     10
Sort descending on Salary: t3(150), t4(120), t1(100), t2(90)
Salary partial answer: {(t2, t1), (t2, t4), (t2, t3), (t1, t4), (t1, t3), (t4, t3)}
Sort descending on Rate: t3(15), t4(10), t2(9), t1(5)
Rate partial answer: {(t1, t2), (t1, t4), (t1, t3), (t2, t4), (t2, t3), (t4, t3)}
The expected result is: (t2, t1)
IEJoin – the Algorithm
• Sort descending on Salary: t3(150), t4(120), t1(100), t2(90), at positions 0 1 2 3
• Sort descending on Rate: t3(15), t4(10), t2(9), t1(5)
• Permutation Array: 0 1 3 2
• Bit-Array (t3, t4, t2, t1): starts as 0 0 0 0; bits are set to 1 as tuples are visited
[Diagram labels: Sequential scan, Random access]
Result = (t2, t1)
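The sort, permutation-array, and bit-array steps can be sketched for the self-join Q1 as below. This is a simplified reading of the slide, not the published algorithm: the bit array is a plain Python list scanned linearly, and the handling of ties (grouping equal Rate values, using `bisect` on Salary) is my own addition. The function name `iejoin_self` is hypothetical.

```python
from bisect import bisect_right
from itertools import groupby

# Four-row example from the slide: id -> (Salary, Rate)
ROWS = {"t1": (100, 5), "t2": (90, 9), "t3": (150, 15), "t4": (120, 10)}


def iejoin_self(rows):
    """Pairs (a, b) with a.Salary < b.Salary and a.Rate > b.Rate."""
    ids = list(rows)
    # Sort descending on Salary; pos maps each id to its position.
    by_salary = sorted(ids, key=lambda t: rows[t][0], reverse=True)
    neg_salaries = [-rows[t][0] for t in by_salary]  # ascending, for bisect
    pos = {t: p for p, t in enumerate(by_salary)}
    # Sort descending on Rate; the permutation array links the two orders.
    by_rate = sorted(ids, key=lambda t: rows[t][1], reverse=True)
    permutation = [pos[t] for t in by_rate]

    bits = [0] * len(ids)
    result = []
    # Visit tuples by descending Rate, one group of equal rates at a time,
    # so every tuple already marked in the bit array has a strictly
    # larger Rate than the tuple being probed.
    for _, group in groupby(range(len(by_rate)),
                            key=lambda k: rows[by_rate[k]][1]):
        group = list(group)
        for k in group:
            b = by_rate[k]
            # First position whose Salary is strictly below b's Salary.
            start = bisect_right(neg_salaries, -rows[b][0])
            for p in range(start, len(bits)):
                if bits[p]:
                    result.append((by_salary[p], b))
        for k in group:
            bits[permutation[k]] = 1
    return result


print(iejoin_self(ROWS))
```

The permutation array computed for the slide's data is 0 1 3 2, matching the slide, and the only pair produced is (t2, t1). A production version would pack the bit array into machine words rather than scan a list.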
Sorting Orders
Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
(OP1 is the operator of the Salary condition, OP2 the operator of the Rate condition)
• For self joins:
  ▪ Salary: ascending order if OP1 is either > or ≥, otherwise descending order
  ▪ Rate: descending order if OP2 is either > or ≥, otherwise ascending order
• For non-self joins:
  ▪ Salary: descending order if OP1 is either > or ≥, otherwise ascending order
  ▪ Rate: ascending order if OP2 is either > or ≥, otherwise descending order
Optimizations – Bitmap Index
Bit-array B, split into chunks C1 C2 C3 C4: 0000 0101 0000 0000
Bitmap index (one bit per chunk): 0 1 0 0
Scan examples: (i) pos 6 (ii) pos 9
[Diagram label: max]
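A chunk-level summary like the one in the diagram lets a scan skip chunks that contain no set bit. The sketch below is a minimal illustration of that idea; the class name `ChunkedBitArray`, the chunk size of 4, and the 0-based positions are assumptions made for the example and are not taken from the slides.

```python
class ChunkedBitArray:
    """Bit array with a one-bit-per-chunk summary used to skip empty chunks."""

    def __init__(self, size, chunk_size=4):
        self.bits = [0] * size
        self.chunk_size = chunk_size
        self.summary = [0] * (-(-size // chunk_size))  # ceiling division

    def set(self, pos):
        self.bits[pos] = 1
        self.summary[pos // self.chunk_size] = 1

    def scan_from(self, start):
        """Return (set positions >= start, number of bits inspected)."""
        found, inspected = [], 0
        for chunk in range(start // self.chunk_size, len(self.summary)):
            if not self.summary[chunk]:
                continue  # whole chunk is empty, skip it
            lo = max(start, chunk * self.chunk_size)
            hi = min(len(self.bits), (chunk + 1) * self.chunk_size)
            for pos in range(lo, hi):
                inspected += 1
                if self.bits[pos]:
                    found.append(pos)
        return found, inspected


bitmap = ChunkedBitArray(16)
bitmap.set(5)
bitmap.set(7)
print(bitmap.summary)
print(bitmap.scan_from(6))
print(bitmap.scan_from(9))
```

A scan that starts past the only non-empty chunk inspects no bits at all, which is the saving the index is meant to provide.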
Optimizations	– Not	Equal	Operator	
• Convert each (≠) into one (>) and one (<) joined with the UNION ALL operator
Qk:  SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start
Q’k: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end < s.start
     UNION ALL
     SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end > s.start
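The rewrite is safe because the (<) and (>) branches can never return the same pair, so UNION ALL introduces no duplicates. A small check in Python, using a made-up `EVENTS` table (the three events and the helper `join` are purely illustrative):

```python
# Hypothetical events: id -> (start, end)
EVENTS = {"e1": (1, 5), "e2": (5, 8), "e3": (6, 9)}


def join(events, second_condition):
    """Self-join on r.start <= s.end AND second_condition(r.end, s.start)."""
    return [
        (r, s)
        for r, (r_start, r_end) in events.items()
        for s, (s_start, s_end) in events.items()
        if r_start <= s_end and second_condition(r_end, s_start)
    ]


# Qk: the original query with the not-equal condition.
qk = join(EVENTS, lambda r_end, s_start: r_end != s_start)

# Q'k: one query with <, one with >, concatenated like UNION ALL.
qk_rewritten = (
    join(EVENTS, lambda r_end, s_start: r_end < s_start)
    + join(EVENTS, lambda r_end, s_start: r_end > s_start)
)

print(sorted(qk) == sorted(qk_rewritten))
```

Both branches are now plain inequality joins, which is the form an inequality-join operator can process.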
Optimizations – Selectivity Estimation
• A query with three attributes: r.Salary < s.Salary AND r.Rate > s.Rate AND r.Age > s.Age
• Use sampling to estimate the maximum output size – Est(Salary, Rate), Est(Salary, Age), Est(Rate, Age)
Steps: Range Partitioning → Sorting → Pruning → Calculate Max Output
[Diagram: Partition 1 to Partition n sorted based on OP1 and based on OP2; overlapping partitions are counted]
Estimated Output = number of overlapping partitions = 2
IEJoin and	BigDansing
Serial IEJoin vs. Naïve Baseline
[Figure, panels Salary-Rate and Interval Intersection: Runtime (Seconds, log scale from 0.01 to 10000) vs. input size (10K, 50K, 100K); series: PG-IEJoin, PG-Original, MonetDB, DBMS-X]
Serial IEJoin vs. Postgres with Index – 50M Rows
[Figure: Runtime (Seconds), split into Indexing and Querying, for PG-IEJoin, PG-GiST, PG-BTree on Q1 and Q2]
GiST: Generalized Search Tree
Parallel and Distributed IEJoin – 100M Rows
[Figure, panels Salary-Rate and Interval Intersection: Runtime (Seconds), split into Indexing and Querying, for Parallel-IEJoin, Distributed-IEJoin, DPG-GiST, DPG-BTree, SparkSQL-SM, SparkSQL]
IEJoin
• A new join algorithm
• Based on conditions: (<, ≤, >, ≥, ≠)
• Extremely fast and highly scalable
• Utilizes sorting and efficient data structures
• Easy to implement in traditional DBMS and distributed systems
* Zuhair Khayyat, et al., “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015
* Zuhair Khayyat, et al., “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
BigDansing’s implementations
[Diagram: the data cleansing process (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data) with BigDansing shown on top of Apache Hadoop, Giraph, Apache Spark, GraphX, and HDFS]
Pregel*/Giraph Abstraction
• Based on vertex-centric computation
• Abstraction:
  ▪ compute(), combine() & aggregate()
• Synchronous in-memory bulk synchronous parallel (BSP)
[Diagram: Worker 1, Worker 2, and Worker 3 run Superstep 1, Superstep 2, and Superstep 3, separated by BSP barriers]
* G. Malewicz, et al., “Pregel: A System for Large-Scale Graph Processing,” in SIGMOD 2010
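The vertex-centric BSP model can be illustrated with a tiny single-process simulation. This is not Pregel's or Giraph's API: the function `run_max_value`, the message dictionaries, and the example graph are all hypothetical, and only the per-vertex compute step and the barrier are modelled (no combine() or aggregate()).

```python
def run_max_value(graph, values):
    """graph: vertex -> list of neighbours (assumed undirected and connected);
    values: vertex -> number. Every vertex ends with the global maximum.
    Returns (final values, number of supersteps executed)."""
    values = dict(values)
    inbox = {v: [] for v in graph}
    active = set(graph)  # every vertex is active in superstep 0
    superstep = 0
    while active:
        outbox = {v: [] for v in graph}
        for v in active:
            # compute(): runs once per active vertex in each superstep
            best = max([values[v]] + inbox[v])
            changed = best > values[v]
            values[v] = best
            if superstep == 0 or changed:
                for neighbour in graph[v]:
                    outbox[neighbour].append(best)
        # BSP barrier: messages sent in this superstep are visible only now
        inbox = outbox
        active = {v for v in graph if inbox[v]}
        superstep += 1
    return values, superstep


GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final_values, supersteps = run_max_value(GRAPH, {"a": 3, "b": 6, "c": 2})
print(final_values, supersteps)
```

A vertex that receives no messages stays inactive, and the computation stops once no messages are in flight after a barrier.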
Problems of Giraph
The Light Side
  ▪ Algorithm: Predictable
  ▪ Structure: Fixed
The Dark Side
  ▪ Algorithm: Unforeseen
  ▪ Structure: Variable
Error graph (violation graph) is random, big and unpredictable
How Giraph Optimizes Computations
1. Faster Graph Loading
  ▪ Simple graph partitioning
    ▪ Hash, Range
2. Optimized for graph structure
  ▪ Sophisticated and expensive partitioning techniques
    ▪ Min-cuts
[Figure: Run Time (Min) with Hash, Range, and Min-cuts partitioning on LiveJournal, kgraph4m68m, and arabic-2005]
The runtime of a single iteration is as fast as the slowest worker
Behaviors of Different Graph Algorithms
PageRank vs. Distributed Minimal Spanning Tree
[Figure: In Messages (Millions, log scale) vs. SuperSteps (0 to 60); series: PageRank - Total, PageRank - Max/W, DMST - Total, DMST - Max/W]
Source	of	Imbalance	in	Giraph
1. High	vertex	response	time
2. Long	time	to	receive	incoming	messages
3. Long	time	to	send	outgoing	messages
[Diagrams: one Superstep 1 timeline per case (high vertex response time, long time to receive in messages, long time to send out messages), each showing Worker 1, Worker 2, and Worker 3 reaching the BSP barrier at different times]
Mizan – Solving	the	Workload	Imbalance
• Move vertices between workers during runtime
• Planning and vertex migrations within the BSP barrier to maintain computation consistency
[Diagram: Worker 1, Worker 2, and Worker 3 across Superstep 1, Superstep 2, and Superstep 3; each BSP Barrier is followed by the Migration Planner and a Migration Barrier]
[Diagram: Mizan Worker components: Vertex Compute(), BSP Graph Processor, Load Balancer: Migration Planner, Communicator - DHT, Storage Manager, IO (HDFS/Local Disks)]
Mizan’s Migration Planning Steps
1. Identify the source of workload imbalance across workers
  ▪ Remote outgoing messages
  ▪ All incoming messages
  ▪ Response time
[Diagram: vertices V1 to V6 on Worker 1 and Worker 2, annotated with Local Incoming Messages, Remote Incoming Messages, Remote Outgoing Messages, and Vertex Response Time]
2. Select the migration objective through a statistical analysis
  ▪ Optimize for outgoing messages, or
  ▪ Optimize for incoming messages, or
  ▪ Optimize for response time
3. Pair over-utilized workers with under-utilized ones
[Diagram: workers ordered at positions 0 to 8 as W7 W2 W1 W5 W8 W4 W0 W6 W3, with W9 shown separately]
4. Select vertices to migrate
  ▪ Select the least number of vertices that has the highest impact
  ▪ Vertex ownership: distributed hash table (DHT)
  ▪ Delayed migration: reduce migration cost
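Step 3 can be sketched as sorting workers by the chosen objective and pairing the two ends of the ordered list. The pairing rule used here (heaviest with lightest, second heaviest with second lightest, and so on) is my assumption about what the ordered-list diagram depicts; the function name `pair_workers` and the load numbers are hypothetical.

```python
def pair_workers(load):
    """load: worker -> value of the migration objective (e.g. outgoing messages).
    Returns (over_utilized, under_utilized) pairs; with an odd number of
    workers the median worker stays unpaired."""
    ordered = sorted(load, key=load.get, reverse=True)
    return [(ordered[i], ordered[-1 - i]) for i in range(len(ordered) // 2)]


# Hypothetical per-worker loads for one superstep.
LOAD = {"W0": 5, "W1": 9, "W2": 1, "W3": 7, "W4": 4}
pairs = pair_workers(LOAD)
print(pairs)
```

Each pair then only needs to agree between its two members on which vertices to move, so planning does not require global coordination beyond the ordering.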
Performance of Mizan on PageRank
[Figure: Runtime (Min) for Static, WS, and Mizan under Hash, Range, and Metis partitioning]
Performance of Mizan with Metis
[Figure: Runtime (Min) for Static, Work Stealing, and Mizan on the Advertisement and DMST workloads]
Mizan – a	General	Graph	Processing	System
• A Pregel-clone
  ▪ Supports very large graphs
  ▪ Runs on very large clusters
• Dynamic fine-grained vertex migrations to balance computation and communication
• Optimized for predictable and non-predictable graph algorithms and structures
[Diagram labels: BigDansing, Apache Spark, GraphX, Mizan, Giraph, HDFS]
* Zuhair Khayyat, et al., “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013
Summary
BigDansing
  ▪ A general system for big data cleansing
  ▪ Performance up to 2 orders of magnitude faster
  ▪ SIGMOD 2015
IEJoin
  ▪ A novel algorithm for fast inequality joins
  ▪ Performance at least 2 orders of magnitude faster
  ▪ PVLDB 2015 & VLDBJ 2017
Mizan
  ▪ A general system for distributed graph processing
  ▪ Performance improvements up to 84%
  ▪ EuroSys 2013
Publications
• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, special issue: Best Papers of VLDB 2015.
• Divy Agrawal, Lamine Ba, Laure Berti-Equille, Sanjay Chawla, Ahmed Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Mohammed J. Zaki, “Rheem: Enabling Multi-Platform Task Execution”, in SIGMOD 2016.
• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015.
• Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin, “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.
• Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, Panos Kalnis, “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013.

Scaling Big Data Cleansing

  • 2. What is Data Cleansing? Data cleansing is the process of (A) detecting errors in record sets, tables, or databases (violation detection) and (B) fixing them (violation repair). Example errors in data: typos, duplicates, values inconsistent with business rules, outliers, outdated values, and missing values.
  • 3. Why is Data Cleansing Important? 25% of the world's critical data is dirty. 60% - 98% of a data scientist's time is lost in the process of data cleansing. “duplicate and dirty data costs the healthcare industry over $300 billion every year” -- Joe Fusaro (RingLead). “inaccurate data has a direct impact ... the average company losing 12% of its revenue” -- Ben Davis (Econsultancy).
  • 4. Example of a Dirty Dataset. A company employee database:
    Rule 1: Any two employees in the same Zipcode must be in the same City.
    Rule 2: An employee who earns a higher salary must pay more taxes than employees who earn less.
         Name    Zipcode  City  State  Salary  Rate
    t1   Annie   10001    NY    NY     24000   15
    t2   Laure   90210    LA    CA     25000   10
    t3   John    60601    CH    IL     40000   25
    t4   Mark    90210    SF    CA     88000   28
    t5   Robert  60827    CH    IL     15000   15
    t6   Mary    90210    LA    CA     81000   28
  • 5. The Process of Data Cleansing. [Figure: cleansing pipeline over the employee table of slide 4. Inputs: Detection Rules and Input Data (dirty). Steps: 1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data. Output: Clean Data.]
  • 7. Problems of Big Data Cleansing. [Figure: the cleansing pipeline of slide 5 (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data) with the annotations "≈ 90% Runtime" and "Most of Research". Chart: Time (Seconds), 0 to 100, against violation percentage (1%, 5%, 10%, 50%), with the series "Violation detection" and "Data repair".]
  • 8. Problems of Big Data Cleansing. [Figure: the cleansing pipeline of slide 5.]
    1. Violation detection becomes too expensive with big data:
       a. Enumerating all tuples is not possible
       b. It is not feasible to implement a parallel version of each detection rule
       c. Serial repair algorithms cannot handle big errors
  • 9. Problems of Big Data Cleansing. [Figure: the cleansing pipeline of slide 5.]
    2. Complex error discovery rules based on inequality conditions are too expensive:
       Rule 2: An employee who earns higher salary must pay more taxes compared to others
       → (ti.salary < tj.salary) AND (ti.tax > tj.tax)
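To make the cost of the inequality rule on slide 9 concrete, here is a minimal sketch of the naive way to evaluate it: compare every ordered pair of tuples. It is an illustration only, not the IEJoin algorithm and not BigDansing code. The function name `naive_rule2_violations` is hypothetical, and the sketch assumes that the `tax` attribute in the slide's condition corresponds to the `Rate` column of the sample table.

```python
from itertools import permutations

# Sample employee table from the slides: id -> (salary, tax rate).
# Assumption: "tax" in the rule corresponds to the "Rate" column.
EMPLOYEES = {
    "t1": (24000, 15),
    "t2": (25000, 10),
    "t3": (40000, 25),
    "t4": (88000, 28),
    "t5": (15000, 15),
    "t6": (81000, 28),
}


def naive_rule2_violations(rows):
    """Return ordered pairs (ti, tj) with ti.salary < tj.salary and ti.rate > tj.rate."""
    violations = []
    # permutations(..., 2) yields every ordered pair, i.e. n * (n - 1) comparisons.
    for (i, (sal_i, rate_i)), (j, (sal_j, rate_j)) in permutations(rows.items(), 2):
        if sal_i < sal_j and rate_i > rate_j:
            violations.append((i, j))
    return violations


print(naive_rule2_violations(EMPLOYEES))
# → [('t1', 't2'), ('t5', 't2')]
```

With six tuples this is 30 comparisons; the count grows quadratically with the number of tuples, which is the expense the slide refers to.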
  • 10. Problems of Big Data Cleansing. [Figure: the cleansing pipeline of slide 5.]
    3. Error graph (violation graph) is random, big and unpredictable:
       • Irregular structures
       • Skewed distributions
       • Unpredictable workload of algorithm
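As a rough illustration of what a violation graph looks like, the sketch below treats each tuple as a node and each violating pair as an edge, then reports node degrees and connected components. The edge list is an assumption made for this example: the two Rule 1 pairs shown on slide 21, plus the two pairs obtained by applying the Rule 2 condition of slide 9 to the sample table by hand (reading `tax` as the `Rate` column). The helper names are hypothetical and this is not Mizan code.

```python
from collections import defaultdict, deque

# Assumed edge list: Rule 1 pairs (slide 21) plus hand-derived Rule 2 pairs.
VIOLATIONS = [("t2", "t4"), ("t4", "t6"), ("t1", "t2"), ("t5", "t2")]


def build_violation_graph(pairs):
    """Build an undirected adjacency map: tuple id -> set of conflicting tuple ids."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components of the graph using breadth-first search."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        queue, component = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(graph[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


graph = build_violation_graph(VIOLATIONS)
degrees = {node: len(neighbors) for node, neighbors in sorted(graph.items())}
print(degrees)
# → {'t1': 1, 't2': 3, 't4': 2, 't5': 1, 't6': 1}
print(connected_components(graph))
# → [['t1', 't2', 't4', 't5', 't6']]
```

Even on six tuples the degrees are uneven (t2 takes part in three violations, t3 in none), a small-scale hint of the skew the slide describes.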
  • 11. Problems & Solutions of Big Data Cleansing.
    Problems:
       1. Violation detection becomes too expensive with big data
       2. Complex error discovery rules based on inequality conditions are too expensive
       3. Error graph (violation graph) is random, big and unpredictable
    Solutions:
       • BigDansing: develop a general-purpose scalable data cleansing system
       • IEJoin: introduce a new join algorithm to enhance inequality joins
       • Mizan: build a general graph system that adapts to various graph structures and algorithms
  • 12. BigDansing: A System for Big Data Cleansing May 16, 2017 12/73
  • 14. What does Big Data Cleansing require? 1. Scale Detection: declarative rules, i.e. functional dependencies (FDs, CFDs) and denial constraints (DCs); user-defined functions 2. Scale Repairs: handle serial repair algorithms May 16, 2017 14/73
  • 16. BigDansing – Input [Diagram labels: UDF; Declarative Rules; Rule Parser; Violation Detection Plan (Logical Plan) with the operators Scope, Block, Iterate, Detect, GenFix] May 16, 2017 16/73
  • 18. Rule 1 – Logical Plan Any two employees in same Zipcode must be in same City. FD: Zipcode → City Logical Operators: Scope, Block, Iterate, Detect, GenFix Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix May 16, 2017 18/73
  • 19. Rule 1 – Physical Plan Any two employees in same Zipcode must be in same City. FD: Zipcode → City Physical Operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix May 16, 2017 19/73
  • 20. Rule 1 – Execution Plan Any two employees in same Zipcode must be in same City. FD: Zipcode → City Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix Execution plan: Spark-PScope → Spark-PBlock → Spark-PIterate → Spark-PDetect → Spark-PGenFix May 16, 2017 20/73
  • 21. Rule 1 – Execution Example Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
    1) Scope, keeps (Zipcode, City) per tuple: t1 (10001, NY), t2 (90210, LA), t3 (60601, CH), t4 (90210, SF), t5 (60827, CH), t6 (90210, LA)
    2) Block, groups tuples by Zipcode: {t1}, {t2, t4, t6}, {t3}, {t5}
    3) Iterate, candidate pairs: (t2, t4), (t2, t6), (t4, t6)
    4) Detect, violating pairs: (t2, t4), (t4, t6)
    5) GenFix, possible fixes: t2[City] = t4[City], t4[City] = t6[City]
    May 16, 2017 21/73
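The five steps of the Rule 1 example can be sketched in plain Python. This is only an illustration of the operator sequence on the six-tuple employee table from the slides; the function names and the in-memory dictionaries are assumptions made for the sketch and are not BigDansing's actual API.

```python
from collections import defaultdict
from itertools import combinations

# Employee table from the slides: id -> (Name, Zipcode, City, State, Salary, Rate)
EMPLOYEES = {
    "t1": ("Annie", "10001", "NY", "NY", 24000, 15),
    "t2": ("Laure", "90210", "LA", "CA", 25000, 10),
    "t3": ("John", "60601", "CH", "IL", 40000, 25),
    "t4": ("Mark", "90210", "SF", "CA", 88000, 28),
    "t5": ("Robert", "60827", "CH", "IL", 15000, 15),
    "t6": ("Mary", "90210", "LA", "CA", 81000, 28),
}

def scope(table):
    """Scope: keep only the attributes the rule needs (Zipcode, City)."""
    return {tid: (row[1], row[2]) for tid, row in table.items()}

def block(scoped):
    """Block: group tuple ids that share the same Zipcode."""
    blocks = defaultdict(list)
    for tid, (zipcode, _city) in scoped.items():
        blocks[zipcode].append(tid)
    return blocks

def iterate(blocks):
    """Iterate: enumerate candidate pairs inside each block only."""
    for tids in blocks.values():
        yield from combinations(tids, 2)

def detect(scoped, pairs):
    """Detect: a pair violates Zipcode -> City if the two cities differ."""
    return [(a, b) for a, b in pairs if scoped[a][1] != scoped[b][1]]

def gen_fix(violations):
    """GenFix: propose making the two City values equal."""
    return [f"{a}[City] = {b}[City]" for a, b in violations]

scoped = scope(EMPLOYEES)
candidates = list(iterate(block(scoped)))
violations = detect(scoped, candidates)
fixes = gen_fix(violations)
print(candidates)
print(violations)
print(fixes)
```

Blocking is what keeps the candidate set small here: only the three tuples that share Zipcode 90210 are paired, instead of all fifteen pairs in the table.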
  • 22. Rule 2 – Logical Plan An employee who earns higher salary must pay more taxes compared to others. DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) Example on the employee table: for Annie, compare Salary with Laure, John, Mark, Robert, and Mary; for the pairs that satisfy the Salary condition, compare Rate; report a violation when both conditions hold. May 16, 2017 22/73
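The pairwise comparison described for Rule 2 can be written as a naive check over all ordered pairs. The sketch below is not taken from the slides; it only restates the denial constraint on the (Salary, Rate) columns of the employee table, and the names `SALARY_RATE` and `naive_dc_violations` are made up for illustration.

```python
from itertools import permutations

# (Salary, Rate) columns of the employee table from the slides
SALARY_RATE = {
    "t1": (24000, 15),
    "t2": (25000, 10),
    "t3": (40000, 25),
    "t4": (88000, 28),
    "t5": (15000, 15),
    "t6": (81000, 28),
}

def naive_dc_violations(rows):
    """Return ordered pairs (ti, tj) with ti.Salary < tj.Salary and ti.Rate > tj.Rate."""
    return [
        (i, j)
        for i, j in permutations(rows, 2)  # every ordered pair: n * (n - 1) checks
        if rows[i][0] < rows[j][0] and rows[i][1] > rows[j][1]
    ]

violations = naive_dc_violations(SALARY_RATE)
comparisons = len(SALARY_RATE) * (len(SALARY_RATE) - 1)
print(violations, comparisons)
```

On six tuples this costs 30 comparisons; the quadratic growth of that number is what makes the rule expensive on large inputs.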
  • 23. Rule 2 – Logical Plan An employee who earns higher salary must pay more taxes compared to others. DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) Logical Operators: Scope, Block, Iterate, Detect, GenFix Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix May 16, 2017 23/73
  • 24. Rule 2 – Physical Plan An employee who earns higher salary must pay more taxes compared to others. DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) Physical Operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix Physical plan: PScope → UCrossProduct → PDetect → PGenFix May 16, 2017 24/73
  • 25. Rule 2 – Execution Plan An employee who earns higher salary must pay more taxes compared to others. DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix Physical plan: PScope → UCrossProduct → PDetect → PGenFix Execution plan: Spark-PScope → Spark-UCrossProduct → Spark-PDetect → Spark-PGenFix May 16, 2017 25/73
  • 26. Plan Optimizations – OCJoin DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) Steps: Range Partitioning → Sorting → Pruning → Joining [Diagram: the data is range-partitioned into Partition 1 … Partition n, sorted based on Salary and based on Rate; after pruning, only a subset of the partitions (Partitions 2, 3, 5, and 6 in the figure) is joined (⨝)] May 16, 2017 26/73
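One way to picture the partition, prune, and join steps is the sketch below. It is an assumption-laden illustration rather than OCJoin itself: it range-partitions on Salary only, keeps per-partition minimum and maximum values, and skips every pair of partitions whose value ranges cannot satisfy both inequalities. All names are hypothetical.

```python
SALARY_RATE = {
    "t1": (24000, 15), "t2": (25000, 10), "t3": (40000, 25),
    "t4": (88000, 28), "t5": (15000, 15), "t6": (81000, 28),
}

def range_partition(rows, num_parts):
    """Sort on Salary and cut the sorted list into contiguous ranges."""
    ordered = sorted(rows.items(), key=lambda kv: kv[1][0])
    size = -(-len(ordered) // num_parts)  # ceiling division
    return [ordered[k:k + size] for k in range(0, len(ordered), size)]

def bounds(part):
    """Return (min Salary, max Salary, min Rate, max Rate) of a partition."""
    salaries = [salary for _tid, (salary, _rate) in part]
    rates = [rate for _tid, (_salary, rate) in part]
    return min(salaries), max(salaries), min(rates), max(rates)

def may_match(left, right):
    """A pair ti in left, tj in right with ti.Salary < tj.Salary and
    ti.Rate > tj.Rate can exist only if the ranges allow it."""
    left_min_salary, _, _, left_max_rate = bounds(left)
    _, right_max_salary, right_min_rate, _ = bounds(right)
    return left_min_salary < right_max_salary and left_max_rate > right_min_rate

def pruned_join(rows, num_parts=3):
    parts = range_partition(rows, num_parts)
    joined_pairs, out = 0, []
    for left in parts:
        for right in parts:
            if not may_match(left, right):
                continue  # pruned: no tuple pair here can violate the rule
            joined_pairs += 1
            out += [(i, j) for i, (si, ri) in left for j, (sj, rj) in right
                    if si < sj and ri > rj]
    return sorted(out), joined_pairs

print(pruned_join(SALARY_RATE))
```

With three partitions of two tuples each, only two of the nine partition pairs survive pruning in this toy run, and the result matches the naive pairwise check.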
  • 27. Rule 2 – Execution Plan Rule 2: An employee who earns higher salary must pay more taxes compared to others. DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix Physical plan: PScope → OCJoin(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → PDetect → PGenFix Execution plan: Spark-PScope → Spark-OCJoin → Spark-PDetect → Spark-PGenFix May 16, 2017 27/73
  • 28. What does Big Data Cleansing require? 1. Scale Detection: declarative rules, i.e. functional dependencies (FDs, CFDs) and denial constraints (DCs); user-defined functions 2. Scale Repairs: handle serial repair algorithms May 16, 2017 28/73
  • 29. BigDansing – Structure of the Violation Graph Rule 1: Zipcode → City Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate) Possible fixes on the employee table: Rule 1: t2[City] = t4[City]; Rule 2: t1[Salary] > t2[Salary] OR t1[Rate] < t2[Rate] May 16, 2017 29/73
  • 30. BigDansing – Structure of the Violation Graph Rule 1: Zipcode → City Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate) [Graph: nodes t1, t5, t2, t4, t6; edges labeled R1: City, R1: City, and R2: Salary, Rate] Possible fixes: Rule 1: t2[City] = t4[City]; Rule 1: t4[City] = t6[City]; Rule 2: t1[Salary] > t2[Salary] OR t1[Rate] < t2[Rate]; Rule 2: t5[Salary] > t2[Salary] OR t5[Rate] < t2[Rate] May 16, 2017 30/73
  • 31. BigDansing – Data Repair as a Black box [Diagram: the violation graph over t1, t5, t2, t4, t6 goes through Graph Analysis, which yields the sub-graphs {t1, t5, t2} (R2: Salary, Rate) and {t2, t4, t6} (R1: City); each sub-graph is passed to a separate instance of a Serial Repair Algorithm] May 16, 2017 31/73
  • 33. Performance of a Single Machine [Figure: two plots of Runtime (Seconds) against Dataset size (rows), panels labeled Rule 1 and Rule 2, comparing BigDansing, NADEEF, PostgreSQL, Spark SQL, and Shark; one panel covers 100,000 to 10,000,000 rows, the other 100,000 to 300,000 rows] May 16, 2017 33/73
  • 34. Performance on a 16-machine cluster [Figure: two plots of Time (Seconds) against Dataset size (rows), panels labeled Rule 1 and Rule 2; one panel covers 1M to 3M rows (BigDansing-Spark, Spark SQL, Shark), the other 10M to 40M rows (BigDansing-Spark, BigDansing-Hadoop, Spark SQL, Shark)] May 16, 2017 34/73
  • 35. Performance on a 16-machine cluster [Figure: two plots; Runtime (Seconds) against #-workers (1, 2, 4, 8, 16) for BigDansing and Spark SQL; Time (Seconds) against Dataset size (647M to 1907M rows) for BigDansing-Spark, BigDansing-Hadoop, and Spark SQL] May 16, 2017 35/73
  • 36. Detecting Violations on RDF Plan: Scope → Block 1 → Iterate 1 → Block 2 → Iterate 2 → Block 3 → Iterate 3 → Detect → GenFix May 16, 2017 36/73
  • 37. Detecting Violations on RDF [Figure: Runtime (Seconds) of BigDansing and S2RDF* for 21M, 42M, 85M, and 170M RDF triples, split into Pre-processing and Violation Detection] * Alexander Schätzle, et al., “S2RDF: RDF Querying with SPARQL on Spark”, in PVLDB 2016 May 16, 2017 37/73
  • 38. BigDansing: A System for Big Data Cleansing ✓ Easy-to-use ✓ Efficient ✓ Extensible ✓ Scalable * Zuhair Khayyat, et al., “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015. May 16, 2017 38/73
  • 39. IEJoin: Fast and Scalable Inequality Joins May 16, 2017 39/73
  • 40. OCJoin in BigDansing [Figure: two plots; Runtime (Seconds) against Data size (100,000 to 300,000 rows) for OCJoin, UCrossProduct, and Cross product; Time (Seconds) against Dataset size (1M to 3M rows) for BigDansing-Spark, Spark SQL, and Shark] May 16, 2017 40/73
  • 41. What is the Problem? Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate) As a query: Select * from D t1 JOIN D t2 on t1.Salary < t2.Salary AND t1.Rate > t2.Rate Processed as a Cartesian product: O(n²) May 16, 2017 41/73
  • 43. IEJoin – a New Join Algorithm
    In data cleansing: Q1: Select * from D t1 JOIN D t2 on t1.Salary > t2.Salary AND t1.Rate < t2.Rate
    Interval intersection: Q2: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≥ s.start
    Joining tables with (≠): Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start
    May 16, 2017 43/73
  • 44. Algorithm Discovery Q1: Select * from D t1 JOIN D t2 on t1.Salary < t2.Salary AND t1.Rate > t2.Rate
    Table D (Salary, Rate): t1 (100, 5), t2 (90, 9), t3 (150, 15), t4 (120, 10)
    Sort descending on Salary: t3(150), t4(120), t1(100), t2(90). Salary partial answer: (t2, t1), (t2, t4), (t2, t3) … (t4, t3)
    Sort descending on Rate: t3(15), t4(10), t2(9), t1(5). Rate partial answer: (t1, t2), (t1, t4), (t1, t3) … (t4, t3)
    May 16, 2017 44/73
  • 45. Algorithm Discovery Q1: Select * from D t1 JOIN D t2 on t1.Salary < t2.Salary AND t1.Rate > t2.Rate
    Table D (Salary, Rate): t1 (100, 5), t2 (90, 9), t3 (150, 15), t4 (120, 10)
    Rate partial answer: {(t1, t2), (t1, t4), (t1, t3), (t2, t4), (t2, t3), (t4, t3)}
    Salary partial answer: {(t2, t1), (t2, t4), (t2, t3), (t1, t4), (t1, t3), (t4, t3)}
    The expected result is: (t2, t1)
    May 16, 2017 45/73
  • 46. IEJoin – the Algorithm
    Table D (Salary, Rate): t1 (100, 5), t2 (90, 9), t3 (150, 15), t4 (120, 10)
    Sort descending on Salary: t3(150), t4(120), t1(100), t2(90), at positions 0, 1, 2, 3
    Sort descending on Rate: t3(15), t4(10), t2(9), t1(5); Permutation Array: 0, 1, 3, 2
    Bit-Array over (t3, t4, t2, t1): starts as 0 0 0 0 and ends as 1 1 1 1 [Diagram annotations: Sequential scan, Random access]
    Result = (t2, t1)
    May 16, 2017 46/73
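The example above can be reproduced with a small, simplified self-join in the spirit of IEJoin. The sketch assumes the query shape of Q1 (Salary <, Rate >), adds its own handling of ties, and scans the bit-array linearly without the bitmap-index optimization, so it should be read as an illustration of the sort, permutation, and bit-array idea rather than the published algorithm. The function name `iejoin_self` is hypothetical.

```python
def iejoin_self(rows):
    """rows: id -> (salary, rate).
    Returns (pairs, perm) where pairs holds every (a, b) with
    a.salary < b.salary and a.rate > b.rate."""
    ids = list(rows)
    by_salary = sorted(ids, key=lambda t: rows[t][0], reverse=True)
    by_rate = sorted(ids, key=lambda t: rows[t][1], reverse=True)
    n = len(ids)

    # Permutation array: for each position in the Rate order,
    # where does that tuple sit in the Salary order?
    salary_pos = {t: p for p, t in enumerate(by_salary)}
    perm = [salary_pos[t] for t in by_rate]

    # first_smaller[p]: first Salary-order position with a strictly
    # smaller salary than position p (skips ties, since < is strict).
    first_smaller = [n] * n
    for p in range(n - 2, -1, -1):
        if rows[by_salary[p + 1]][0] < rows[by_salary[p]][0]:
            first_smaller[p] = p + 1
        else:
            first_smaller[p] = first_smaller[p + 1]

    bits = [0] * n  # bit p is set once the tuple at Salary position p was visited
    pairs = []
    i = 0
    while i < n:
        # Tuples with equal rate are probed together before any of their
        # bits is set, so that equal rates never count as ">".
        j = i
        while j < n and rows[by_rate[j]][1] == rows[by_rate[i]][1]:
            j += 1
        for k in range(i, j):
            p = perm[k]
            # Visited tuples have a strictly higher rate; those at later
            # Salary positions also have a strictly lower salary.
            for q in range(first_smaller[p], n):
                if bits[q]:
                    pairs.append((by_salary[q], by_salary[p]))
        for k in range(i, j):
            bits[perm[k]] = 1
        i = j
    return pairs, perm

D = {"t1": (100, 5), "t2": (90, 9), "t3": (150, 15), "t4": (120, 10)}
pairs, perm = iejoin_self(D)
print(pairs, perm)
```

The two sorts dominate the preparation cost; the inner scan over the bit-array is the part that the bitmap-index optimization is meant to shorten.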
  • 47. Sorting Orders Q1: Select * from D t1 JOIN D t2 on t1.Salary < t2.Salary (OP1) AND t1.Rate > t2.Rate (OP2)
    For self joins: Salary is sorted in ascending order if OP1 is either > or ≥, otherwise in descending order; Rate is sorted in descending order if OP2 is either > or ≥, otherwise in ascending order.
    For non-self joins: Salary is sorted in descending order if OP1 is either > or ≥, otherwise in ascending order; Rate is sorted in ascending order if OP2 is either > or ≥, otherwise in descending order.
    May 16, 2017 47/73
  • 48. Optimizations – Bitmap Index [Figure: bit-array B divided into chunks C1 to C4, with two marked positions, (i) pos 6 and (ii) pos 9, and a label “max”] May 16, 2017 48/73
  • 49. Optimizations – Not Equal Operator q Convert each (≠) into one (>) and one (<) joined with UNION ALL operator Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start Q’k: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end < s.start UNION ALL SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end > s.start May 16, 2017 49/73
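The rewrite of (≠) into one (<) query and one (>) query can be checked on a toy table. The event tuples below are made-up sample data, and the `join` helper is a hypothetical nested-loop stand-in for the SQL queries Qk and Q'k; the point is only that the two branches are disjoint and together cover every (≠) match.

```python
import operator

# Hypothetical sample events: (id, start, end)
EVENTS = [("e1", 1, 4), ("e2", 4, 6), ("e3", 2, 4), ("e4", 7, 9)]

def join(events, second_op):
    """Nested-loop evaluation of: r.start <= s.end AND second_op(r.end, s.start)."""
    return [
        (r[0], s[0])
        for r in events
        for s in events
        if r[1] <= s[2] and second_op(r[2], s[1])
    ]

not_equal = join(EVENTS, operator.ne)                                # Qk
rewritten = join(EVENTS, operator.lt) + join(EVENTS, operator.gt)    # Q'k (UNION ALL)

print(sorted(not_equal) == sorted(rewritten))
```

Because no pair can satisfy both r.end < s.start and r.end > s.start, UNION ALL introduces no duplicates, so no de-duplication step is needed.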
  • 50. Optimizations – Selectivity Estimation A query with three attributes: r.Salary < s.Salary AND r.Rate > s.Rate AND r.Age > s.Age Use sampling to estimate the maximum output size of each attribute pair: Est(Salary, Rate), Est(Salary, Age), Est(Rate, Age) Steps: Range Partitioning → Sorting → Pruning → Calculate Max Output [Diagram: partitions 1 … n based on OP1 and based on OP2; Estimated Output = number of overlapping partitions = 2] May 16, 2017 50/73
  • 52. Serial IEJoin vs. Naïve Baseline [Figure: two log-scale plots (panels: Salary-Rate, Interval Intersection) of Runtime (Seconds) against Input size (10K, 50K, 100K) for PG-IEJoin, PG-Original, MonetDB, and DBMS-X] May 16, 2017 52/73
  • 54. Parallel and Distributed IEJoin – 100M Rows [Figure: two bar charts (panels: Salary-Rate, Interval Intersection) of Runtime (Seconds), split into Indexing and Querying, for Parallel-IEJoin, Distributed-IEJoin, DPG-GiST, DPG-BTree, SparkSQL-SM, and SparkSQL; several bars are marked X] May 16, 2017 54/73
  • 55. IEJoin q A new join algorithm q Based on conditions: (<, ≤, >, ≥, ≠) q Extremely fast and highly scalable q Utilizes sorting and efficient data structures q Easy to implement in traditional DBMS and distributed systems * Zuhair Khayyat, et al., “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015 * Zuhair Khayyat , et al., “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015 May 16, 2017 55/73
  • 57. BigDansing’s implementations [Diagram: the cleansing cycle (Input Data and Detection Rules → 1st: Detect → 2nd: Analyze → 3rd: Repair → 4th: Update Input Data → Clean Data) annotated with the software stack labels BigDansing, Apache Hadoop, Giraph, Apache Spark, GraphX, and HDFS] May 16, 2017 57/73
  • 58. Pregel*/Giraph Abstraction Based on vertex-centric computation. Abstraction: compute(), combine() & aggregate(). In-memory bulk synchronous parallel (BSP) execution. [Diagram: Supersteps 1, 2, and 3, each run by Workers 1 to 3 and separated by a BSP Barrier] * G. Malewicz, et al., “Pregel: A System for Large-Scale Graph Processing,” in SIGMOD 2010 May 16, 2017 58/73
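A single-process simulation can make the vertex-centric BSP model concrete. The sketch below is not Giraph or Pregel code: `run_bsp` and `max_value_compute` are hypothetical names, only the compute() part of the abstraction is modelled (no combine() or aggregate()), and the example algorithm is the usual propagate-the-maximum toy.

```python
def run_bsp(graph, initial_values, compute, max_supersteps=30):
    """graph: vertex -> list of out-neighbours.
    compute(superstep, value, messages, neighbours) returns
    (new_value, [(target, message), ...], vote_to_halt)."""
    values = dict(initial_values)
    inbox = {v: [] for v in graph}
    active = set(graph)
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        halted = set()
        for v in active:
            values[v], messages, halt = compute(superstep, values[v], inbox[v], graph[v])
            for target, message in messages:
                outbox[target].append(message)
            if halt:
                halted.add(v)
        # BSP barrier: messages sent in this superstep are delivered
        # only at the start of the next one.
        inbox = outbox
        # A halted vertex is reactivated when it receives a message.
        active = (active - halted) | {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return values, superstep

def max_value_compute(superstep, value, messages, neighbours):
    """Propagate the largest value seen so far; halt when nothing changed."""
    best = max([value] + messages)
    if superstep == 0 or best > value:
        return best, [(n, best) for n in neighbours], False
    return value, [], True

GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, supersteps = run_bsp(GRAPH, {"a": 3, "b": 6, "c": 2, "d": 1}, max_value_compute)
print(final_values, supersteps)
```

In a distributed run each worker would execute the inner loop for its own vertices, and the barrier is the point where the slowest worker determines the length of the superstep.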
  • 59. Problems of Giraph The Light Side: algorithm is predictable, structure is fixed. The Dark Side: algorithm is unforeseen, structure is variable. Error graph (violation graph) is random, big and unpredictable. May 16, 2017 59/73
  • 60. How Giraph Optimizes Computations 1. Faster graph loading: simple graph partitioning (Hash, Range) 2. Optimized for graph structure: sophisticated and expensive partitioning techniques (Min-cuts) [Figure: Run Time (Min) on the LiveJournal, kgraph4m68m, and arabic-2005 graphs for Hash, Range, and Min-cuts partitioning] The runtime of a single iteration is as fast as the slowest worker. May 16, 2017 60/73
  • 61. Behaviors of Different Graph Algorithms PageRank vs. Distributed Minimal Spanning Tree [Figure: log-scale plot of In Messages (Millions) against SuperSteps (0 to 60); series: PageRank - Total, PageRank - Max/W, DMST - Total, DMST - Max/W] May 16, 2017 61/73
  • 62. Source of Imbalance in Giraph 1. High vertex response time 2. Long time to receive incoming messages 3. Long time to send outgoing messages [Diagram: Superstep 1 drawn three times for Workers 1 to 3 up to the BSP Barrier, once per cause: high vertex response time, long time to receive in messages, long time to send out messages] May 16, 2017 62/73
  • 63. Mizan – Solving the Workload Imbalance Move vertices between workers during runtime. Planning and vertex migrations happen within the BSP barrier to maintain computation consistency. [Diagram: Supersteps 1 to 3 across Workers 1 to 3, each BSP Barrier followed by the Migration Planner and a Migration Barrier. Mizan Worker components: Vertex Compute(), BSP Graph Processor, Communicator - DHT, Storage Manager, IO (HDFS/Local Disks), Load Balancer: Migration Planner] May 16, 2017 63/73
  • 64. Mizan’s Migration Planning Steps 1. Identify the source of workload imbalance across workers: remote outgoing messages; all incoming messages; response time [Diagram: vertices V1 to V6 on Worker 1 and Worker 2, with Mizan monitoring Remote Incoming Messages, Local Incoming Messages, Remote Outgoing Messages, and Vertex Response Time] May 16, 2017 64/73
  • 65. Mizan’s Migration Planning Steps 1. Identify the source of workload imbalance across workers 2. Select the migration objective through a statistical analysis § Optimize for outgoing messages, or § Optimize for incoming messages, or § Optimize for response time May 16, 2017 65/73
  • 66. Mizan’s Migration Planning Steps 1. Identify the source of workload imbalance across workers 2. Select the migration objective through a statistical analysis 3. Pair over-utilized workers with under-utilized ones [Figure: workers W7, W2, W1, W5, W8, W4, W0, W6, W3, W9 laid out in a row with index positions starting at 0] May 16, 2017 66/73
  • 67. Mizan’s Migration Planning Steps 1. Identify the source of workload imbalance across workers 2. Select the migration objective through a statistical analysis 3. Pair over-utilized workers with under-utilized ones 4. Select vertices to migrate: select the smallest number of vertices that has the highest impact; vertex ownership is tracked with a distributed hash table (DHT); delayed migration reduces migration cost May 16, 2017 67/73
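Steps 3 and 4 can be sketched as follows. This is a simplified stand-in, not Mizan's planner: it assumes a single numeric load per worker for the chosen objective, pairs the most loaded worker with the least loaded one, and greedily moves the heaviest vertices as long as the source stays at least as loaded as the target. All names and numbers are hypothetical.

```python
def pair_workers(load):
    """Order workers by load (descending) and pair first with last,
    second with second-to-last, and so on."""
    ordered = sorted(load, key=load.get, reverse=True)
    return [(ordered[k], ordered[-1 - k]) for k in range(len(ordered) // 2)]

def select_vertices(vertex_cost, source_load, target_load):
    """Pick few, heavy vertices: take the costliest vertex first and move it
    only if the source does not end up lighter than the target."""
    chosen = []
    for vertex, cost in sorted(vertex_cost.items(), key=lambda kv: kv[1], reverse=True):
        if source_load - cost >= target_load + cost:
            chosen.append(vertex)
            source_load -= cost
            target_load += cost
    return chosen

# Hypothetical per-worker load for the selected objective (e.g. outgoing messages)
LOAD = {"W0": 90, "W1": 20, "W2": 60, "W3": 30}
pairs = pair_workers(LOAD)

# Hypothetical per-vertex cost on the most loaded worker, W0
W0_VERTICES = {"v1": 40, "v2": 25, "v3": 15, "v4": 10}
to_migrate = select_vertices(W0_VERTICES, LOAD["W0"], LOAD["W1"])
print(pairs, to_migrate)
```

In this toy run two vertices are enough to bring W0 and W1 to the same load, which mirrors the goal of moving as few vertices as possible for the largest effect.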
  • 70. Mizan – a General Graph Processing System A Pregel clone: supports very large graphs; runs on very large clusters. Dynamic fine-grained vertex migrations to balance computation and communication. Optimized for predictable and unpredictable graph algorithms and structures. [Diagram: software stack labels BigDansing, Apache Spark, Mizan, GraphX, Giraph, HDFS] * Zuhair Khayyat, et al., “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013 May 16, 2017 70/73
  • 71. Summary
    BigDansing: a general system for big data cleansing; performance up to 2 orders of magnitude faster; SIGMOD 2015
    IEJoin: a novel algorithm for fast inequality joins; performance at least 2 orders of magnitude faster; PVLDB 2015 & VLDBJ 2017
    Mizan: a general system for distributed graph processing; performance improvements up to 84%; EuroSys 2013
    May 16, 2017 71/73
  • 72. Publications
    Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015.
    Divy Agrawal, Lamine Ba, Laure Berti-Equille, Sanjay Chawla, Ahmed Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Mohammed J. Zaki, “Rheem: Enabling Multi-Platform Task Execution”, in SIGMOD 2016.
    Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015.
    Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin, “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.
    Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, Panos Kalnis, “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013.
    May 16, 2017 73/73