Data cleansing approaches have usually focused on detecting and fixing errors, with little attention to scaling to big data. This is a serious impediment, since identifying and repairing dirty data often involves processing huge input datasets, handling sophisticated error discovery approaches, and managing large numbers of arbitrary errors. With large datasets, error detection becomes overly expensive and complicated, especially when user-defined functions are involved. Furthermore, sophisticated error discovery that requires inequality joins calls for a dedicated algorithm rather than naïve parallelization. Finally, when repairing a large number of errors, their skewed distribution may obstruct effective repairs. In this dissertation, I present solutions to these three problems in scaling data cleansing.
3. Why Is Data Cleansing Important?
• 25% of the world's critical data is dirty
• 60% - 98% of a data scientist's time is lost in the process of data cleansing
• “duplicate and dirty data costs the healthcare industry over $300 billion every year” -- Joe Fusaro (RingLead)
• “inaccurate data has a direct impact ... the average company losing 12% of its revenue” -- Ben Davis (Econsultancy)
May 16, 2017 3/73
5. The Process of Data Cleansing
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
[Figure: the Input Data (Dirty) and the Detection Rules feed a four-step pipeline over the table above, 1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data. Intermediate results are labeled Dirty and the final output is labeled Clean Data.]
7. Problems of Big Data Cleansing
[Figure: the four-step cleansing pipeline (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data) applied to the example employee table, with Detection Rules and Input Data as inputs and Clean Data as output. Annotations on the pipeline: "≈ 90% Runtime" and "Most of Research".]
[Chart: Time (Seconds), 0 to 100, versus Violation percentage (1%, 5%, 10%, 50%); series: Violation detection, Data repair.]
8. Problems of Big Data Cleansing
[Figure: the four-step cleansing pipeline (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data) applied to the example employee table, with Detection Rules and Input Data as inputs and Clean Data as output.]
1. Violation detection becomes too expensive with big data:
a. Enumerating all tuples is not possible
b. Not feasible to implement a parallel version of each detection rule
c. Serial repair algorithms cannot handle big errors
9. Problems of Big Data Cleansing
[Figure: the four-step cleansing pipeline (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data) applied to the example employee table, with Detection Rules and Input Data as inputs and Clean Data as output.]
2. Complex error discovery rules based on inequality conditions are too expensive:
Rule 2: An employee who earns a higher salary must pay more taxes than others
→ violation when (ti.Salary < tj.Salary) AND (ti.Rate > tj.Rate)
10. Problems of Big Data Cleansing
[Figure: the four-step cleansing pipeline (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data) applied to the example employee table, with Detection Rules and Input Data as inputs and Clean Data as output.]
3. Error graph (violation graph) is random, big and unpredictable:
• Irregular structures
• Skewed distributions
• Unpredictable workload of algorithm
12. BigDansing: A System for Big Data Cleansing
18. Rule 1 – Logical Plan
• Any two employees in the same Zipcode must be in the same City
• FD: Zipcode → City
Logical operators: Scope, Block, Iterate, Detect, GenFix
Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
19. Rule 1 – Physical Plan
• Any two employees in the same Zipcode must be in the same City
• FD: Zipcode → City
Physical operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct
Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix
Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
20. Rule 1 – Execution Plan
• Any two employees in the same Zipcode must be in the same City
• FD: Zipcode → City
Execution plan: Spark-PScope → Spark-PBlock → Spark-PIterate → Spark-PDetect → Spark-PGenFix
Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix
Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
21. Rule 1 – Execution Example
Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Input:
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
1) Scope (keep Zipcode and City):
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
2) Block (group by Zipcode):
Zipcode City
t1 10001 NY
t2 90210 LA
t4 90210 SF
t6 90210 LA
t3 60601 CH
t5 60827 CH
3) Iterate: (t2, t4), (t2, t6), (t4, t6)
4) Detect: (t2, t4), (t4, t6)
5) GenFix: t2[City] = t4[City], t4[City] = t6[City]
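The five steps of this example can be sketched in plain Python. This is a single-machine illustration only, not BigDansing's operator API: the function names `scope`, `block`, `iterate`, `detect`, and `gen_fix` and the tuple layout are assumptions chosen to mirror the logical operators on the slide.

```python
from collections import defaultdict
from itertools import combinations

# Example tuples from the slide: id -> (Name, Zipcode, City, State, Salary, Rate).
EMPLOYEES = {
    "t1": ("Annie", "10001", "NY", "NY", 24000, 15),
    "t2": ("Laure", "90210", "LA", "CA", 25000, 10),
    "t3": ("John", "60601", "CH", "IL", 40000, 25),
    "t4": ("Mark", "90210", "SF", "CA", 88000, 28),
    "t5": ("Robert", "60827", "CH", "IL", 15000, 15),
    "t6": ("Mary", "90210", "LA", "CA", 81000, 28),
}

def scope(table):
    """Keep only the attributes the rule needs: (Zipcode, City)."""
    return {tid: (row[1], row[2]) for tid, row in table.items()}

def block(scoped):
    """Group tuples that share a Zipcode."""
    blocks = defaultdict(list)
    for tid, (zipcode, city) in scoped.items():
        blocks[zipcode].append((tid, city))
    return blocks

def iterate(blocks):
    """Enumerate candidate pairs inside each block only."""
    for members in blocks.values():
        yield from combinations(members, 2)

def detect(pairs):
    """Keep the pairs whose cities differ (City_i != City_j)."""
    return [(a[0], b[0]) for a, b in pairs if a[1] != b[1]]

def gen_fix(violations):
    """Propose one possible fix per violation."""
    return [f"{a}[City] = {b}[City]" for a, b in violations]

candidates = list(iterate(block(scope(EMPLOYEES))))
violations = detect(candidates)
fixes = gen_fix(violations)
```

Blocking is what keeps the candidate set small here: only the three tuples that share Zipcode 90210 are ever paired.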
22. Rule 2 – Logical Plan
• An employee who earns a higher salary must pay more taxes than others
• DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
• For Annie, compare Salary with Laure, John, Mark, Robert, and Mary
• Where the Salary condition holds, compare Rate
• Where both conditions hold, report a violation!
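The tuple-by-tuple comparison described on this slide amounts to a nested loop over all ordered pairs. A minimal sketch follows, assuming the (Salary, Rate) values of the example table; the function name `naive_rule2` is hypothetical.

```python
from itertools import permutations

# id -> (Salary, Rate), copied from the example table.
SALARY_RATE = {
    "t1": (24000, 15),
    "t2": (25000, 10),
    "t3": (40000, 25),
    "t4": (88000, 28),
    "t5": (15000, 15),
    "t6": (81000, 28),
}

def naive_rule2(table):
    """Check every ordered pair (ti, tj) against the denial constraint."""
    comparisons = 0
    violations = []
    for ti, tj in permutations(table, 2):
        comparisons += 1
        lower_salary = table[ti][0] < table[tj][0]
        higher_rate = table[ti][1] > table[tj][1]
        if lower_salary and higher_rate:
            violations.append((ti, tj))
    return violations, comparisons

violations, comparisons = naive_rule2(SALARY_RATE)
```

For n tuples the loop performs n(n - 1) comparisons, so the cost grows quadratically with the input size.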
29. BigDansing – Structure of the Violation Graph
Rule 1: Zipcode → City
Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
• Rule 1: t2[City] = t4[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Rate] < t2[Rate]
30. BigDansing – Structure of the Violation Graph
Rule 1: Zipcode → City
Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
[Figure: violation graph over the vertices t1, t2, t4, t5, and t6; edge labels: R1: City (twice) and R2: Salary, Rate.]
• Rule 1: t2[City] = t4[City]
• Rule 1: t4[City] = t6[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Rate] < t2[Rate]
• Rule 2: t5[Salary] > t2[Salary] OR t5[Rate] < t2[Rate]
31. BigDansing – Data Repair as a Black Box
[Figure: the violation graph (t1, t2, t4, t5, t6) passes through Graph Analysis, which yields subgraphs such as t1, t5, t2 (R2: Salary, Rate) and t2, t4, t6 (R1: City), as well as pairs tx, ty (R1: City); each subgraph is handed to its own instance of a Serial Repair Algorithm.]
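One way to read the Graph Analysis step is as a split of the violation graph into independent pieces, each of which can be given to a separate instance of an unmodified serial repair algorithm. The sketch below assumes that the pieces are connected components, which the slide does not state explicitly; `connected_components` is a hypothetical helper, and `tx`, `ty` are the placeholder vertices from the figure.

```python
from collections import defaultdict

def connected_components(edges):
    """Split an undirected violation graph into its connected components."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components

# Violations from the running example plus the placeholder pair (tx, ty).
violation_edges = [
    ("t2", "t4"), ("t4", "t6"),   # Rule 1
    ("t1", "t2"), ("t5", "t2"),   # Rule 2
    ("tx", "ty"),                 # Rule 1, placeholder vertices
]
components = connected_components(violation_edges)
```

Each entry of `components` shares no vertex with the others, so the repairs can run independently and, in a distributed setting, in parallel.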
39. IEJoin: Fast and Scalable Inequality Joins
41. What is the Problem?
• Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
  ▪ SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
• Processed as a Cartesian product: O(n²)
43. IEJoin – a New Join Algorithm
• In data cleansing:
  ▪ Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary > t2.Salary AND t1.Rate < t2.Rate
• Interval intersection:
  ▪ Q2: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≥ s.start
• Joining tables with (≠):
  ▪ Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start
44. Algorithm Discovery
Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
Salary Rate
t1 100 5
t2 90 9
t3 150 15
t4 120 10
Sort descending on Salary: t3(150) t4(120) t1(100) t2(90)
Salary partial answer: (t2, t1), (t2, t4), (t2, t3) … (t4, t3)
Sort descending on Rate: t3(15) t4(10) t2(9) t1(5)
Rate partial answer: (t2, t1), (t4, t1), (t3, t1) … (t3, t4)
45. Algorithm Discovery
Salary Rate
t1 100 5
t2 90 9
t3 150 15
t4 120 10
Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
Rate partial answer:
{(t2, t1), (t4, t1), (t3, t1),
(t4, t2), (t3, t2),
(t3, t4)}
Salary partial answer:
{(t2, t1), (t2, t4), (t2, t3),
(t1, t4), (t1, t3),
(t4, t3)}
The expected result is the intersection of the two partial answers: (t2, t1)
46. IEJoin – the Algorithm
Salary Rate
t1 100 5
t2 90 9
t3 150 15
t4 120 10
• Sort descending on Salary: t3(150) t4(120) t1(100) t2(90), at positions 0 1 2 3
• Sort descending on Rate: t3(15) t4(10) t2(9) t1(5)
Permutation array: 0 1 3 2
Bit-array: 0 0 0 0 initially; bits are set to 1 as tuples are visited
[Figure annotations: Sequential scan, Random access]
Result = (t2, t1)
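The interplay of the two sorted lists, the permutation array, and the bit-array can be sketched for the self-join Q1 (t1.Salary < t2.Salary AND t1.Rate > t2.Rate). This is a simplified illustration rather than the published IEJoin algorithm: the function name `ie_self_join` is hypothetical, and ties are handled by an explicit value check in place of the offset handling a full implementation would need.

```python
def ie_self_join(rows):
    """rows maps a tuple id to (salary, rate).

    Returns the permutation array and all pairs (a, b) with
    a.salary < b.salary and a.rate > b.rate.
    """
    names = list(rows)
    by_salary = sorted(names, key=lambda n: rows[n][0], reverse=True)
    by_rate = sorted(names, key=lambda n: rows[n][1], reverse=True)

    # Permutation array: position in by_salary of each tuple in by_rate.
    position = {name: i for i, name in enumerate(by_salary)}
    permutation = [position[name] for name in by_rate]

    visited = [False] * len(names)   # the bit-array, indexed like by_salary
    result = []
    for pos in permutation:          # sequential scan in descending Rate order
        b = by_salary[pos]
        # Positions after pos hold tuples whose salary is not larger than b's.
        # A set bit means the tuple came earlier in Rate order.
        for k in range(pos + 1, len(names)):   # random access into the bit-array
            if visited[k]:
                a = by_salary[k]
                # Explicit strict check so that equal values never match.
                if rows[a][0] < rows[b][0] and rows[a][1] > rows[b][1]:
                    result.append((a, b))
        visited[pos] = True
    return permutation, result

example = {"t1": (100, 5), "t2": (90, 9), "t3": (150, 15), "t4": (120, 10)}
permutation, pairs = ie_self_join(example)
```

On the four-tuple example this reproduces the permutation array 0 1 3 2 and the single result (t2, t1), without ever materializing the two partial answers.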
47. Sorting Orders
Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
(OP1 is the operator on Salary, here <; OP2 is the operator on Rate, here >)
• For self joins:
  ▪ Salary: ascending order if OP1 is either > or ≥, otherwise descending order
  ▪ Rate: descending order if OP2 is either > or ≥, otherwise ascending order
• For non-self joins:
  ▪ Salary: descending order if OP1 is either > or ≥, otherwise ascending order
  ▪ Rate: ascending order if OP2 is either > or ≥, otherwise descending order
50. Optimizations – Selectivity Estimation
• A query with three attributes: r.Salary < s.Salary AND r.Rate > s.Rate AND r.Age > s.Age
• Use sampling to estimate the maximum output size for each pair of attributes: Est(Salary, Rate), Est(Salary, Age), Est(Rate, Age)
[Figure: estimation pipeline Range Partitioning → Sorting → Pruning → Calculate Max Output. Partitions 1 to n are arranged once based on OP1 and once based on OP2, and overlapping partitions are counted. Estimated Output = number of overlapping partitions = 2.]
55. IEJoin
• A new join algorithm
• Based on conditions: (<, ≤, >, ≥, ≠)
• Extremely fast and highly scalable
• Utilizes sorting and efficient data structures
• Easy to implement in traditional DBMS and distributed systems
* Zuhair Khayyat, et al., “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015
* Zuhair Khayyat, et al., “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015
57. BigDansing's Implementations
[Figure: the four-step cleansing pipeline (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data) applied to the example employee table, shown next to a system stack with the components BigDansing, Apache Hadoop, Giraph, Apache Spark, GraphX, and HDFS.]
58. Pregel*/Giraph Abstraction
• Based on vertex-centric computation
• Abstraction:
  ▪ compute(), combine() & aggregate()
• In-memory bulk synchronous parallel (BSP) execution
* G. Malewicz, et al., “Pregel: A System for Large-Scale Graph Processing,” in SIGMOD 2010
[Figure: Workers 1 to 3 run Superstep 1, Superstep 2, and Superstep 3, with a BSP barrier between consecutive supersteps.]
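The vertex-centric BSP model can be illustrated with a small single-process simulation. This is a toy sketch, not the Pregel or Giraph API: `run_bsp` and `propagate_max` are hypothetical names, only a compute()-style function is modeled, and combine() and aggregate() are left out.

```python
def run_bsp(graph, values, compute, max_supersteps=30):
    """Run compute() on every active vertex, one superstep at a time."""
    values = dict(values)
    inbox = {v: [] for v in graph}
    active = set(graph)              # every vertex is active in the first superstep
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        for v in active:             # in a real system, workers do this in parallel
            values[v], outgoing = compute(superstep, values[v], inbox[v], graph[v])
            for target, message in outgoing:
                outbox[target].append(message)
        # BSP barrier: messages become visible only in the next superstep.
        inbox = outbox
        active = {v for v in graph if inbox[v]}
        superstep += 1
    return values

def propagate_max(superstep, value, messages, neighbours):
    """Vertex program: adopt the largest value seen and tell the neighbours."""
    best = max([value] + messages)
    if superstep == 0 or best > value:
        return best, [(n, best) for n in neighbours]
    return value, []                 # nothing new to report, so stay silent

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = run_bsp(graph, {"a": 3, "b": 6, "c": 2, "d": 1}, propagate_max)
```

A vertex that receives no messages is skipped in the next superstep, and the run ends once no messages are in flight.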
60. How Giraph Optimizes Computations
1. Faster Graph Loading
  ▪ Simple graph partitioning
  ▪ Hash, Range
2. Optimized for graph structure
  ▪ Sophisticated and expensive partitioning techniques
  ▪ Min-cuts
[Chart: RunTime (Min), 0 to 350, for the LiveJournal, kgraph4m68m, and arabic-2005 graphs; series: Hash, Range, Min-cuts.]
A single iteration runs only as fast as its slowest worker
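The effect of the partitioning choice on the slowest worker can be shown with a toy cost model. Everything in this sketch is an assumption made for illustration: the per-worker load is taken to be the sum of vertex degrees, the degree values are made up, and the helper names are hypothetical. It does not reproduce Giraph's partitioners or the measurements in the chart.

```python
from math import ceil

def hash_partition(vertex_ids, workers):
    """Assign by vertex id modulo the worker count (a stand-in for hashing)."""
    return {v: v % workers for v in vertex_ids}

def range_partition(vertex_ids, workers):
    """Assign contiguous id ranges of (almost) equal size to each worker."""
    ordered = sorted(vertex_ids)
    chunk = ceil(len(ordered) / workers)
    return {v: i // chunk for i, v in enumerate(ordered)}

def iteration_cost(assignment, degree, workers):
    """Sum vertex degrees per worker; the iteration costs as much as the maximum."""
    load = [0] * workers
    for vertex, worker in assignment.items():
        load[worker] += degree[vertex]
    return max(load), load

# Hypothetical skewed graph: two high-degree vertices with neighbouring ids.
degree = {0: 50, 1: 40, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1}
hash_cost, hash_load = iteration_cost(hash_partition(degree, 2), degree, 2)
range_cost, range_load = iteration_cost(range_partition(degree, 2), degree, 2)
```

With this particular input the range partitioner puts both heavy vertices on the same worker, so one worker determines the iteration time while the other is mostly idle. A different degree distribution could favor range over hash.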
67. Mizan’s Migration Planning Steps
1. Identify the source of workload imbalance across workers
2. Select the migration objective through a statistical analysis
3. Pair over-utilized workers with under-utilized ones
4. Select vertices to migrate
  ▪ Select the fewest vertices that have the highest impact
  ▪ Vertex ownership: distributed hash table (DHT)
  ▪ Delayed migration: reduce migration cost
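Steps 1 and 3 of the planning procedure can be sketched as follows. This is a rough illustration and not Mizan's actual planner: the z-score test, the threshold of 1.0, the single load metric, and the name `plan_pairs` are all assumptions, and steps 2 and 4 (choosing the objective and the vertices) are omitted.

```python
from statistics import mean, pstdev

def plan_pairs(worker_load, z_threshold=1.0):
    """Pair over-utilized workers with under-utilized ones.

    worker_load maps a worker name to one load metric, for example the
    execution time of the last superstep.
    """
    average = mean(worker_load.values())
    spread = pstdev(worker_load.values())
    if spread == 0:
        return []                    # perfectly balanced, nothing to plan

    # Step 1: is any worker an outlier on the high side?
    imbalanced = any(
        (load - average) / spread > z_threshold for load in worker_load.values()
    )
    if not imbalanced:
        return []

    # Step 3: match the most loaded worker with the least loaded, and so on.
    ranked = sorted(worker_load, key=worker_load.get, reverse=True)
    pairs = []
    high, low = 0, len(ranked) - 1
    while high < low:
        source, target = ranked[high], ranked[low]
        if worker_load[source] > average > worker_load[target]:
            pairs.append((source, target))
        high += 1
        low -= 1
    return pairs

pairs = plan_pairs({"w1": 120, "w2": 40, "w3": 60, "w4": 58})
```

Only pairs that straddle the average are kept, so workers already close to the average are left alone.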
70. Mizan – a General Graph Processing System
• A Pregel-clone
  ▪ Supports very large graphs
  ▪ Runs on very large clusters
• Dynamic fine-grained vertex migrations to balance computation and communication
• Optimized for predictable and non-predictable graph algorithms and structures
* Zuhair Khayyat, et al., “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013
[Figure: system stack with the components BigDansing, Apache Spark, Mizan, GraphX, Giraph, and HDFS.]
72. Publications
• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Fast and Scalable Inequality Joins”, The VLDB Journal 2017 special issue: Best Papers of VLDB 2015.
• Divy Agrawal, Lamine Ba, Laure Berti-Equille, Sanjay Chawla, Ahmed Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Mohammed J. Zaki, “Rheem: Enabling Multi-Platform Task Execution”, in SIGMOD 2016.
• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015.
• Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin, “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.
• Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, Panos Kalnis, “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013.