Yilin Xia (yilinx2@illinois.edu),
Shawn Bowers (bowers@gonzaga.edu),
Lan Li (lanl2@illinois.edu), and
Bertram Ludäscher (ludaesch@illinois.edu)
Presented at IDCC-2024 in Edinburgh.
ABSTRACT. We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal argumentation framework (AF). Such argumentation frameworks can be automatically analyzed and solved by translating them to a logic program P_AF whose declarative semantics yield a transparent solution with many desirable properties, e.g., uncontroversial updates are accepted, unjustified ones are rejected, and the remaining ambiguities are exposed and presented to users for further analysis. After motivating the problem, we introduce our approach and illustrate it with a detailed running example introducing both well-founded and stable semantics to help understand the AF solutions. We have begun to develop open-source tools and Jupyter notebooks that demonstrate the practicality of our approach. In future work we plan to develop a toolkit for conflict resolution that can be used in conjunction with OpenRefine, a popular interactive data cleaning tool.
Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
2. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Data Cleaning: the story so far …
● 80% of data science is data wrangling … (or so they say)
● Interactive data cleaning (e.g. Excel, OpenRefine, … )
● Script-based (e.g., Python/pandas, R, … )
● Single-user/single-curator setting (… only the lonely … )
● Multi-user/multi-curator collaboration (… friends … )
3. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Collaborative Data Cleaning: Pros & possible Cons
Joining forces & pooling expertise
→ higher throughput (efficiency)
→ higher data quality output
But also …
→ Need to coordinate more (e.g., vertical and/or horizontal splitting, ...)
→ Need to resolve conflicts / disputes
→ Cost of collaboration
4. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Collaborative Data Cleaning Part-I: Provenance + Expert Merge
● Collaborative DC Provenance Model (CDCM)
● Expert Recipe Merge
5. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Whole Team > Sum of Members?
● Before: Expert coordinator, merging bits & pieces of data cleaning recipes
● Alternative: Tightly-coupled, well-planned collaboration (“eager”)
● New proposal: Loosely-coupled or ad-hoc collaboration (“lazy”) + automated conflict-resolution strategy
[Figure: Ross + Loretta < Rosetta Team]
6. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Loosely-Coupled Multi-Curator Data Cleaning Example
| Book Title | Author | Date |
|---|---|---|
| Against Method | Feyerabend, P. | 1975 |
| Changing Order | Collins, H.M. | ␣␣1985␣ |
| Exceeding Our Grasp | P. Kyle Stanford | 2006 |
| Theory of Information | | 1992 |
Wrangling Goal: Create APA-style in-text citations from the given dataset D
Curators: Ross and Loretta
9. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Data Cleaning Actions → Recipes

Recipe 1

| Step | Action |
|---|---|
| E | rename("Book Title", "Book-Title") |
| F | cell_edit(3, "Author", "Stanford, P.") |
| G | transform("Date", "value.toNumber()") |
| H | del_row(4) |
| I | split_col("Author", ",") |
| J | del_col("Author 2") |
| K | join_col("Author 1", "Date", ",", "Citation") |

Recipe 2

| Step | Action |
|---|---|
| L | rename("Book Title", "Book_Title") |
| M | transform("Date", "value.trim()") |
| N | cell_edit(4, "Author", "Shannon, C.E.") |
| O | cell_edit(3, "Author", "Stanford, P.K.") |
| P | split_col("Author", ",") |
| Q | rename("Author 1", "Last Name") |
| R | rename("Author 2", "First Name") |
| S | join_col("Last Name", "Date", ",", "Citation") |
10. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Data Cleaning Results
Result of Recipe 1

| Book-Title | Author | Date | Author 1 | Citation |
|---|---|---|---|---|
| Against Method | Feyerabend, P. | 1975 | Feyerabend | Feyerabend, 1975 |
| Changing Order | Collins, H.M. | 1985 | Collins | Collins, 1985 |
| Exceeding Our Grasp | Stanford, P. | 2006 | Stanford | Stanford, 2006 |
| Theory of Information | | 1992 | | |

Result of Recipe 2

| Book_Title | Author | Date | Last Name | First Name | Citation |
|---|---|---|---|---|---|
| Against Method | Feyerabend, P. | 1975 | Feyerabend | P. | Feyerabend, 1975 |
| Changing Order | Collins, H.M. | 1985 | Collins | H.M. | Collins, 1985 |
| Exceeding Our Grasp | Stanford, P.K. | 2006 | Stanford | P.K. | Stanford, 2006 |
| Theory of Information | Shannon, C.E. | 1992 | Shannon | C.E. | Shannon, 1992 |

Actions called out on the slide:
● rename("Book Title", "Book-Title") vs. rename("Book Title", "Book_Title")
● del_row(4)
● transform("Date", "value.toNumber()") vs. transform("Date", "value.trim()")
11. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Modeling Data Cleaning Conflicts
[Figure: data cleaning actions in execution order, connected by the attack relationship]
defeated(𝑋) ←
attacks(𝑌, 𝑋),
¬ defeated(𝑌).
12. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Operation Attack Relation (example, one of many)
| Attack Relationship: A (rows) \ B (columns) | update(r,c,v1) | del_row(r) | del_col(c) | split_col(c,sp1) | transform(c,F1) | join_col(c,...ci,sp1, cn1) | rename(c, c1) |
|---|---|---|---|---|---|---|---|
| update(r,c,v2) | A ⟷ B | | | | | | |
| del_row(r) | A ⟶ B | ∅ | | | | | |
| del_col(c) | A ⟶ B | ∅ | ∅ | | | | |
| split_col(c,sp2) | A ⟵ B | ∅ | A ⟵ B | A ⟷ B | | | |
| transform(c,F2) | A ⟷ B | ∅ | A ⟵ B | A ⟶ B | A ⟷ B | | |
| join_col(c,...ci,sp2, cn2) | A ⟵ B | ∅ | A ⟵ B | ∅ | A ⟵ B | A ⟷ B | |
| rename(c, c2) | A ⟶ B | ∅ | A ⟷ B | A ⟶ B | A ⟶ B | A ⟶ B | A ⟷ B |

Each cell describes whether and how operations A and B conflict with each other.
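
The matrix can be read as a lookup table. The following is a minimal Python sketch of such a lookup, not the authors' implementation. It assumes that rows are operation A and columns are operation B, that ∅ means "no attack", and that both operations already refer to the same row r or column c. The names `OPS`, `LOWER`, `FLIP`, and `attack_direction` are hypothetical.

```python
from typing import Optional

# Operation kinds in the order used by the matrix's rows and columns.
OPS = ["update", "del_row", "del_col", "split_col",
       "transform", "join_col", "rename"]

# Lower triangle of the matrix: one list per row (operation A); entry i is
# the cell in the column for OPS[i] (operation B).
# "->" means A attacks B, "<-" means B attacks A, "<->" is a mutual attack,
# and None stands for an empty cell (assumed here to mean "no attack").
LOWER = {
    "update":    ["<->"],
    "del_row":   ["->", None],
    "del_col":   ["->", None, None],
    "split_col": ["<-", None, "<-", "<->"],
    "transform": ["<->", None, "<-", "->", "<->"],
    "join_col":  ["<-", None, "<-", None, "<-", "<->"],
    "rename":    ["->", None, "<->", "->", "->", "->", "<->"],
}

# Swapping the roles of A and B reverses the direction of an attack.
FLIP = {"->": "<-", "<-": "->", "<->": "<->", None: None}


def attack_direction(op_a: str, op_b: str) -> Optional[str]:
    """Return how op_a (as A) and op_b (as B) attack each other, if at all."""
    i, j = OPS.index(op_a), OPS.index(op_b)
    if j <= i:
        # The pair lies in the lower triangle: read the cell directly.
        return LOWER[op_a][j]
    # Otherwise read the mirrored cell and reverse its direction.
    return FLIP[LOWER[op_b][i]]
```

For example, `attack_direction("del_row", "update")` yields `"->"`: deleting a row attacks an edit of a cell in that row.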
13. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Example Data Cleaning Conflicts
| Attack | Description |
|---|---|
| E ↔ L | rename("Book Title", "Book-Title") ↔ rename("Book Title", "Book_Title") |
| H → N | del_row(4) → cell_edit(4, "Author", "Shannon, C.E.") |
| F → P | cell_edit(3, "Author", "Stanford, P.") → split_col("Author", ",") |
| … | … |
defeated(𝑋) ←
attacks(𝑌, 𝑋),
¬ defeated(𝑌).
14. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Formal Argumentation
[Image: BBC4 Moral Maze]
15. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Modeling Conflict: Argumentation Frameworks
defeated(𝑋) ← attacks(𝑌, 𝑋), ¬ defeated(𝑌).
[Figure: attack graph over arguments a, b, c, d; a is accepted, b is defeated, c and d are undecided]
1. a isn’t attacked at all
2. ⇒ a is accepted
3. a attacks b
4. ⇒ b defeated
5. ⇒ b attacks c can be ignored
6. c and d attack each other
7. ⇒ status undecided
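
The seven steps above can be mechanized. The following Python sketch is an illustration under stated assumptions, not the logic-program translation from the abstract: it applies the rule iteratively until nothing changes, accepting an argument once all of its attackers are defeated and defeating it once some attacker is accepted. The function name `solve_well_founded` and the label strings are hypothetical.

```python
def solve_well_founded(arguments, attacks):
    """Label each argument as accepted, defeated, or undecided.

    `attacks` is a list of (attacker, target) pairs.
    """
    attackers = {x: set() for x in arguments}
    for y, x in attacks:
        attackers[x].add(y)

    accepted, defeated = set(), set()
    changed = True
    while changed:
        changed = False
        for x in arguments:
            if x in accepted or x in defeated:
                continue
            if attackers[x] <= defeated:
                # Every attacker is defeated (or there is none): accept x.
                accepted.add(x)
                changed = True
            elif attackers[x] & accepted:
                # Some attacker is accepted: x is defeated.
                defeated.add(x)
                changed = True

    # Whatever could not be settled either way stays undecided.
    return {
        x: "accepted" if x in accepted
        else "defeated" if x in defeated
        else "undecided"
        for x in arguments
    }


# The example from this slide: a attacks b, b attacks c, c and d attack each other.
labels = solve_well_founded(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "c")],
)
print(labels)
```

On this graph the sketch labels a as accepted, b as defeated, and leaves c and d undecided, matching the walkthrough.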
16. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Solving Conflict: Argumentation Frameworks (AF)
Input AF (attack graph) → Output (solved AF)
defeated(𝑋) ⇐
attacks(𝑌, 𝑋),
not defeated(𝑌).
Argument X is defeated if it is attacked by some argument Y that is not itself defeated.
18. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Solving Ross + Loretta (= Rosetta) ad-hoc “collaboration”
Yilin Xia, Shawn Bowers, Lan Li and Bertram Ludäscher. 2023. Games and Argumentation Demo Repository.
https://github.com/idaks/Games-and-Argumentation/tree/idcc
19. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Refined Solution (Stable Model/Stable Extension)
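
A stable extension settles the arguments that the well-founded solution leaves undecided by committing to one side of each remaining dispute. The following Python sketch enumerates stable extensions by brute force: a set of arguments is stable if no member attacks another member and every argument outside the set is attacked by a member. This is an illustration only, practical for small graphs, and it is not the logic-programming approach described in the abstract. The function name `stable_extensions` is hypothetical, and the example graph (a attacks b, b attacks c, c and d attack each other) is the small four-argument graph from the argumentation-framework slide, not the Ross and Loretta recipes.

```python
from itertools import combinations


def stable_extensions(arguments, attacks):
    """Return every stable extension of the attack graph as a set."""
    args = sorted(arguments)
    attack_set = set(attacks)  # (attacker, target) pairs
    result = []
    for size in range(len(args) + 1):
        for subset in combinations(args, size):
            chosen = set(subset)
            # Conflict-free: no chosen argument attacks another chosen one.
            conflict_free = not any(
                (y, x) in attack_set for y in chosen for x in chosen
            )
            # Stable: every argument left out is attacked by a chosen one.
            attacks_rest = all(
                any((y, x) in attack_set for y in chosen)
                for x in args
                if x not in chosen
            )
            if conflict_free and attacks_rest:
                result.append(chosen)
    return result


extensions = stable_extensions(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "c")],
)
print(extensions)
```

This graph has two stable extensions, {a, c} and {a, d}. Each one corresponds to a choice that a user would have to make between the mutually attacking arguments c and d.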
20. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Refined Solution put back in Recipe Order
| Step | Actions | Curator |
|---|---|---|
| E | rename("Book Title", "Book-Title") | Alice |
| M | transform("Date", "value.trim()") | Bob |
| H | del_row(4) | Alice |
| O | cell_edit(3, "Author", "Stanford, P.K.") | Bob |
| P | split_col("Author", ",") | Bob |
| J | del_col("Author 2") | Alice |
| Q | rename("Author 1", "Last Name") | Bob |
| S | join_col("Last Name", "Date", ",", "Citation") | Bob |
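
To see what the merged recipe does, it can be replayed on the example dataset. The following Python sketch uses simplified stand-ins for the recipe operations; these helper functions are hypothetical and are not OpenRefine's API. It assumes 1-based row numbers, that `split_col` splits on the first separator and strips surrounding whitespace, and that `value.trim()` behaves like Python's `str.strip`.

```python
def rename(rows, old, new):
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]

def transform(rows, col, fn):
    return [{**r, col: fn(r[col])} for r in rows]

def del_row(rows, n):
    # Row numbers are assumed to be 1-based, as in the recipes.
    return rows[: n - 1] + rows[n:]

def cell_edit(rows, n, col, value):
    return [{**r, col: value} if i == n else r
            for i, r in enumerate(rows, start=1)]

def split_col(rows, col, sep):
    out = []
    for r in rows:
        # Split once and pad to two parts so both new columns always exist.
        parts = [p.strip() for p in r[col].split(sep, 1)]
        parts += [""] * (2 - len(parts))
        out.append({**r, f"{col} 1": parts[0], f"{col} 2": parts[1]})
    return out

def del_col(rows, col):
    return [{k: v for k, v in r.items() if k != col} for r in rows]

def join_col(rows, a, b, sep, new):
    return [{**r, new: f"{r[a]}{sep}{r[b]}"} for r in rows]

# The example dataset D, including the stray whitespace around one date.
data = [
    {"Book Title": "Against Method", "Author": "Feyerabend, P.", "Date": "1975"},
    {"Book Title": "Changing Order", "Author": "Collins, H.M.", "Date": "  1985 "},
    {"Book Title": "Exceeding Our Grasp", "Author": "P. Kyle Stanford", "Date": "2006"},
    {"Book Title": "Theory of Information", "Author": "", "Date": "1992"},
]

# Steps E, M, H, O, P, J, Q, S of the merged recipe, in order.
merged = rename(data, "Book Title", "Book-Title")                 # E
merged = transform(merged, "Date", str.strip)                     # M
merged = del_row(merged, 4)                                       # H
merged = cell_edit(merged, 3, "Author", "Stanford, P.K.")         # O
merged = split_col(merged, "Author", ",")                         # P
merged = del_col(merged, "Author 2")                              # J
merged = rename(merged, "Author 1", "Last Name")                  # Q
merged = join_col(merged, "Last Name", "Date", ",", "Citation")   # S

citations = [r["Citation"] for r in merged]
print(citations)
```

Under these assumptions the replay leaves three rows with the columns Book-Title, Author, Date, Last Name, and Citation. Because the sketch joins with the bare "," separator from step S, its citations read "Feyerabend,1975" rather than "Feyerabend, 1975" as displayed in the result tables.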
21. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Et voilà! The merged recipe and combined solution!
22. Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
Conclusions (Work in Progress) & Future Work
An approach based on formal argumentation frameworks for
- modeling the actions of users’ data-cleaning recipes
- identifying conflicting actions across recipes
- providing users with new tools to help resolve these conflicts and generate a single, unified, merged recipe
● An algorithm that helps automatically process recipes and resolve conflicts
● Take dependencies into account when modeling
● Explore criteria that can be used to evaluate possible merged recipes