CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
Jialun Cao, Songqiang Chen, Wuqi Zhang, Hau Ching Lo, Shing-Chi Cheung, Yeting Li
Background – LLMs Become Popular
The software industry has increasingly embraced LLMs
6/20/25 2
Data Contamination - A Threat To Validity
Data Contamination – What is it?
LLMs may have already seen the test data during training.
[Figure: Train and Test data sets]
How are Human Students Assessed?
The way we assess humans:
[Figure: students learn from a textbook, then take an exam]
How are LLMs Assessed?
The way we assess AIs:
[Figure: LLMs, Textbook, Exam; stages: Learning, Taking Exams]
Challenges
• Training corpora are huge
  • Pretraining corpora surpass 774.5 TB
• Data contamination can be introduced indirectly
  • GitHub, GitLab
  • Blogs, forums
  • Social media feeds
Solution 1 – Collect new data
• Collect new data uploaded after LLMs’ cut-off date
[Figure: timeline from 2020 to 2025]
🤷 No more new data?
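The collection step can be sketched as a simple date filter. This is a minimal illustration, assuming each benchmark sample records when it was uploaded; the `Sample` type, its field names, and the example cut-off date are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Sample:
    """A benchmark item with its upload date (hypothetical schema)."""
    name: str
    uploaded: date


def after_cutoff(samples, cutoff):
    """Keep only samples uploaded strictly after the model's training cut-off."""
    return [s for s in samples if s.uploaded > cutoff]


samples = [
    Sample("old_task", date(2021, 5, 1)),
    Sample("new_task", date(2024, 11, 3)),
]
# The cut-off below is an assumed example, not a real model's cut-off date.
fresh = after_cutoff(samples, cutoff=date(2023, 12, 31))
print([s.name for s in fresh])
```

The filter only helps while data newer than the cut-off exists, which is the limitation this slide points at.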
Solution 2 – Refactor old data
• Code refactoring
• Challenges of code refactoring
  • Perturb the code while preserving its syntactic, semantic, and logical requirements
  • Using LLMs to refactor the code would reintroduce data contamination
Is Code Refactoring Really Effective?
• Take this piece of code as an example
• Overlap with The Stack v1 (6 TB of code data), measured with DataPortraits (https://dataportraits.org/): 40.98%
def encode_data(self, data, attributes):
    current_row = 0
    num_attributes = len(attributes)
    for row in data:
        new_data = []
        if len(row) > 0 and max(row) >= num_attributes:
            raise BadObject('Instance %d has %d attributes, expected %d' %
                            (current_row, max(row) + 1, num_attributes))
        for col in sorted(row):
            v = row[col]
            if v is None or v == '' or v != v:
                s = '?'
            else:
                s = encode_string(str(v))
            new_data.append('%d %s' % (col, s))
        current_row += 1
        yield ' '.join(['{', ','.join(new_data), '}'])
(A) Code from Scikit-Learn (overlap: 40.98%)
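To make the notion of an overlap ratio concrete, here is a simplified sketch. It is not the DataPortraits implementation: it assumes overlap is measured as the fraction of a snippet's character n-grams that also occur in a small in-memory corpus, and the function names, the window width `n`, and the toy corpus are all hypothetical.

```python
def ngrams(text, n):
    """Return the set of all character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_ratio(snippet, corpus_docs, n=20):
    """Fraction of the snippet's n-grams that also occur in the corpus."""
    snippet_grams = ngrams(snippet, n)
    if not snippet_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(snippet_grams & corpus_grams) / len(snippet_grams)


# Toy corpus standing in for the training data (assumption).
corpus = ["current_row = 0\nnum_attributes = len(attributes)\n"]
original = "num_attributes = len(attributes)"
renamed = "numAttributes = len(attributes)"
print(overlap_ratio(original, corpus))
print(overlap_ratio(renamed, corpus))
```

Renaming one identifier removes every n-gram that spans the changed characters, so the ratio for `renamed` is lower than for `original`. A real corpus of several terabytes would need a compact membership structure rather than a Python set.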
Is Code Refactoring Really Effective?
• Change naming style: overlap ratio with training data drops from 40.98% to 20.78%
def encode_data(self, data, attributes):
    currentRow = 0
    numAttributes = len(attributes)
    for row in data:
        newData = []
        if len(row) > 0 and max(row) >= numAttributes:
            raise BadObject('Instance %d has %d attributes, expected %d' %
                            (currentRow, max(row) + 1, numAttributes))
        for col in sorted(row):
            v = row[col]
            if v is None or v == '' or v != v:
                s = '?'
            else:
                s = encode_string(str(v))
            newData.append('%d %s' % (col, s))
        currentRow += 1
        yield ' '.join(['{', ','.join(newData), '}'])
(B) Code After Changing Naming Style (overlap: 20.78%)
Is Code Refactoring Really Effective?
• Flip if-else branch: overlap ratio with training data drops from 20.78% to 0%
def encode_data(self, data, attributes):
    currentRow = 0
    numAttributes = len(attributes)
    for row in data:
        newData = []
        if len(row) > 0 and max(row) >= numAttributes:
            raise BadObject('Instance %d has %d attributes, expected %d' %
                            (currentRow, max(row) + 1, numAttributes))
        for col in sorted(row):
            v = row[col]
            if not (v is None or v == '' or v != v):
                s = encode_string(str(v))
            else:
                s = '?'
            newData.append('%d %s' % (col, s))
        currentRow += 1
        yield ' '.join(['{', ','.join(newData), '}'])
(C) Code After Changing Naming Style + Flipping If-else Branch (overlap: 0.0%)
Code Refactoring Operators
1. Syntactic Operators
Syntactic Operators – If-condition Flipping
Before
def encode_data(self, data, attributes):
    current_row = 0
    num_attributes = len(attributes)
    for row in data:
        new_data = []
        if len(row) > 0 and max(row) >= num_attributes:
            raise BadObject('Instance %d has %d attributes, expected %d' %
                            (current_row, max(row) + 1, num_attributes))
(A) Original Code
After
def encode_data(self, data, attributes):
    current_row = 0
    num_attributes = len(attributes)
    for row in data:
        new_data = []
        if not (len(row) > 0 and max(row) >= num_attributes):
            pass
        else:
            raise BadObject('Instance %d has %d attributes, expected %d' %
                            (current_row, max(row) + 1, num_attributes))
(B) Code After Flipping If-else Branches
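If-condition flipping can be sketched with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). This is an illustrative sketch, not CodeCleaner's implementation; the class name `FlipIfElse` and the sample function `sign` are hypothetical.

```python
import ast


class FlipIfElse(ast.NodeTransformer):
    """Negate each `if` test and swap its two branches."""

    def visit_If(self, node):
        self.generic_visit(node)  # transform nested statements first
        negated = ast.UnaryOp(op=ast.Not(), operand=node.test)
        new_body = node.orelse or [ast.Pass()]  # a missing else becomes `pass`
        return ast.If(test=negated, body=new_body, orelse=node.body)


source = """
def sign(x):
    if x < 0:
        return -1
    else:
        return 1
"""
tree = ast.fix_missing_locations(FlipIfElse().visit(ast.parse(source)))
flipped = ast.unparse(tree)
print(flipped)

# Run the rewritten function to check that behaviour is preserved.
namespace = {}
exec(flipped, namespace)
print(namespace["sign"](-5), namespace["sign"](3))
```

An `if` without an `else` ends up with `pass` in its first branch, which matches the shape of the flipped code on this slide.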
Syntactic Operators – Loop Transformation
Before
def decode_rows(self, stream, conversors):
    for row in stream:
        values = _parse_values(row)
(A) Original Code
After
def decode_rows(self, stream, conversors):
    _iter2 = iter(stream)
    while True:
        try:
            row = next(_iter2)
        except StopIteration:
            break
        values = _parse_values(row)
(B) Code After Loop Transformation (for → while)
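The for → while rewrite can be sketched as an `ast.NodeTransformer` (Python 3.9+). This is an assumption-laden illustration rather than CodeCleaner's code: the class name `ForToWhile` is hypothetical, the `_iterN` naming merely imitates the `_iter2` seen on the slide, and `for ... else` loops are deliberately left untouched.

```python
import ast


class ForToWhile(ast.NodeTransformer):
    """Rewrite `for x in xs: body` as an iterator-driven `while True` loop."""

    def __init__(self):
        self.counter = 0

    def visit_For(self, node):
        self.generic_visit(node)  # rewrite nested loops first
        if node.orelse:           # for-else needs extra care; skip it
            return node
        self.counter += 1
        it_name = f"_iter{self.counter}"
        # _iterN = iter(xs)
        setup = ast.Assign(
            targets=[ast.Name(id=it_name, ctx=ast.Store())],
            value=ast.Call(func=ast.Name(id="iter", ctx=ast.Load()),
                           args=[node.iter], keywords=[]),
        )
        # try: x = next(_iterN)  /  except StopIteration: break
        fetch = ast.Try(
            body=[ast.Assign(
                targets=[node.target],
                value=ast.Call(func=ast.Name(id="next", ctx=ast.Load()),
                               args=[ast.Name(id=it_name, ctx=ast.Load())],
                               keywords=[]))],
            handlers=[ast.ExceptHandler(
                type=ast.Name(id="StopIteration", ctx=ast.Load()),
                name=None, body=[ast.Break()])],
            orelse=[], finalbody=[],
        )
        loop = ast.While(test=ast.Constant(value=True),
                         body=[fetch] + node.body, orelse=[])
        return [setup, loop]  # two statements replace the single `for`


source = """
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""
tree = ast.fix_missing_locations(ForToWhile().visit(ast.parse(source)))
rewritten = ast.unparse(tree)
print(rewritten)

namespace = {}
exec(rewritten, namespace)
print(namespace["total"]([1, 2, 3]))
```

`continue` and `break` inside the original body keep their meaning, because `continue` returns to the top of the `while` loop where the next item is fetched.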
Syntactic Operators – Iteration Transformation
Before
def encode_data(self, data, attributes):
    current_row = 0
    num_attributes = len(attributes)
    for row in data:
        new_data = []
        if len(row) > 0 and max(row) >= num_attributes:
            ...
(A) Original Code
After
def encode_data(self, data, attributes):
    current_row = 0
    num_attributes = len(attributes)
    for row in range(len(data)):
        new_data = []
        if len(data[row]) > 0 and max(data[row]) >= num_attributes:
            ...
(B) Code After Iteration Transformation (direct iteration → index iteration)
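One point worth checking before applying this operator is that index iteration needs `len()` and integer indexing, so it presumably applies only when the iterated object is a sequence. The sketch below illustrates that precondition with two hypothetical helper functions; it is not taken from CodeCleaner.

```python
def direct(data):
    """Direct iteration: works for any iterable."""
    return [max(row) for row in data]


def indexed(data):
    """Index iteration: needs len() and integer indexing, i.e. a sequence."""
    return [max(data[i]) for i in range(len(data))]


rows = [[1, 5], [7, 2]]
same_on_list = direct(rows) == indexed(rows)

try:
    indexed(row for row in rows)  # a generator has no len()
    generator_ok = True
except TypeError:
    generator_ok = False

print(same_on_list, generator_ok)
```

On a list the two forms agree, while the generator case raises `TypeError` in the indexed version.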
Syntactic Operators – Commutative Law Shuffling
Before
for col in sorted(row):
    v = row[col]
    if v is None or v == '' or v != v:
        s = '?'
    else:
        s = encode_string(str(v))
(A) Original Code
After
for col in sorted(row):
    v = row[col]
    if v == '' or v != v or v is None:
        s = '?'
    else:
        s = encode_string(str(v))
(B) Code After Applying Commutative Law in Logic Operators
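A minimal sketch of operand shuffling with the `ast` module (Python 3.9+) follows. It rotates the operands of each `and`/`or` chain, which reproduces the ordering shown on this slide; the class name `RotateBoolOperands` is hypothetical and this is not CodeCleaner's implementation.

```python
import ast


class RotateBoolOperands(ast.NodeTransformer):
    """Move the first operand of every `and`/`or` chain to the end."""

    def visit_BoolOp(self, node):
        self.generic_visit(node)
        node.values = node.values[1:] + node.values[:1]
        return node


source = "flag = v is None or v == '' or v != v"
tree = RotateBoolOperands().visit(ast.parse(source))
shuffled = ast.unparse(tree)
print(shuffled)
```

Because `and` and `or` short-circuit, reordering is only safe when the operands have no side effects and none of them guards another (for example, `v is not None and v.x` must keep its order).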
2. Semantic Operators
Semantic Operators – Special Parameter Appending
Before
def decode_rows(self, stream, conversors):
    for row in stream:
        values = _parse_values(row)
        if not isinstance(values, dict):
            raise BadLayout()
(A) Original Code
After
def decode_rows(self, stream, conversors, *args, **kwargs):
    for row in stream:
        values = _parse_values(row)
        if not isinstance(values, dict):
            raise BadLayout()
(B) Code After Appending Special Parameters
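Appending `*args` and `**kwargs` can be sketched as a small AST rewrite (Python 3.9+). The class name `AppendSpecialParams` and the sample function `decode` are hypothetical, and the name-collision guard is an assumption about what a careful implementation would need, not a description of CodeCleaner.

```python
import ast


class AppendSpecialParams(ast.NodeTransformer):
    """Add `*args` and `**kwargs` to functions that lack them."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        a = node.args
        taken = {p.arg for p in a.posonlyargs + a.args + a.kwonlyargs}
        # Skip a name that is already used by an ordinary parameter.
        if a.vararg is None and "args" not in taken:
            a.vararg = ast.arg(arg="args", annotation=None)
        if a.kwarg is None and "kwargs" not in taken:
            a.kwarg = ast.arg(arg="kwargs", annotation=None)
        return node


source = """
def decode(stream, conversor):
    return [conversor(x) for x in stream]
"""
tree = ast.fix_missing_locations(AppendSpecialParams().visit(ast.parse(source)))
extended = ast.unparse(tree)
print(extended)

namespace = {}
exec(extended, namespace)
print(namespace["decode"]([1, 2], str))
```

Existing call sites keep working because the appended parameters accept zero extra arguments.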
Semantic Operators – Identifier Renaming
Before
def encode_data(self, data, attributes):
    current_row = 0
    num_attributes = len(attributes)
    for row in data:
        new_data = []
        if len(row) > 0 and max(row) >= num_attributes:
            ...
(A) Original Code
After
def encode_data(self, data, attributes):
    current_row = 0
    num_attributes = len(attributes)
    for row in data:
        advanced_data = []
        if len(row) > 0 and max(row) >= num_attributes:
            ...
(B) Code After Renaming an Identifier
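Identifier renaming can be sketched as a rewrite of `ast.Name` nodes (Python 3.9+). The class name `RenameIdentifier` and the sample function `collect` are hypothetical; how CodeCleaner chooses the new name is not shown on the slide, so the replacement name is simply passed in here.

```python
import ast


class RenameIdentifier(ast.NodeTransformer):
    """Rename every use of one variable name (parameters and attributes are not touched)."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node


source = """
def collect(rows):
    new_data = []
    for row in rows:
        new_data.append(row * 2)
    return new_data
"""
tree = RenameIdentifier("new_data", "advanced_data").visit(ast.parse(source))
renamed = ast.unparse(tree)
print(renamed)

namespace = {}
exec(renamed, namespace)
print(namespace["collect"]([1, 2]))
```

A fuller version would also need to confirm that the new name does not clash with an existing identifier in the same scope.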
3. Code Style Operators
Code Style Operators – Code Normalization
Before
def __setattr__(self, key: str, val: Any) -> None:
    prefix = object.__getattribute__(self, "prefix")
    if prefix:
        prefix += "."
    prefix += key
    if key in self.d and not isinstance(self.d[key], dict):
        _set_option(prefix, val)
    else:
        raise OptionError("You can only set the value of existing options")
(A) Original Code
After
def __setattr__(self, key: str, val: Any) -> None:
    prefix = object.__getattribute__(self, 'prefix')
    if prefix:
        prefix += '.'
    prefix += key
    if key in self.d and (not isinstance(self.d[key], dict)):
        _set_option(prefix, val)
    else:
        raise OptionError('You can only set the value of existing options')
(B) Code After Normalization
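The normalized form above looks like the result of a parse/unparse round trip through Python's `ast` module, which is also what the toolkit snippet at the end of the deck uses. A minimal sketch, assuming Python 3.9+ for `ast.unparse`:

```python
import ast

# Double-quoted strings, irregular spacing, and a comment in the input.
source = 'x = {"a": 1}   # trailing comment\nname   =   "prefix"\n'

# Parsing discards formatting and comments; unparsing emits a canonical form.
normalized = ast.unparse(ast.parse(source))
print(normalized)
```

The round trip rewrites string literals with single quotes, collapses spacing, and drops comments, so it changes surface text without changing behaviour.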
Code Style Operators – Naming Style Switch
[Figure: code before and after switching the naming style]
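A naming style switch from snake_case to camelCase, as in the `current_row` → `currentRow` example earlier in the deck, can be sketched as follows (Python 3.9+). The helper `snake_to_camel` and the class `SwitchNamingStyle` are hypothetical names, and restricting the change to locally assigned variables is an assumption, not a documented CodeCleaner rule.

```python
import ast
import re


def snake_to_camel(name):
    """current_row -> currentRow; leading underscores and dunders are left alone."""
    return re.sub(r"(?<=[A-Za-z0-9])_([a-z])",
                  lambda m: m.group(1).upper(), name)


class SwitchNamingStyle(ast.NodeTransformer):
    """Convert the given variable names to camelCase wherever they are used."""

    def __init__(self, local_names):
        self.local_names = local_names

    def visit_Name(self, node):
        if node.id in self.local_names:
            node.id = snake_to_camel(node.id)
        return node


source = """
def count(rows):
    current_row = 0
    for row in rows:
        current_row += 1
    return current_row
"""
tree = ast.parse(source)
# Only names that are assigned somewhere in the snippet are candidates.
assigned = {n.id for n in ast.walk(tree)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
switched = ast.unparse(SwitchNamingStyle(assigned).visit(tree))
print(switched)

namespace = {}
exec(switched, namespace)
print(namespace["count"]([1, 2, 3]))
```

Limiting the rename to assigned names keeps builtins and imported functions such as `len` untouched; a fuller version would also check that a converted name does not collide with an existing one.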
Experiments show Effectiveness
The overlap with training data drops from 87% (red) to 22% (blue).
Individual Operator’s Effectiveness
[Figure: overlap of the original code vs. after applying each operator]
Effectiveness on Class-level Python code
The overlap with training data drops from 66% (red) to 32% (blue).
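Easy-to-Use Toolkit – CodeCleaner
import ast

# `code` is the source string; `refactorer` is a CodeCleaner operator (construction not shown on the slide)
root = ast.parse(code)                # parse the source into an AST
root, _ = refactorer.refactor(root)   # apply the refactoring operator
refactored_code = ast.unparse(root)   # turn the AST back into source (Python 3.9+)
CodeCleaner GitHub: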
https://github.com/ArabelaTso/CodeCleaner-v1/
Jialun Cao
