CodeCleaner: Mitigating Data Contamination for LLM Benchmarking

Jialun Cao, Songqiang Chen, Wuqi Zhang, Hau Ching Lo, Shing-Chi Cheung, Yeting Li

Background – LLMs Become Popular
The software industry has increasingly embraced LLMs.
6/20/25 2

Data Contamination - A Threat To Validity

Data Contamination – What is it?
LLMs may have already seen the test data during training.
[Figure: training data (Train) and test data (Test)]

How are Human Students Assessed?
The way we assess humans:
[Figure: Students; Textbook (Learning); Exam (Taking Exams)]

How are LLMs Assessed?
The way we assess AIs:
[Figure: LLMs; Textbook (Learning); Exam (Taking Exams)]

Challenges
• Training corpora are huge
  • Pretraining corpora surpass 774.5 TB!
• Data contamination can be introduced indirectly
  • GitHub, GitLab
  • Blogs, forums
  • Social media feeds

Solution 1 – Collect new data
• Collect new data uploaded after LLMs’ cut-off date
[Figure: timeline, 2020 to 2025]

• What if there is no more new data? 🤷

Solution 2 – Refactor old data
• Code refactoring

Solution 2 – Refactor old data
• Challenges of code refactoring
  • Perturb code while keeping its syntactic, semantic, and logic requirements
  • Using LLMs to refactor the code will reintroduce data contamination

Is Code Refactoring Really Effective?
• Take this piece of code as an example
• Overlap with the Stack-v1 (6 TB of code data): 40.98% (DataPortraits: https://dataportraits.org/)

(A) Code from Scikit-Learn (overlap: 40.98%)

    def encode_data(self, data, attributes):
        current_row = 0
        num_attributes = len(attributes)
        for row in data:
            new_data = []
            if len(row) > 0 and max(row) >= num_attributes:
                raise BadObject('Instance %d has %d attributes, expected %d' %
                                (current_row, max(row) + 1, num_attributes))
            for col in sorted(row):
                v = row[col]
                if v is None or v == '' or v != v:
                    s = '?'
                else:
                    s = encode_string(str(v))
                new_data.append('%d %s' % (col, s))
            current_row += 1
            yield ' '.join(['{', ','.join(new_data), '}'])

Change Naming Style

(B) Code After Changing Naming Style (overlap: 20.78%)

    currentRow = 0
    numAttributes = len(attributes)
    for row in data:
        newData = []
        if len(row) > 0 and max(row) >= numAttributes:
            raise BadObject('Instance %d has %d attributes, expected %d' %
                            (currentRow, max(row) + 1, numAttributes))
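
As rough intuition for why a naming-style change lowers the overlap, the sketch below counts how many character n-grams of a snippet also appear in a reference corpus. It is a simplified stand-in and not the DataPortraits method; the function names, the window size `n = 12`, and the tiny in-memory corpus are all assumptions made for illustration.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every length-n character window of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def overlap_ratio(snippet: str, corpus: str, n: int = 12) -> float:
    """Fraction of the snippet's n-grams that also occur in the corpus."""
    grams = char_ngrams(snippet, n)
    if not grams:
        return 0.0
    seen = set(char_ngrams(corpus, n))
    return sum(g in seen for g in grams) / len(grams)


# Hypothetical "training corpus" that contains the original snippet verbatim.
corpus = "current_row = 0\nfor row in data:\n    new_data = []\n"
original = "current_row = 0\nfor row in data:\n    new_data = []\n"
renamed = "currentRow = 0\nfor row in data:\n    newData = []\n"

print(f"original: {overlap_ratio(original, corpus):.2%}")
print(f"renamed:  {overlap_ratio(renamed, corpus):.2%}")
```

Windows that touch a renamed identifier no longer match, while untouched lines such as `for row in data:` still do, which is why renaming alone reduces the overlap without removing it.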

Is Code Refactoring Really Effective?
• Overlap ratio with training data: 40.98% → 20.78%
[Figure: (A) Code from Scikit-Learn (overlap: 40.98%); Change Naming Style; (B) Code After Changing Naming Style (overlap: 20.78%); If-else branch switch; (C) Code after Changing Naming Style + Flipping If-else Branch (overlap: 0.0%)]

Is Code Refactoring Really Effective?
• Overlap ratio with training data: 20.78% → 0%

After changing naming style (excerpt, overlap: 20.78%)

    v = row[col]
    if v is None or v == '' or v != v:
        s = '?'
    else:
        s = encode_string(str(v))

Flip if-else branch

After also flipping the if-else branch (excerpt, overlap: 0.0%)

    v = row[col]
    if not (v is None or v == '' or v != v):
        s = encode_string(str(v))
    else:
        s = '?'

Code Refactoring Operators

Syntactic Operators – If-condition Flipping
After Flipping If-else Branches

    current_row = 0
    for row in data:
        new_data = []
        if not (len(row) > 0 and max(row) >= num_attributes):
            pass
        else:
            raise BadObject('Instance %d has %d attributes, expected %d' %
                            (current_row, max(row) + 1, num_attributes))
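
A minimal sketch of how such a flip could be automated with Python's standard `ast` module. The class and function names are hypothetical and this is not necessarily how CodeCleaner implements the operator; it only assumes Python 3.9+ for `ast.unparse`.

```python
import ast


class FlipIfElse(ast.NodeTransformer):
    """Rewrite `if c: A else: B` as `if not (c): B else: A`."""

    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)  # flip nested if-statements first
        negated = ast.UnaryOp(op=ast.Not(), operand=node.test)
        # An if without an else branch gets `pass` as its new if-body.
        new_body = node.orelse or [ast.Pass()]
        return ast.If(test=negated, body=new_body, orelse=node.body)


def flip_if_else(code: str) -> str:
    tree = FlipIfElse().visit(ast.parse(code))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


source = """
if v is None or v == '':
    s = '?'
else:
    s = str(v)
"""
print(flip_if_else(source))
```

Negating the condition and swapping the branches keeps the behavior unchanged, while the token sequence around the `if` no longer matches the original text.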

Syntactic Operators – Loop transformation
(A) Original Code

    def decode_rows(self, stream, conversors):
        for row in stream:
            values = _parse_values(row)

Loop Transformation

(B) Code After Loop Transformation (for → while)

    def decode_rows(self, stream, conversors):
        _iter2 = iter(stream)
        while True:
            try:
                row = next(_iter2)
            except StopIteration:
                break
            values = _parse_values(row)
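
The for → while rewrite can be sketched with an `ast.NodeTransformer`. The names below (`ForToWhile`, `for_to_while`, the `_iterN` counter scheme) are assumptions for illustration, and the sketch deliberately leaves `for ... else` loops untouched because the `else` clause would need extra handling.

```python
import ast


class ForToWhile(ast.NodeTransformer):
    """Rewrite `for target in iterable:` as an explicit iter()/next() loop."""

    def __init__(self) -> None:
        self.counter = 0  # used to build hypothetical names _iter1, _iter2, ...

    def visit_For(self, node: ast.For):
        self.generic_visit(node)  # transform nested loops first
        if node.orelse:
            return node  # for/else is left unchanged in this sketch
        self.counter += 1
        it_name = f"_iter{self.counter}"
        # Template for `_iterN = iter(<iterable>)`; the placeholder arg is replaced.
        setup = ast.parse(f"{it_name} = iter(_)").body[0]
        setup.value.args = [node.iter]
        # Template loop; the placeholder target `_` is replaced below.
        loop = ast.parse(
            "while True:\n"
            "    try:\n"
            f"        _ = next({it_name})\n"
            "    except StopIteration:\n"
            "        break\n"
        ).body[0]
        loop.body[0].body[0].targets = [node.target]
        loop.body.extend(node.body)  # original loop body runs after next()
        return [setup, loop]


def for_to_while(code: str) -> str:
    tree = ForToWhile().visit(ast.parse(code))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


src = "total = 0\nfor x in [1, 2, 3]:\n    if x == 2:\n        continue\n    total += x\n"
print(for_to_while(src))
```

Because `next()` sits at the top of the `while` body, `continue` and `break` inside the original body keep their meaning. The generated iterator name could clash with an existing variable, which a production tool would need to check.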

Syntactic Operators – Iter transformation
(A) Original Code

    current_row = 0
    for row in data:
        new_data = []
        ...

Iteration Transformation

(B) Code After Iteration Transformation (direct iteration → index iteration)

    current_row = 0
    for row in range(len(data)):
        new_data = []
        if len(data[row]) > 0 and max(data[row]) >= num_attributes:
            ...
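
A hedged sketch of the direct → index rewrite using `ast`. It assumes the iterable is a plain name bound to an indexable sequence (a list, not a generator or dict) and that the loop variable is not read after the loop; the class and function names are hypothetical and not taken from CodeCleaner.

```python
import ast


class _IndexUses(ast.NodeTransformer):
    """Replace reads of `var` with `seq[var]`."""

    def __init__(self, var: str, seq: str) -> None:
        self.var, self.seq = var, seq

    def visit_Name(self, node: ast.Name):
        if node.id == self.var and isinstance(node.ctx, ast.Load):
            return ast.Subscript(
                value=ast.Name(id=self.seq, ctx=ast.Load()),
                slice=ast.Name(id=self.var, ctx=ast.Load()),
                ctx=ast.Load(),
            )
        return node


class DirectToIndex(ast.NodeTransformer):
    """Rewrite `for x in seq:` as `for x in range(len(seq)):`."""

    def visit_For(self, node: ast.For):
        self.generic_visit(node)  # handle inner loops first
        simple = isinstance(node.target, ast.Name) and isinstance(node.iter, ast.Name)
        if not simple or node.orelse:
            return node
        var, seq = node.target.id, node.iter.id
        # Skip loops whose body rebinds the loop variable or the sequence name.
        stores = {n.id for stmt in node.body for n in ast.walk(stmt)
                  if isinstance(n, ast.Name) and not isinstance(n.ctx, ast.Load)}
        if var in stores or seq in stores:
            return node
        rewriter = _IndexUses(var, seq)
        node.body = [rewriter.visit(stmt) for stmt in node.body]
        node.iter = ast.parse(f"range(len({seq}))", mode="eval").body
        return node


def direct_to_index(code: str) -> str:
    tree = DirectToIndex().visit(ast.parse(code))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


src = "out = []\nfor row in data:\n    if len(row) > 0:\n        out.append(max(row))\n"
print(direct_to_index(src))
```

The loop variable keeps its name but now holds an index, so every read inside the body becomes a subscript, as in panel (B).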

Syntactic Operators – Commutative Law Shuffling

(A) Original Code

    v = row[col]
    if v is None or v == '' or v != v:
        s = '?'
    else:
        s = encode_string(str(v))

Commutative Law

(B) Code After Applying Commutative Law in Logic Operators

    v = row[col]
    if v == '' or v != v or v is None:
        s = '?'
    else:
        s = encode_string(str(v))
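
A minimal sketch of operand reordering with `ast`. To keep the output deterministic it rotates the operands by one position (which reproduces the order in panel (B)) rather than shuffling randomly; the names are hypothetical. Note the assumption that operands are side-effect free and do not guard each other: `and`/`or` short-circuit, so reordering something like `len(row) > 0 and max(row) >= n` could change behavior.

```python
import ast


class RotateBoolOperands(ast.NodeTransformer):
    """Move the first operand of each `and`/`or` chain to the end."""

    def visit_BoolOp(self, node: ast.BoolOp) -> ast.BoolOp:
        self.generic_visit(node)  # reorder nested boolean expressions too
        node.values = node.values[1:] + node.values[:1]
        return node


def rotate_bool_operands(code: str) -> str:
    tree = RotateBoolOperands().visit(ast.parse(code))
    return ast.unparse(tree)


src = "ok = v is None or v == '' or v != v"
print(rotate_bool_operands(src))
```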

2. Semantic Operators

Semantic Operators – Special Parameter Appending

(A) Original Code

    def decode_rows(self, stream, conversors):
        for row in stream:
            values = _parse_values(row)
            if not isinstance(values, dict):
                raise BadLayout()

Appending special parameters

(B) Code After Appending Special Parameters

    def decode_rows(self, stream, conversors, *args, **kwargs):
        for row in stream:
            values = _parse_values(row)
            if not isinstance(values, dict):
                raise BadLayout()
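
One way to sketch this operator is to edit the function's `ast.arguments` node. The class and function names are hypothetical; the sketch skips functions that already use `args` or `kwargs` as parameter names. Existing call sites keep working, although calls with surplus arguments that used to raise `TypeError` are now accepted.

```python
import ast


class AppendSpecialParams(ast.NodeTransformer):
    """Add `*args` and `**kwargs` to function signatures that lack them."""

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        self.generic_visit(node)  # handle nested function definitions
        a = node.args
        used = {p.arg for p in a.posonlyargs + a.args + a.kwonlyargs}
        used |= {p.arg for p in (a.vararg, a.kwarg) if p is not None}
        if a.vararg is None and "args" not in used:
            a.vararg = ast.arg(arg="args", annotation=None)
        if a.kwarg is None and "kwargs" not in used:
            a.kwarg = ast.arg(arg="kwargs", annotation=None)
        return node


def append_special_params(code: str) -> str:
    tree = AppendSpecialParams().visit(ast.parse(code))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


src = "def decode_rows(self, stream, conversors):\n    return stream\n"
print(append_special_params(src))
```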

Semantic Operators – Identifier Renaming
(A) Original Code

    current_row = 0
    for row in data:
        new_data = []
        ...

Identifier Renaming

(B) Code After Renaming an Identifier

    current_row = 0
    for row in data:
        advanced_data = []
        ...
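
A minimal renaming sketch with `ast`. It only touches plain variable names (`ast.Name` nodes), not function parameters or attributes, and it assumes the new name is not already in use; how CodeCleaner chooses replacement names is not shown on the slide, so the names here are illustrative.

```python
import ast


class RenameIdentifier(ast.NodeTransformer):
    """Rename every occurrence of one variable name."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node


def rename_identifier(code: str, old: str, new: str) -> str:
    tree = RenameIdentifier(old, new).visit(ast.parse(code))
    return ast.unparse(tree)


src = "new_data = []\nnew_data.append(1)\n"
print(rename_identifier(src, "new_data", "advanced_data"))
```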

3. Code Style Operators

Code Style Operators - Code Normalization
(A) Original Code

    def __setattr__(self, key: str, val: Any) -> None:
        prefix = object.__getattribute__(self, "prefix")
        if prefix:
            prefix += "."
        prefix += key
        if key in self.d and not isinstance(self.d[key], dict):
            _set_option(prefix, val)
        else:
            raise OptionError("You can only set the value of existing options")

Code Normalization

(B) Code After Normalization

    def __setattr__(self, key: str, val: Any) -> None:
        prefix = object.__getattribute__(self, 'prefix')
        if prefix:
            prefix += '.'
        prefix += key
        if key in self.d and (not isinstance(self.d[key], dict)):
            _set_option(prefix, val)
        else:
            raise OptionError('You can only set the value of existing options')
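
The differences between (A) and (B), such as quote style and added parentheses, resemble what a parse/unparse round trip through Python's `ast` module produces. Assuming that is the idea behind the operator, a minimal sketch (Python 3.9+; the function name `normalize` is hypothetical) is:

```python
import ast


def normalize(code: str) -> str:
    """Round-trip source through the AST to get a canonical layout."""
    return ast.unparse(ast.parse(code))


print(normalize('x = "a"  # comments are dropped by the round trip'))
```

The round trip discards comments and original formatting, so the result is idempotent: normalizing already-normalized code returns it unchanged.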

Code Style Operators - Naming Style Switch
[Figure: code before and after the naming style switch]
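
A sketch of a snake_case → camelCase switch, in the spirit of the earlier `current_row` → `currentRow` example. It renames only variables that the snippet itself assigns, so externally defined names such as helper functions stay untouched; function parameters are also left alone. All names below are hypothetical, and the sketch does not check whether a converted name collides with an existing one.

```python
import ast
import re

_SNAKE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+")


def to_camel(name: str) -> str:
    """current_row -> currentRow; other names are returned unchanged."""
    if not _SNAKE.fullmatch(name):
        return name
    return re.sub(r"_([a-z0-9])", lambda m: m.group(1).upper(), name)


class SnakeToCamel(ast.NodeTransformer):
    def __init__(self, bound: set[str]) -> None:
        self.bound = bound  # names assigned somewhere in the snippet

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id in self.bound:
            node.id = to_camel(node.id)
        return node


def switch_naming_style(code: str) -> str:
    tree = ast.parse(code)
    bound = {n.id for n in ast.walk(tree)
             if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    return ast.unparse(SnakeToCamel(bound).visit(tree))


src = "current_row = 0\nnew_data = []\nnew_data.append(encode_string(current_row))\n"
print(switch_naming_style(src))
```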

Experiments show Effectiveness
The overlap with training data drops from 87% (red) to 22% (blue).

Individual Operator’s Effectiveness
[Figure legend: Original; After applying operators]

Effectiveness on Class-level Python code
The overlap with training data drops from 66% (red) to 32% (blue).

Easy-Use Toolkit – CodeCleaner
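    import ast

    # `refactorer` is a CodeCleaner refactoring operator; see the repository for how to construct one
    root = ast.parse(code)                # parse the source string into an AST
    root, _ = refactorer.refactor(root)   # apply the operator; the second return value is unused here
    refactored_code = ast.unparse(root)   # turn the refactored AST back into source code

CodeCleaner GitHub:
https://github.com/ArabelaTso/CodeCleaner-v1/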
Jialun Cao
