Automated Refactoring in
Large Python Codebases
Jimmy Lai, Staff Software Engineer at Carta
July 15, 2022
Problems in Large Codebases
● Code Formatting
● Type Checking
[Diagram: Carta's domain: startup founders, employees, and investors, connected through stock, options, money, compensation, valuation, tax, and fund administration]
Large Codebase
● Python code
● Developed for 10 years
● 200 active developers
● 2 million lines, 30,000 files
A codebase-wide change: 20,000 file changes
🚩 Challenge: impossible to merge due to conflicts
Problem: Code Formatting
Python allows a flexible code style. A code formatting tool (Black) enforces one consistent style.
Break up changes into small pieces incrementally
Instead of one change touching 20,000 files (🚩 impossible to merge due to conflicts), create many small changes of ~20 files each (1, 2, 3, ...).
🚩 Challenge: extra work to manage each change: create, review, and merge them while avoiding overlaps
🚩 Challenge: avoid regression during incremental adoption
Problem: Type Checking
Common Python Errors
Mypy can catch these errors statically by type checking. Runtime errors and the corresponding Mypy errors:
TypeError: f() got multiple values for keyword argument 'a'
AttributeError: 'str' object has no attribute 'args'
ValueError: not enough values to unpack (expected 4, got 3)
Error: "str" has no attribute "args"
Error: Attribute function "f" with type "Callable[[C], Any]" does not accept self argument
Error: Need more than 3 values to unpack (4 expected)
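For illustration (not from the slides), a minimal snippet that triggers the unpacking error above: it raises ValueError at runtime, while Mypy reports it statically before the code ever runs.

def parse_version(version: tuple[int, int, int]) -> None:
    # Runtime: ValueError: not enough values to unpack (expected 4, got 3)
    # Mypy:    error: Need more than 3 values to unpack (4 expected)
    major, minor, patch, build = version
    print(major, minor, patch, build)


parse_version((3, 10, 2))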
Type Annotations

def add(a, b):
    ...

def add(a: int, b: int) -> int:
    ...

add(1.2, 2)
Error: Argument 1 to "add" has incompatible type "float"; expected "int"

🚩 Challenge: too many functions (100k) without type annotations
Solution: Automated Refactoring
“What if we could automate incremental code changes?”
The lifecycle of a code change

Automatic refactoring pull requests are created per code path (1, 2, 3, 4, 5, 6). Each pull request goes through the following lifecycle:

● Apply code changes: the code changes go into a pull request.
● Test run: on failure (test failures), retry the test run after 1 day; on success, add reviewers.
● Wait for review: with 2 approvals, merge; with 1 approval and pending for 3 days, add more reviewers; with 0 approvals, post in the reviewer's team Slack channel.
● After some time, rebase; on a merge conflict, close the PR.

Automation: these actions are performed by pipeline jobs.
Design and Implementation
Size of Code Change
Tradeoffs:
● Too big: hard to review
● Too small: too many pull requests

from os import walk
from pathlib import Path

THRESHOLD = 20  # target roughly ~20 file changes per pull request

for dirpath, _, _ in walk("."):
    # len() does not work on a generator; count the matching files instead
    num_files = sum(1 for _ in Path(dirpath).glob("**/*.py"))
    if num_files < THRESHOLD:
        ...  # use dirpath as the target path of one pull request
Number of Pending PRs
Tradeoffs:
● Too many concurrent pending PRs introduce too much code review work for developers
● Too few PRs result in slow progress

# The gh CLI tool allows searching pending PRs by label:
# gh pr list --label automated-refactoring --json number,headRefName
from subprocess import check_output
from json import loads

pending_prs = loads(check_output(["gh", "pr", "list", "--label",
    "automated-refactoring", "--json", "number,headRefName"]).decode())
if len(pending_prs) < PENDING_PR_THRESHOLD:
    ...  # create more PRs
Avoid Duplications
Pull request paths should not overlap.
Encode the path info as part of the pull request for lookup, using the Git branch name:
"black_formatting@dir_a/dir_b"
"black_formatting@dir_a/dir_c"
Incremental Adoption
Protect automatically refactored files from regression using an enrollment process:
1. When a file is Black formatted, add the file path to an enrollment list.
2. A CI job runs the Black format check on all enrolled files to ensure they stay Black formatted (see the sketch below).
Once the entire codebase is enrolled, we can run Black on the entire codebase and remove the enrollment list.
The same approach applies to type annotation adoption.
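A minimal sketch of such a CI check, assuming the enrollment list is the plain-text file black_enrollment.txt mentioned later in the deck, with one path per line; everything else is illustrative.

from pathlib import Path
from subprocess import run
import sys

ENROLLMENT_FILE = Path("black_enrollment.txt")

def check_enrolled_files() -> int:
    # Read the enrolled paths (files or directories), one per line.
    enrolled = [line.strip() for line in ENROLLMENT_FILE.read_text().splitlines() if line.strip()]
    if not enrolled:
        return 0
    # "black --check" exits non-zero if any file would be reformatted,
    # which fails the CI job and protects enrolled files from regression.
    result = run(["black", "--check", *enrolled])
    return result.returncode

if __name__ == "__main__":
    sys.exit(check_enrolled_files())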
State Machine Transitions with Periodic Jobs
The code-change lifecycle above is a state machine: apply code changes, run tests (retry after 1 day on failure, add reviewers on success), wait for review (merge on 2 approvals, add more reviewers after 3 days with 1 approval, post in the reviewer's team Slack channel on 0 approvals), rebase after some time, and close the PR on a merge conflict. Each transition is an automation action performed by a pipeline job.
Jobs
1. Apply a code change and create a pull request
2. Check test status and add reviewers
3. Check review status and add more reviewers (see the sketch after this list)
4. Send Slack notifications for old PRs
5. Merge approved PRs
6. Close pull requests that have merge conflicts
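A minimal sketch of job 3, reusing PyGithub and the label from the other job snippets; the reviewer-selection helper and the handling of the 3-day threshold are illustrative, not Carta's implementation.

from datetime import datetime, timedelta, timezone
from github import Github  # use the PyGithub library

github = Github("${{ github.token }}")
repo = github.get_repo("${{ github.repository }}")

STALE_AFTER = timedelta(days=3)

def pick_extra_reviewers(pull) -> list[str]:
    return []  # illustrative: e.g. look up code owners for the PR's target path

for issue in repo.legacy_search_issues(
        "open", "label:automated-refactoring is:pull-request"):
    pull = repo.get_pull(issue.number)
    approvals = sum(1 for review in pull.get_reviews() if review.state == "APPROVED")
    created_at = pull.created_at
    if created_at.tzinfo is None:  # older PyGithub returns naive UTC datetimes
        created_at = created_at.replace(tzinfo=timezone.utc)
    # 1 approval and pending for 3 days: add more reviewers
    if approvals == 1 and datetime.now(timezone.utc) - created_at > STALE_AFTER:
        pull.create_review_request(reviewers=pick_extra_reviewers(pull))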
Automated Refactoring Application API

from abc import ABC
from pathlib import Path
from subprocess import check_output
from typing import List

class RefactoringApplication(ABC):
    ...

class BlackFormattingApplication(RefactoringApplication):
    pr_title: str = "Automated Black formatting on path %{target_path}"
    pr_labels: List[str] = ["automated-refactoring", "black-formatting"]
    pr_body: Path = Path("Black_formatting_body.md")
    commit_message: str = "Automated Black formatting on path %{target_path}"

    def skip_a_path(self, path: Path) -> bool:
        # Look up black_enrollment.txt to check whether the path is already
        # enrolled or is covered by another path
        return path_is_enrolled_in_black(path)

    def refactor(self, path: Path) -> None:
        # 1. Apply Black on the target path
        check_output(["black", str(path)])
        # 2. Add the path to the enrollment file black_enrollment.txt
        enroll_a_path_to_black_formatting(path)
Periodic Jobs
Use a GitHub workflow to run a Python script every 30 minutes during weekdays.

.github/workflows/create_pull_requests.yml

on:
  schedule:
    - cron: '*/30 * * * 1-5'  # every 30 minutes, Monday through Friday
jobs:
  create-pull-requests:
    runs-on: ubuntu-latest
    steps:
      - name: Apply code change and create a pull request
        shell: python
        run: |
          from subprocess import check_output
          check_output([...])
Job: Apply code change and create a pull request

from subprocess import check_output
from github import Github  # use the PyGithub library

github = Github("${{ github.token }}")
repo = github.get_repo("${{ github.repository }}")
for Application in RefactoringApplication.__subclasses__():
    app = Application()
    while len(app.get_pending_prs()) < PENDING_PR_THRESHOLD:
        path = app.get_next_path()
        app.refactor(path)
        check_output(["git", "add", str(path)])
        check_output(["git", "commit", "-m", app.get_commit_message()])
        repo.create_pull(title=app.get_pr_title(), body=app.get_pr_body(),
                         head=app.get_branch(), base=MAIN_BRANCH)
Job: Merge approved PRs

from github import Github  # use the PyGithub library

github = Github("${{ github.token }}")
repo = github.get_repo("${{ github.repository }}")
for issue in repo.legacy_search_issues("open",
        "label:automated-refactoring is:pull-request review:approved"):
    pull = repo.get_pull(issue.number)
    pull.merge()
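A minimal sketch of job 6 (close pull requests with merge conflicts) in the same style; the search query reuses the slide's label, while the mergeability check and the closing comment are illustrative.

from github import Github  # use the PyGithub library

github = Github("${{ github.token }}")
repo = github.get_repo("${{ github.repository }}")
for issue in repo.legacy_search_issues("open",
        "label:automated-refactoring is:pull-request"):
    pull = repo.get_pull(issue.number)
    # GitHub computes mergeability lazily: None means "not computed yet",
    # False means the branch conflicts with the base branch.
    if pull.mergeable is False:
        pull.create_issue_comment("Closing: this automated PR has a merge conflict.")
        pull.edit(state="closed")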
Refactor Python code to add missing types
Add missing types based on simple inference.
Add missing None return using LibCST

from pathlib import Path
from libcst import Annotation, CSTTransformer, FunctionDef, Name, Return, Yield, parse_module

class AddMissingNoneReturn(CSTTransformer):
    def __init__(self) -> None:
        self.return_count = 0

    def visit_FunctionDef(self, node: FunctionDef) -> None:
        self.return_count = 0

    def visit_Return(self, node: Return) -> None:
        self.return_count += 1

    def visit_Yield(self, node: Yield) -> None:
        # Generators must not be annotated with "-> None" either
        self.return_count += 1

    def leave_FunctionDef(self, original_node: FunctionDef, updated_node: FunctionDef) -> FunctionDef:
        if updated_node.returns is None and self.return_count == 0:
            return updated_node.with_changes(returns=Annotation(annotation=Name("None")))
        return updated_node

def refactor(python_file_path: Path) -> None:
    module = parse_module(python_file_path.read_text())
    modified_module = module.visit(AddMissingNoneReturn())
    python_file_path.write_text(modified_module.code)
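A small usage sketch of the refactor() helper above; the file name and its contents are hypothetical.

from pathlib import Path

example = Path("example.py")
example.write_text("def log(message):\n    print(message)\n")

refactor(example)  # refactor() from the snippet above

print(example.read_text())
# Expected output, roughly:
# def log(message) -> None:
#     print(message)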
Add missing complex types by collecting types at runtime
Using MonkeyType (see the sketch below):
1. Run your program with MonkeyType enabled
2. MonkeyType collects the types from each function call trace
3. Run the MonkeyType command to apply the collected types
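A minimal sketch of that flow, assuming a hypothetical module my_module with an entry point main(); monkeytype.trace() is MonkeyType's context-manager API for collecting traces.

import monkeytype

from my_module import main  # hypothetical module and entry point

# 1./2. Run the program with MonkeyType tracing enabled; call traces
#       (argument and return types) are stored in a local SQLite database.
with monkeytype.trace():
    main()

# 3. Apply the collected types from the command line:
#    monkeytype stub my_module    # preview the inferred annotations
#    monkeytype apply my_module   # rewrite my_module in place with annotations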
Results

[Chart: Black format coverage in our monolith codebase. The Black tool was added on 2020-10-30; manual adoption was followed by automated refactoring, reaching 100% coverage on 2021-10-22.]
[Chart: Type annotation coverage in our monolith codebase, driven by automated refactoring.]
[Chart: Production type error improvement.]
Summary
Automated refactoring is useful if you have a large codebase and a lot of tech debt to resolve.
An automated refactoring framework enables:
● Saving tons of development time
● Fixing tech debt incrementally and continuously
● Refactoring code in any programming language
Thank you for your attention!
Carta Engineering Blog https://medium.com/building-carta
Carta Jobs https://boards.greenhouse.io/carta
Tools:
● Black Formatting
● Mypy
● PyGithub
● LibCST
● MonkeyType