#MLonCode
Egor Bulychev
@egor_bu
Machine Learning
 GitHub
Machine Learning on Source Code
MLonCode
Examples
Code naturalness
class ???:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...
class Database:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...
class Foo:
    def bar(self, qux):
        # ...
    def baz(self, waldo):
        # ...
    def do(self, really):
        # ...
Projects similar to MariaDB/server?
twitter/mysql Tokutek/mysql-5.5
facebook/mysql-5.6 percona/percona-xtrabackup
Tokutek/mariadb-5.5 percona/percona-server
atcurtis/mariadb webscalesql/webscalesql-5.6
alibaba/AliSQL mysql/mysql-server
Details
Exploratory search
Similar code detection
• By style
• By structure
• By identifiers
Global graph
• Licenses
• Refactoring
Code autocompletion
Code style
class foobar:
    def connecttoserver(self):
        myserverhost = globalconfig.server.host

class FooBar:
    def connect_to_server(self):
        myServerHost = globalConfig.server.host
Many other applications
• Prediction of class, function, variable names
• Type inference
• Which comments do not make sense?
• Which comments are funny?
• Which APIs are bad or misused?
Your code as a crime scene
• Idea: mine the development history
• Book by Adam Tornhill
• codescene.io
Data
Datasets
• GHTorrent - everything except Git repositories, 70GB
• Public Git Archive - only Git repositories, 3TB
• GitHub Data in Google BigQuery
• rovers & borges - DIY
PGA
• 270k siva files
• CSV index
Details in the paper.
Tools for MLonCode
clone
discover
classify
checkout
filter
parse
analyze
Plumbing
source{d} engine
• siva or bare git repository loader for Apache Spark
• Classification, parsing (thanks to bblfsh), filtering, checkout, scaling
• Apache license
GitHub
source{d} engine
>>> from sourced.engine import Engine
>>> engine = Engine(spark, "/path/to/siva/files", "siva")
>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .select("blob_id", "path", "lang") \
...     .show()
AST
[Figure: AST of the Python function definition nearest_neighbors. Its arguments are the identifiers self and origin and the keyword argument k with the literal default 10. Its body holds a comment ("origin can be...") and a conditional with a condition, a true branch and a false branch; the nodes under the conditional are the builtins isinstance, tuple and list and the identifier origin.]
How to parse
• Regular expressions - Pygments, highlight.js
• Abstract syntax tree (AST) - ANTLR
• Compilation
 Universal AST
• ± uniform structure
• ± standard node types (roles)
• XPath queries
• 4 traversal orders
dashboard.bblf.sh
>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .filter('lang = "Python"') \
...     .extract_uasts() \
...     .query_uast('//*[@roleIdentifier]') \
...     .extract_tokens("result", "tokens") \
...     .select("blob_id", "path", "tokens")
Powerful
modelforge
• Automatically fetch from the internet
• "Model Store"
• Flexible, modern binary format - ASDF
• Versioning
• Reproducibility
• Support for various programming languages
GitHub
Machine
Learning
hercules
• Go
• Mining Git history
• Relies on go-git, Babelfish and TensorFlow
• Apache license
GitHub
MLonCode in hercules
• Structural hotness
• Comment sentiment
source{d} ml
• Python
• Targets large-scale MLonCode
• Runs on top of source{d} engine, modelforge and TensorFlow
• Apache license
GitHub
Solved practical problems
• Identifier splitting
• O(1) code similarity search
• Topic modeling
• Embeddings
Identifier embeddings
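$V_1$ ⇔ "foo"
$V_2$ ⇔ "bar"
$V_3$ ⇔ "integrate"
$\operatorname{distance}(V_1, V_2) < \operatorname{distance}(V_1, V_3)$
$$\operatorname{distance}(V_i, V_j) = \arccos \frac{V_i \cdot V_j}{\|V_i\| \, \|V_j\|}$$
• $V_i \cdot V_j$: scalar product
• $\|V_i\|$: norm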
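The distance above is the angle between two embedding vectors. A minimal sketch in plain Python, assuming vectors are given as lists of floats; the function name `angular_distance` and the 2-D example vectors are made up for illustration and are not real embeddings.

```python
import math

def angular_distance(u, v):
    """Angle in radians between u and v: arccos of their cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))        # scalar product
    norm_u = math.sqrt(sum(a * a for a in u))     # Euclidean norm of u
    norm_v = math.sqrt(sum(b * b for b in v))     # Euclidean norm of v
    # Clamp so floating-point rounding cannot push the value outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)

# Hypothetical 2-D vectors, chosen only so that "foo" is closer to "bar".
foo = [1.0, 0.0]
bar = [1.0, 1.0]
integrate = [0.0, 1.0]
```

With these toy vectors, `angular_distance(foo, bar)` is smaller than `angular_distance(foo, integrate)`, matching the inequality on the slide.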
How to estimate $V_i \cdot V_j$?
class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None
Splitting and normalization
_tcp_socket_connect -> [tcp, socket, connect]
AuthenticationError -> [authentication, error]
authentication, authenticate -> authenticate
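The first two rules above (splitting on underscores and on CamelCase boundaries, then lowercasing) can be sketched with regular expressions. This is one possible implementation, not necessarily the one used in the talk; `split_identifier` is a hypothetical name. The third rule, mapping "authentication" to "authenticate", is a stemming step and is left out here.

```python
import re

def split_identifier(name):
    """Split a snake_case or CamelCase identifier into lowercase tokens."""
    tokens = []
    # First cut on anything that is not a letter (underscores, digits, ...).
    for part in re.split(r"[^A-Za-z]+", name):
        # Then cut on case boundaries: an acronym such as "HTTP",
        # or an optionally capitalized lowercase word such as "Error".
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [token.lower() for token in tokens]
```

Calling `split_identifier("_tcp_socket_connect")` and `split_identifier("AuthenticationError")` reproduces the two token lists shown on the slide.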
Identifiers extracted from each scope of the Database class above:
• Whole class: database, connect, user, password, host, port, tcp, socket, authenticate, error, close
• connect method: connect, user, password, host, port, tcp, socket, authenticate, error, close
• Method signature: connect, user, password, host, port
• self._tcp_socket_connect(host, port): tcp, socket, connect, host, port
• try/except block: authenticate, user, password, error, socket, close
• self._authenticate(user, password): authenticate, user, password
• except clause and its body: authenticate, error, socket, close
• except AuthenticationError as e: authenticate, error
• self.socket.close(): socket, close
[Figure: graph over the identifiers connect, host, port, tcp, error, database, close, password, user, socket, authenticate, built up scope by scope]
Incidence matrix
• $C_{ij}$ = number of times $i$ and $j$ were together
• Also known as the co-occurrence matrix
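A minimal sketch of how such counts could be collected, assuming each scope is given as a list of identifiers and that a pair is counted once per scope containing both. The exact counting rule used in the talk is not stated, so that rule and the name `cooccurrence` are assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(scopes):
    """Count, for every unordered identifier pair, the scopes that contain both."""
    counts = Counter()
    for scope in scopes:
        # Sort so that each pair is always stored under the same (a, b) key.
        for a, b in combinations(sorted(set(scope)), 2):
            counts[(a, b)] += 1
    return counts

# Four of the scopes from the Database.connect example.
scopes = [
    ["authenticate", "user", "password"],
    ["authenticate", "error"],
    ["socket", "close"],
    ["authenticate", "error", "socket", "close"],
]
counts = cooccurrence(scopes)
```

Here `counts[("authenticate", "error")]` is 2 because the pair appears in two of the four scopes, while a pair that never shares a scope simply reads as 0.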
Pointwise Mutual Information (PMI)
$$V_i \cdot V_j = \mathrm{PMI}_{ij} = \log \frac{C_{ij} \sum C}{\sum_{k=1}^{N} C_{ik} \, \sum_{k=1}^{N} C_{jk}}$$
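The PMI formula can be evaluated directly on a small co-occurrence matrix. The sketch below assumes a square matrix stored as nested lists and returns negative infinity where two identifiers never co-occur; the name `pmi_matrix` and the 3x3 example counts are made up for illustration.

```python
import math

def pmi_matrix(C):
    """PMI_ij = log(C_ij * sum(C) / (sum_k C_ik * sum_k C_jk))."""
    total = sum(sum(row) for row in C)       # sum over all entries of C
    row_sums = [sum(row) for row in C]       # sum_k C_ik for every i
    n = len(C)
    result = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if C[i][j] > 0:                  # log is undefined for zero counts
                result[i][j] = math.log(C[i][j] * total / (row_sums[i] * row_sums[j]))
    return result

# Hypothetical symmetric counts for three identifiers.
C = [
    [0, 2, 1],
    [2, 0, 1],
    [1, 1, 0],
]
P = pmi_matrix(C)
```

For this matrix the total is 8 and the row sums are 3, 3 and 2, so `P[0][1]` equals log(16/9) and `P[0][2]` equals log(4/3).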
Representation Learning on Explicit Matrix
Stochastic Gradient Descent
Swivel
• Multi-GPU
• Multi-node
• Quality tricks
• "Swivel: Improving Embeddings by Noticing What's Missing" by Shazeer et al.
• TensorFlow implementation
Nearest to “foo”
• afoo • qux
• myfoo • baz
• mfoo • wibble
• dofoo • quux
• dfoo • testing
• ifoo
Analogies
“bug” - “test” + “expect” = “suppress”
“database” - “query” + “tune” = “settings”
“send” - “receive” + “pop” = “push”
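Analogies like these are usually computed by vector arithmetic followed by a nearest-neighbor lookup. The sketch below assumes cosine similarity as the ranking measure and uses hand-made 3-D vectors, not trained embeddings, so it only illustrates the mechanics; `analogy` and `toy` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, embeddings):
    """Return the word nearest to a - b + c, excluding the three query words."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (word for word in embeddings if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(embeddings[word], target))

# Made-up vectors; axes loosely stand for direction, network and stack.
toy = {
    "send": [1.0, 1.0, 0.0],
    "receive": [-1.0, 1.0, 0.0],
    "pop": [-1.0, 0.0, 1.0],
    "push": [1.0, 0.0, 1.0],
    "close": [0.0, 1.0, 1.0],
}
```

With these toy vectors, `analogy("send", "receive", "pop", toy)` returns "push", because send - receive + pop equals the vector assigned to "push".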
Typos
• recieve = receive
• grey = gray
• calback = callbak = callback
Summary
• MLonCode is fun
• There is data
• There are tools
• Community is forming
Idea? Doubt? Write to us!
Thank you
• egor@sourced.tech
• egorbu
• egor_bu
• sourcedtech
• blog.sourced.tech
• Awesome #MLonCode
bit.ly/2MAU7qG

Machine learning on source code