Introduction to the source{d} Stack

Machine Learning for Large Scale Code Analysis
October 2018

Vision: to empower through code
Organizations accumulate a massive amount of source code over time: an
actionable yet overlooked dataset that is ripe for advanced static analysis with
Machine Learning.
source{d} allows you to take developer tooling & engineering insights to the
next level, modernizing how we evaluate software development and write &
review code.

From bricks to code
Most valuable public companies
by market cap, 2006–2018
Today, the most valuable companies
come from tech while brick-and-mortar
businesses were quickly left behind.
What is behind this shift?

For decades already, value has been
shifting from tangible to intangible
assets, such as intellectual property
from computer software.
Why?
Because "Every business will become
a software business."
–Satya Nadella, Microsoft CEO
The value shift
S&P500 companies market value
composition, 1975–2015

Despite its rising importance, source
code is still an underutilized asset.
The inherent characteristics of source
code – volume, variety, intricacy,
versioning – make it a powerful asset
holding a wealth of implicit knowledge.
The challenge–opportunity window
Why haven't companies leveraged
source code as a competitive
advantage?

Source code analysis is a very hard
problem to tackle, even if you are a
dominant tech company.
There is no easy path towards the inevitable
digital transformation & business
intelligence on code. Those equipped
with adequate tools will lead.
The challenge–opportunity window
It is hard to retrieve and store source code across
scattered repositories in scalable ways.
Maintenance and complexity leave no budget &
resources for codebase modernization.
Tooling has not improved at the pace of innovation.
Engineering executives have poor visibility into
their organization's actual source code.
Companies are not getting any business insights
from their codebase.

The solution
"The future is already here — it's
just not very evenly distributed."
–William Gibson, sci-fi writer

Quality
Enhance the quality of your
codebase with better data
and insights, detecting
potential defects and
vulnerabilities while having
a deeper understanding of
your architecture.
Agility
Faster retrieval and
analysis of your source
code, improving the
efficiency of your
engineering organization
so you can ship on time.
Intelligence
All your codebases in a
single place coupled with
powerful machine learning
based analysis tools to
gain actionable insights for
your teams and business.
Become faster, better, smarter

How does it work? We made it simple(r)
CODE AS DATA
MACHINE LEARNING ON CODE

Sophisticated machine learning tools,
algorithms and applications on code.
VALUE PROPOSITION
Code as Data feeds machine learning
algorithms which power cutting-edge
applications, such as source{d} Lookout
for assisted code review.
Code as Data + ML on Code
Accessible, language-agnostic,
large-scale source code analysis.
VALUE PROPOSITION
Code becomes a first-class analyzable
asset, be it across thousands or tens
of millions of repositories, through our
powerful source{d} Engine.

How it stacks together
ML on Code
source{d} applications
Next-generation developer tooling taking advantage of machine
learning & big code, such as source{d} Lookout for assisted code review
source{d} ML
Tools & libraries to train ML models on public or private codebases, as
well as models pre-trained on big code
Code as Data
source{d} Engine
SQL interface & distributed computing using Apache Spark to
generate large datasets of Universal ASTs, analyzed directly or as
input to ML models
Retrieve and store source code from all of the world's public code,
or all on-premise code that is stored in version control systems

In summary, the one-stop-shop for your source code analysis needs:
I. Retrieve & store your company code history—or the world's—as a dataset.
II. Parse code as language-agnostic syntax trees & semantic concepts.
III. Query your code base using SQL & analyze it via a Spark API.
IV. Analyze features from code, train and apply machine learning models.
V. Empower developers & managers with code-driven tooling & intelligence.
All designed to perform in a flexible, distributed and scalable manner.
Powered by source{d}

Familiar APIs
Analyze your code
through powerful friendly
APIs, such as SQL, gRPC,
REST, and various client
libraries. Use tools you're
familiar with to create
reports and dashboards.
History Analysis
Extract information from
the evolution, commits,
and metadata of your
codebase and generate
detailed reports and
insights.
Code Retrieval
Retrieve and store the
code history of your
organization (including
your open-source
repositories) as a dataset.
source{d} Engine
Analysis in/for
any Language
Automatically identify
languages, parse source
code, and extract the
pieces that matter in a
distributed and
language-agnostic way.

Key challenges to use Code as Data
CHALLENGES → SOLUTIONS
Retrieving and storing code at scale → Distributed retrieval and storage
Detecting languages & parsing code → Language-agnostic Universal ASTs
Turning code and history into insights → SQL queries for repositories & UASTs
Running analysis pipelines at scale → Apache Spark for code analysis

Retrieving and storing code at scale
? Code bases can be extremely large.
? Code repositories are often scattered over teams & servers.
? Version Control Systems history knowledge requires special tooling.
? Code repositories often contain duplicates from the same root.
? Code repositories are frequently updated.
? Implementation & maintenance require massive data engineering effort.

Distributed retrieval and storage
[Diagram: Retrieval Architecture. Components: Public Code, Code Repositories, Discovery, Fetcher, Workers, Storage Layer]
✔ Distributed & scalable
✔ Repository discovery
✔ Keeps VCS history
✔ Efficient storage
✔ Efficient updating
✔ Saves large data engineering effort

Detecting languages & parsing code
? Identifying programming languages accurately & quickly.
? Analyzing code as plain text is limiting—compilers/interpreters use ASTs.
? Language-specific parsers are needed to tap the power of ASTs.
? Resulting ASTs differ wildly across languages; AST standardization is needed.
? Semantic concepts (e.g. functions) have no standard across languages.
? Speed and scalability.

Language-agnostic Universal ASTs
✔ Perform language-agnostic analysis
✔ Use standardized semantic concepts
✔ Do complex analyses powered by
Universal ASTs & XPath filters
✔ Add new language support easily
✔ Performance at scale
Source code → Universal ASTs

Turning code and history into insights
? Querying large amounts of code at different levels for insights.
? Typical tooling works only on the current version of your code.
? Access to code bases requires distributed, large scale data engineering.
? New data sources require teams to learn new languages and tools.
? Reports, dashboards, updates require custom work from developers.

SQL queries for repositories & UASTs
✔ Save on large data engineering effort
✔ Get answers from code history via SQL
✔ Query repositories, files, UASTs over
history
✔ Use MySQL tools your team knows
✔ Any team member can query for
reports, dashboards, charts
[Diagram: Architecture and data flow. Components: MySQL Server, Distributed Layer, Query engine, Git interface, Storage Layer]

Running analysis pipelines at scale
? Analyzing large amounts of code data and history for insights.
? Building performant custom data pipelines for ML on code over thousands or
millions of code repositories requires distributed computing.
? The complexity of pipelines and of integrating technical components in such projects
makes deployment in production demanding.
? Different systems/requirements for research/prototyping and production
environments, prompting extra development work.

Apache Spark for code analysis
✔ MapReduce for source code
✔ Friendly API extends Apache Spark™
✔ Integrate tech stack pieces seamlessly
✔ Build data pipelines over code,
VCS history, UASTs
✔ Run locally or on large scale distributed
clusters using containers
[Diagram: Architecture and data flow. Components: Apache Spark API, Engine instances (engine workers), Query instances (gitbase instances), Code parsing instances, Storage Layer]

Analyses via CLI, GUI, APIs
Graphical Web Client
Command Line Interface
Or your own tools via APIs, client libraries

Have questions? Ask your codebase
"What are our top 10 projects with the
most developers working on them?"
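A question like this reduces to one aggregate query over commit metadata. The sketch below is only an illustration: it uses Python's built-in sqlite3 with a toy in-memory table, and the table name `commits`, its columns `repository_id` and `author_email`, and the helper `top_projects` are assumptions made for the example, not the actual schema or API of the source{d} Engine.

```python
import sqlite3

# Toy stand-in for commit metadata. The table and column names are
# illustrative assumptions, not a documented source{d} schema.
ROWS = [
    ("repo-a", "ann@example.com"),
    ("repo-a", "bob@example.com"),
    ("repo-a", "ann@example.com"),
    ("repo-b", "cy@example.com"),
    ("repo-c", "ann@example.com"),
    ("repo-c", "bob@example.com"),
    ("repo-c", "cy@example.com"),
]

# Count distinct commit authors per repository, largest first.
TOP_PROJECTS_SQL = """
SELECT repository_id, COUNT(DISTINCT author_email) AS developers
FROM commits
GROUP BY repository_id
ORDER BY developers DESC, repository_id ASC
LIMIT ?
"""


def top_projects(rows, limit=10):
    """Return (repository_id, developer_count) pairs, most developers first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE commits (repository_id TEXT, author_email TEXT)"
        )
        conn.executemany("INSERT INTO commits VALUES (?, ?)", rows)
        return conn.execute(TOP_PROJECTS_SQL, (limit,)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for repo, developers in top_projects(ROWS):
        print(repo, developers)
```

Against a real deployment the same shape of query would be sent through whichever MySQL-compatible client the team already uses; only the table and column names would need to match the actual schema.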
"How many new projects do we start per
year in our organization?"

"How many repositories does our codebase
have per programming language?"
“Are our security keys for certificates or
applications exposed in our code?”
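One simple way to approach this question is to scan file contents for PEM private-key headers. The sketch below is a minimal, self-contained illustration and not source{d}'s implementation: the function name `find_exposed_keys`, the regular expression, and the dict-of-files input format are all assumptions chosen for the example, and a real scan would cover more key formats and the full commit history.

```python
import re

# Matches PEM private-key headers such as
# "-----BEGIN RSA PRIVATE KEY-----" or "-----BEGIN PRIVATE KEY-----".
PRIVATE_KEY_RE = re.compile(r"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def find_exposed_keys(files):
    """Scan a mapping of path -> text and return (path, line_number)
    for every line that contains a private-key header."""
    findings = []
    for path, text in sorted(files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PRIVATE_KEY_RE.search(line):
                findings.append((path, lineno))
    return findings


if __name__ == "__main__":
    sample = {
        "config/settings.py": "DEBUG = True\nKEY = '-----BEGIN RSA PRIVATE KEY-----'\n",
        "docs/readme.txt": "-----BEGIN PUBLIC KEY-----\n",
    }
    for path, lineno in find_exposed_keys(sample):
        print(f"{path}:{lineno}: possible private key")
```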


Have questions? UASTs have answers
import bblfsh

def check(uast):
    """Flag string concatenations whose left operand starts with a SQL keyword."""
    findings = []
    sql_commands = {"SELECT", "UPDATE", "DELETE", "INSERT",
                    "CREATE", "ALTER", "DROP"}
    # Binary "+" expressions anywhere in the UAST.
    infixes = bblfsh.filter(uast, "//InfixExpression[@roleAdd and @roleBinary and @roleOperator]")
    for i in infixes:
        # String literals used as the left operand of the concatenation.
        strs = bblfsh.filter(i, "//String[@internalRole='leftOperand']")
        for s in strs:
            words = s.properties["Value"].split()
            # Skip empty literals, which have no first word.
            if words and words[0] in sql_commands:
                findings.append({"msg": "Potential SQL injection vulnerability",
                                 "pos": s.start_position})
    return findings
"Is our code vulnerable to SQL injection
types of attacks?"
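The same heuristic can be tried without a UAST parser by using Python's built-in `ast` module on Python source only. This is a hedged analogue rather than the UAST-based check: the function name `check_python` is hypothetical, and the check only recognizes a string literal that starts with a SQL keyword and is the left operand of a `+`.

```python
import ast

SQL_COMMANDS = {"SELECT", "UPDATE", "DELETE", "INSERT",
                "CREATE", "ALTER", "DROP"}


def check_python(source):
    """Return one finding per `"<SQL keyword> ..." + something` expression."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Binary "+" whose left operand is a string literal.
        if (isinstance(node, ast.BinOp)
                and isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.left.value, str)):
            words = node.left.value.split()
            # Skip empty literals, which have no first word.
            if words and words[0] in SQL_COMMANDS:
                findings.append({"msg": "Potential SQL injection vulnerability",
                                 "line": node.left.lineno})
    return findings


if __name__ == "__main__":
    code = 'query = "SELECT * FROM users WHERE id = " + user_id\n'
    print(check_python(code))
```

Like the UAST version, this is a coarse heuristic: it reports concatenation onto a SQL-looking literal and does not track whether the right-hand side is actually user-controlled.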

Code As Data Roadmap
Timeline: 2015, 2016, 2017, 2018, 2019
Distributed SQL for source code
Implement a distributed layer over MySQL that allows users to query all
the history of repos, code & UASTs
Cross-Reference Resolution Provided
Full integration of cross-references to enable more powerful
dependency-aware static analysis
First data pipeline for the world’s open source code
Index 10+ million git repositories and process git & source code by
extending Apache Spark
Universal ASTs are created
We now have the ability to analyze source code agnostic to languages as
Universal Abstract Syntax Trees
Creation of go-git
Starting to build what would become one of 3 reference
implementations of Git

source{d} Engine - Enterprise Edition
SOURCE{D} ENGINE FOR THE ENTERPRISE
Distributed Analysis
Multi-node / Hybrid Cluster (Public Clouds and On Prem) analysis for large scale
distributed codebases
Security & Governance
Controlled Code Deployment, Forensic code history, RBAC, 3rd party integrations,
Audit Logs, Deployment Options, etc.
Support & Certification
Enterprise grade support and SLA, Certified plugins & infrastructure

Areas of interest for ML on Code
Security & Compliance
Malicious Actor Detection, Vulnerability Detection, Malicious Code Detection
Assisted Code Review
Style Conventions, Idiomatic Code, Naming Suggestions, Architecture Suggestions
Assisted QA & Testing
Test Suggestions, Test Generation
Bug Detection & Prediction
Predicting & detecting bugs by learning from code history and issues
Performance
Memory, CPU & battery optimizations
Duplicate Code Detection
Language-agnostic duplicate & similar code detection from project to function level

ML on Code applications
assisted code review
naming suggestions
style transfer on code
vulnerability detection
defect detection
code suggestions
automatic bug fixing
inductive programming
automatic refactoring
AI-based pair programming
natural language to code
math to code
neural compilers
transpilation
[Axis: near term → the future]

ML on Code Roadmap
Timeline: 2016, 2017, 2018, 2019
Public Git Archive dataset
Published the largest dataset of code to date at 3TB and 180k OSS projects
Gemini for Duplicate Code Detection
Duplicate & similar code detection at scale up to function level
Automate parts of the code review process with multiple analyzers, such as
code style analyzers
First ML on Code models
First models trained on code: topic modeling of 18M repositories and
clustering of duplicate code over 10M repos
sourced.ml is created
Fundamental for ML on Code R&D: feature extraction, model training
ML on Code research repository
The largest reference of machine learning on code with 3k+ readers

ML on Code-powered applications will
revolutionize software development.
source{d} ML is the go-to place for ML
on Code tools, models & research.
Empower devs & managers with apps such as
source{d} Lookout for code review &
source{d} Gemini to detect similar code.
Code as Data is inevitable; those
equipped to benefit will be ahead.
source{d} Engine is the one-stop-shop
for large-scale code analysis needs.
Distill the knowledge from your code
and benefit from more agility, quality
and intelligence over your SDLC.
Takeaways

Roadmap
Timeline: 2015, 2016, 2017, 2018, 2019
Cross-Reference Resolution Provided
Universal ASTs are created
sourced.ml is created
ML on Code research repository
Distributed SQL for Git
Public Git Archive dataset
Gemini for Duplicate Code Detection
First data pipeline for the world’s open source code
First ML on Code models
Creation of go-git

An international Open Core company
30+ employees worldwide; remote first with offices in San Francisco, Madrid
and soon Seattle.
Three pillars: For developers; By developers; Opinionatedly Free Spoken.
Practitioners of Open Source, Open Science and Open Company philosophies.
Experienced founders, senior team members and domain-expert advisors.
$10 million funding from Xavier Niel, Otium, Sunstone Capital and others.

Team: key people
Francesc Campoy
VP of Developer
Community
‒ Key Golang Developer
Advocate at Google for
the last 5 years
‒ International software
engineer with
extensive experience
developing in C++ at Google
and Amadeus
Vadim Markovtsev
Lead Machine Learning
Engineer
‒ Creator of Samsung’s Veles:
distributed machine learning
platform (responsible for
>90% of the code)
‒ Led Mail.ru anti-spam efforts
‒ Former associate professor at
the Moscow Institute of
Physics and Technology
Victor Coisne
Head of Growth and
Community
‒ 5+ years as Head of
Community at Docker
‒ Open Source contributor to
many developer advocacy,
community education &
engagement programs
‒ Experienced partner
enablement & relations
manager (Microsoft, IBM,
Digital Ocean, etc)
Eiso Kant
Co Founder & CEO
‒ Programming since the age of
12, these days in Haskell & Go
‒ Co-founder of Tyba
(2011-2015)
‒ Founder of Twollars
(2008-2009)
‒ Used to create software that
automatically generated
websites
Jorge Schnura
Co Founder & COO
‒ Responsible for operations,
“never drops the ball”
‒ Co-founder at Tyba
(2011-2015)
‒ Previously in finance
‒ Loves to automate boring
things in Python
Máximo Cuadros
CTO
‒ 5+ years as CTO, 15+ years
experience
‒ Self-taught, programming for
more than 20 years (polyglot)
‒ Open source contributor to
many projects (CoreOS,
Terraform)
‒ Active member of the Golang
community

sourced.tech
blog.sourced.tech
github.com/src-d
Machine Learning for
Large Scale Code Analysis
