Machine Learning for Large Scale Code Analysis
October 2018
Vision: to empower through code
Organizations accumulate a massive amount of source code over time: an
actionable yet overlooked dataset that is ripe for advanced static analysis with
Machine Learning.
source{d} allows you to take developer tooling & engineering insights to the
next level, modernizing how we evaluate software development and write &
review code.
Table of contents
Today's landscape
From bricks to code
Most valuable public companies
by market cap, 2006–2018
Today, the most valuable companies
come from tech while brick-and-mortar
businesses were quickly left behind.
What is behind this shift?
For decades already, value has been
shifting from tangible to intangible
assets, such as intellectual property
from computer software.
Why?
Because "Every business will become
a software business."
–Satya Nadella, Microsoft CEO
The value shift
S&P500 companies market value
composition, 1975–2015
Despite its rising importance, source
code is still an underutilized asset.
The inherent characteristics of source
code – volume, variety, intricacy,
versioning – make it a powerful asset
holding a wealth of implicit knowledge.
The challenge–opportunity window
Why haven't companies leveraged
source code as a competitive
advantage?
Source code analysis is a very hard
problem to tackle, even if you are a
dominant tech company.
There is no easy path towards the inevitable
digital transformation & business
intelligence on code. Those equipped
with adequate tools will lead.
The challenge–opportunity window
It is hard to retrieve and store source code across
scattered repositories in scalable ways.
Maintenance and complexity leave no budget &
resources for codebase modernization.
Tooling has not improved at the pace of innovation.
Engineering executives have poor visibility into
their organization's actual source code.
Companies are not getting any business insights
from their codebase.
The solution
"The future is already here — it's
just not very evenly distributed."
–William Gibson, sci-fi writer
Become faster, better, smarter
Quality
Enhance the quality of your codebase with better data and insights, detecting potential defects and vulnerabilities while having a deeper understanding of your architecture.
Agility
Faster retrieval and analysis of your source code, improving the efficiency of your engineering organization so you can ship on time.
Intelligence
All your codebases in a single place, coupled with powerful machine-learning-based analysis tools, to gain actionable insights for your teams and business.
How does it work? We made it simple(r)
Code as Data + ML on Code
CODE AS DATA
Accessible, language-agnostic, large-scale source code analysis.
VALUE PROPOSITION
Code becomes a first-class analyzable asset, be it across thousands or tens of millions of repositories, through our powerful source{d} Engine.
MACHINE LEARNING ON CODE
Sophisticated machine learning tools, algorithms and applications on code.
VALUE PROPOSITION
Code as Data feeds machine learning algorithms which power cutting-edge applications, such as source{d} Lookout for assisted code review.
How it stacks together
ML on Code
source{d} applications: Next-generation developer tooling taking advantage of machine learning & big code, such as source{d} Lookout for assisted code review
source{d} ML: Tools & libraries to train ML models on public or private codebases, as well as models pre-trained on big code
Code as Data
source{d} Engine: SQL interface & distributed computing using Apache Spark to generate large datasets of Universal ASTs, analyzed directly or as input to ML models
Retrieve and store source code from all of the world's public code, or all on-premise code that is stored in version control systems
In summary, the one-stop-shop for your source code analysis needs:
I. Retrieve & store your company code history—or the world's—as a dataset.
II. Parse code as language-agnostic syntax trees & semantic concepts.
III. Query your code base using SQL & analyze it via a Spark API.
IV. Analyze features from code, train and apply machine learning models.
V. Empower developers & managers with code-driven tooling & intelligence.
All designed to perform in a flexible, distributed and scalable manner.
Powered by source{d}
Code as Data
source{d} Engine
Familiar APIs
Analyze your code through powerful, friendly APIs, such as SQL, gRPC, REST, and various client libraries. Use tools you're familiar with to create reports and dashboards.
History Analysis
Extract information from the evolution, commits, and metadata of your codebase and generate detailed reports and insights.
Code Retrieval
Retrieve and store the code history of your organization (including your open-source repositories) as a dataset.
Analysis in/for any Language
Automatically identify languages, parse source code, and extract the pieces that matter in a distributed and language-agnostic way.
Key challenges to use Code as Data
CHALLENGES → SOLUTIONS
Retrieving and storing code at scale → Distributed retrieval and storage
Detecting languages & parsing code → Language-agnostic Universal ASTs
Turning code and history into insights → SQL queries for repositories & UASTs
Running analysis pipelines at scale → Apache Spark for code analysis
Retrieving and storing code at scale
? Code bases can be extremely large.
? Code repositories are often scattered over teams & servers.
? Extracting knowledge from version control system history requires special tooling.
? Code repositories often contain duplicates from the same root.
? Code repositories are frequently updated.
? Implementation & maintenance require massive data engineering effort.
Distributed retrieval and storage
[Figure: Retrieval Architecture. Components: Public Code, Code Repositories, Discovery, Fetcher, Workers, Storage Layer]
✔ Distributed & scalable
✔ Repository discovery
✔ Keeps VCS history
✔ Efficient storage
✔ Efficient updating
✔ Saves large data engineering effort
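The "duplicates from the same root" challenge and the "efficient storage" claim can be illustrated by grouping repositories that share an initial commit, since forks of one project start from the same history. This is a minimal sketch under stated assumptions: `group_by_root` and the sample repository names and hashes are hypothetical, and this is not source{d}'s actual storage implementation.

```python
from collections import defaultdict


def group_by_root(repo_roots):
    """Group repository names by the hash of their root (initial) commit.

    repo_roots maps a repository name to the hash of its first commit.
    Repositories that share a root are copies of the same history and
    could, in principle, share one object store.
    """
    groups = defaultdict(list)
    for name, root in repo_roots.items():
        groups[root].append(name)
    # Sort names so the result is deterministic.
    return {root: sorted(names) for root, names in groups.items()}


# Hypothetical sample data: two repositories share the root "a1f3".
repos = {
    "org/service": "a1f3",
    "alice/service-fork": "a1f3",
    "org/website": "9c0d",
}
grouped = group_by_root(repos)
```

Storing each group once, instead of once per fork, is one plausible way to keep full VCS history without paying for every duplicate.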
Detecting languages & parsing code
? Identifying programming languages accurately & quickly.
? Analyzing code as plain text is limiting—compilers/interpreters use ASTs.
? Language-specific parsers are needed to tap the power of ASTs.
? Resulting ASTs differ widely across languages, so AST standardization is needed.
? Semantic concepts (e.g. functions) have no standard across languages.
? Speed and scalability.
Language-agnostic Universal ASTs
✔ Perform language-agnostic analysis
✔ Use standardized semantic concepts
✔ Do complex analyses powered by
Universal ASTs & XPath filters
✔ Add new language support easily
✔ Performance at scale
Source code → Universal ASTs
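The idea of standardized semantic concepts can be sketched with a tiny language-neutral node type: once every parser emits nodes tagged with shared roles, one query works for any language. The sketch below uses Python's standard `ast` module for the Python side and a hand-built tree standing in for a second language. The `Node` class and the role names are hypothetical simplifications, not the actual UAST schema.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified, language-neutral syntax node (hypothetical schema)."""
    roles: set
    token: str = ""
    children: list = field(default_factory=list)


def from_python(source):
    """Convert Python source into the simplified tree, keeping only functions."""
    root = Node(roles={"File"})
    for item in ast.walk(ast.parse(source)):
        if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
            root.children.append(
                Node(roles={"Function", "Declaration"}, token=item.name)
            )
    return root


def function_names(tree):
    """Language-agnostic query: tokens of all nodes carrying the Function role."""
    found = [tree.token] if "Function" in tree.roles else []
    for child in tree.children:
        found.extend(function_names(child))
    return found


python_tree = from_python("def load():\n    pass\n\ndef save():\n    pass\n")
# A tree such as a parser for another language might emit (built by hand here).
other_tree = Node(
    roles={"File"},
    children=[Node(roles={"Function", "Declaration"}, token="main")],
)

names = function_names(python_tree) + function_names(other_tree)
```

The query `function_names` never inspects which language produced the tree, which is the property the slide attributes to Universal ASTs.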
Turning code and history into insights
? Querying large amounts of code at different levels for insights.
? Typical tooling works only on the current version of your code.
? Access to code bases requires distributed, large scale data engineering.
? New data sources require teams to learn new languages and tools.
? Reports, dashboards, updates require custom work from developers.
SQL queries for repositories & UASTs
✔ Save on large data engineering effort
✔ Answer questions from code history via SQL
✔ Query repositories, files, UASTs over
history
✔ Use MySQL tools your team knows
✔ Any team member can query for
reports, dashboards, charts
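To make the "query your code with SQL" idea concrete, here is a minimal sketch that counts repositories per programming language. It uses Python's built-in `sqlite3` as a stand-in for the MySQL-compatible server named on the slide. The `files` table, its columns, and the sample rows are assumptions made for illustration and do not reproduce the product's real schema.

```python
import sqlite3

# In-memory database standing in for a SQL endpoint over repositories.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (repository_id TEXT, file_path TEXT, language TEXT)"
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("org/service", "main.go", "Go"),
        ("org/service", "util.go", "Go"),
        ("org/website", "app.py", "Python"),
        ("org/tools", "cli.go", "Go"),
        ("org/tools", "build.py", "Python"),
    ],
)

# Count the distinct repositories containing at least one file per language.
rows = conn.execute(
    """
    SELECT language, COUNT(DISTINCT repository_id) AS repos
    FROM files
    GROUP BY language
    ORDER BY repos DESC, language
    """
).fetchall()
```

Because the interface is plain SQL, the same query could be pasted into any reporting or dashboard tool that speaks to a SQL database.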
[Figure: Architecture and data flow. Components: Storage Layer, Git interface, Query engine, MySQL Server, Distributed Layer]
Running analysis pipelines at scale
? Analyzing large amounts of code data and history for insights.
? Building performant custom data pipelines for ML on code over thousands or
millions of code repositories requires distributed computing.
? The complexity of the pipeline and integration components in such projects makes
deployment in production demanding.
? Different systems/requirements for research/prototyping and production
environments prompt extra development work.
Apache Spark for code analysis
✔ MapReduce for source code
✔ Friendly API extends Apache Spark™
✔ Integrate tech stack pieces seamlessly
✔ Build data pipelines over code,
VCS history, UASTs
✔ Run locally or on large scale distributed
clusters using containers
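"MapReduce for source code" can be sketched without a cluster: a map step turns each file into partial counts and a reduce step merges them. The example below uses only the Python standard library as a stand-in for Apache Spark. The function names and the two sample files are hypothetical, and a real pipeline would distribute the same two steps across workers.

```python
import re
from collections import Counter
from functools import reduce

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def map_file(content):
    """Map step: turn one file's content into partial identifier counts."""
    return Counter(IDENTIFIER.findall(content))


def merge(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical sample files.
files = {
    "a.py": "total = price + tax",
    "b.py": "price = base_price * rate",
}
partials = map(map_file, files.values())
counts = reduce(merge, partials, Counter())
```

Because `merge` is associative, the partial counts can be combined in any order, which is what lets this shape of computation scale out.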
[Figure: Architecture and data flow. Components: Apache Spark API, Engine instances (engine workers), Query instances (gitbase instances), Code parsing instances, Storage Layer]
Analyses via CLI, GUI, APIs
Graphical Web Client
Command Line Interface
Or your own tools via APIs, client libraries
Have questions? Ask your codebase
"What are our top 10 projects with the
most developers working on them?"
"How many new projects do we start per
year in our organization?"
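Both questions above reduce to aggregations over commit history. The sketch below answers them with Python's built-in `sqlite3` as a stand-in for a SQL interface to Git data. The `commits` table, its column names, and the sample rows are assumptions for illustration, and a project's start year is approximated by the year of its earliest commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits (repository_id TEXT, author_email TEXT, author_when TEXT)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [
        ("org/service", "ana@example.com", "2016-03-01"),
        ("org/service", "bo@example.com", "2016-04-11"),
        ("org/service", "ana@example.com", "2017-01-20"),
        ("org/website", "cy@example.com", "2017-06-05"),
        ("org/tools", "ana@example.com", "2018-02-14"),
        ("org/tools", "cy@example.com", "2018-02-15"),
        ("org/tools", "bo@example.com", "2018-05-30"),
    ],
)

# Top 10 projects ranked by number of distinct commit authors.
top_projects = conn.execute(
    """
    SELECT repository_id, COUNT(DISTINCT author_email) AS developers
    FROM commits
    GROUP BY repository_id
    ORDER BY developers DESC, repository_id
    LIMIT 10
    """
).fetchall()

# New projects per year, using each repository's earliest commit date.
projects_per_year = conn.execute(
    """
    SELECT first_year, COUNT(*) FROM (
        SELECT substr(MIN(author_when), 1, 4) AS first_year
        FROM commits
        GROUP BY repository_id
    )
    GROUP BY first_year
    ORDER BY first_year
    """
).fetchall()
```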
Have questions? Ask your codebase
"How many repositories does our codebase
have per programming language?"
“Are our security keys for certificates or
applications exposed in our code?”
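A question like this can be approximated by scanning file contents for the PEM header that marks the start of a private key. The following is a minimal sketch, not a complete secret scanner: `find_exposed_keys` and the sample files are hypothetical, and real tooling would cover many more credential formats.

```python
import re

# PEM headers that mark the start of a private key block.
PRIVATE_KEY = re.compile(
    r"-----BEGIN (?:RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"
)


def find_exposed_keys(files):
    """Return (path, line number) pairs where a private key header appears."""
    hits = []
    for path, content in files.items():
        for number, line in enumerate(content.splitlines(), start=1):
            if PRIVATE_KEY.search(line):
                hits.append((path, number))
    return hits


# Hypothetical file contents keyed by path.
sample = {
    "deploy/key.pem": "-----BEGIN RSA PRIVATE KEY-----\nMIIB...\n",
    "README.md": "Generate a key with ssh-keygen.\n",
    "config.py": "DEBUG = True\nKEY = '''-----BEGIN PRIVATE KEY-----'''\n",
}
hits = find_exposed_keys(sample)
```

Running the same scan over every historical version of each file, rather than only the current one, would also surface keys that were committed and later deleted.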
Have questions? Ask your codebase
Have questions? UASTs have answers
"Is our code vulnerable to SQL injection
types of attacks?"
import bblfsh

def check(uast):
    """Flag string concatenations whose left operand starts with an SQL keyword."""
    findings = []
    sql_commands = {"SELECT", "UPDATE", "DELETE", "INSERT",
                    "CREATE", "ALTER", "DROP"}
    # Binary "+" expressions, the usual shape of string concatenation.
    infixes = bblfsh.filter(uast, "//InfixExpression[@roleAdd and @roleBinary and @roleOperator]")
    for i in infixes:
        # String literals used as the left operand of the concatenation.
        strs = bblfsh.filter(i, "//String[@internalRole='leftOperand']")
        for s in strs:
            words = s.properties["Value"].split()
            # Guard against empty literals before reading the first word.
            if words and words[0] in sql_commands:
                findings.append({"msg": "Potential SQL injection vulnerability",
                                 "pos": s.start_position})
    return findings
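The same heuristic can be tried without a UAST server. The sketch below is an analogous check written against Python's standard `ast` module, so it only handles Python source, whereas the UAST version above is meant to work across languages. `check_python` and the sample snippet are hypothetical and serve only to show the logic running end to end.

```python
import ast

SQL_COMMANDS = {"SELECT", "UPDATE", "DELETE", "INSERT", "CREATE", "ALTER", "DROP"}


def check_python(source):
    """Report string literals starting with an SQL keyword on the left of a '+'."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Only binary "+" expressions are candidates for concatenation.
        if not (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add)):
            continue
        left = node.left
        if isinstance(left, ast.Constant) and isinstance(left.value, str):
            words = left.value.split()
            if words and words[0] in SQL_COMMANDS:
                findings.append({"msg": "Potential SQL injection vulnerability",
                                 "line": left.lineno})
    return findings


code = (
    'greeting = "Hello, " + name\n'
    'query = "SELECT * FROM users WHERE id = " + user_id\n'
)
findings = check_python(code)
```

Like the original, this is a syntactic heuristic: it flags a suspicious pattern for review and does not prove that user input reaches the query.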
Code As Data Roadmap
Timeline: 2015, 2016, 2017, 2018, 2019
Distributed SQL for source code: Implement a distributed layer over MySQL that allows users to query all the history of repos, code & UASTs
Cross-Reference Resolution Provided: Full integration of cross-references to enable more powerful dependency-aware static analysis
First data pipeline for the world’s open source code: Index 10+ million git repositories and process git & source code by extending Apache Spark
Universal ASTs are created: We now have the ability to analyze source code agnostic to languages as Universal Abstract Syntax Trees
Creation of go-git: Starting to build what would become one of 3 reference implementations of Git
source{d} Engine - Enterprise Edition
SOURCE{D} ENGINE FOR THE ENTERPRISE
Distributed Analysis: Multi-node / Hybrid Cluster (Public Clouds and On Prem) analysis for large scale distributed codebases
Security & Governance: Controlled Code Deployment, Forensic code history, RBAC, 3rd party integrations, Audit Logs, Deployment Options, etc.
Support & Certification: Enterprise grade support and SLA, Certified plugins & infrastructure
Machine Learning on Code
Areas of interest for ML on Code
Security & Compliance: Malicious Actor Detection, Vulnerability Detection, Malicious Code Detection
Assisted Code Review: Style Conventions, Idiomatic Code, Naming Suggestions, Architecture Suggestions
Assisted QA & Testing: Test Suggestions, Test Generation
Bug Detection & Prediction: Predicting & detecting bugs by learning from code history and issues
Performance: Memory, CPU & battery optimizations
Duplicate Code Detection: Language-agnostic duplicate & similar code detection from project to function level
ML on Code applications
assisted code review
naming suggestions
style transfer on code
vulnerability detection
defect detection
code suggestions
automatic bug fixing
inductive programming
automatic refactoring
AI-based pair programming
natural language to code
math to code
neural compilers
transpilation
Axis: near term → the future
ML on Code Roadmap
Timeline: 2016, 2017, 2018, 2019
Public Git Archive dataset: Published the largest dataset of code to date at 3TB and 180k OSS projects
Gemini for Duplicate Code Detection: Duplicate & similar code detection at scale up to function level
Assisted Code Review: Automate parts of the code review process with multiple analyzers, such as code style analyzers
First ML on Code models: First models trained on code: topic modeling of 18M repositories and clustering of duplicate code over 10M repos
sourced.ml is created: Fundamental for ML on Code R&D: feature extraction, model training
ML on Code research repository: The largest reference of machine learning on code with 3k+ readers
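Duplicate and similar code detection, the goal named for Gemini in the roadmap, rests on a similarity measure between code fragments. The sketch below uses Jaccard similarity over token shingles, a deliberately simple stand-in chosen for illustration. It is not Gemini's actual algorithm, and all function names and sample snippets are hypothetical.

```python
import re

# An identifier, or any single non-whitespace character (operators, brackets).
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\S")


def shingles(code, size=3):
    """Return the set of consecutive token n-grams of a code snippet."""
    tokens = TOKEN.findall(code)
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "def add(a, b): return a + b"
copy = "def add(a, b):\n    return a + b"  # same tokens, different layout
different = "class Stack: items = []"

same_score = jaccard(shingles(original), shingles(copy))
other_score = jaccard(shingles(original), shingles(different))
```

Tokenizing before comparing makes the measure insensitive to whitespace and layout, so a reformatted copy still scores as identical.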
Summary
ML on Code-powered applications will
revolutionize software development.
source{d} ML is the go-to place for ML
on Code tools, models & research.
Empower devs & managers with apps such as
source{d} Lookout for code review &
source{d} Gemini to detect similar code.
Code as Data is inevitable; those
equipped to benefit will be ahead.
source{d} Engine is the one-stop-shop
for large-scale code analysis needs.
Distill the knowledge from your code
and benefit from more agility, quality
and intelligence over your SDLC.
Takeaways
Roadmap
Timeline: 2015, 2016, 2017, 2018, 2019
CODE AS DATA
First data pipeline for the world’s open source code
Creation of go-git
Universal ASTs are created
Distributed SQL for Git
Cross-Reference Resolution Provided
MACHINE LEARNING ON CODE
First ML on Code models
sourced.ml is created
ML on Code research repository
Public Git Archive dataset
Gemini for Duplicate Code Detection
Assisted Code Review
Who we are
An international Open Core company
30+ employees worldwide; remote first with offices in San Francisco, Madrid
and soon Seattle.
Three pillars: For developers; By developers; Opinionatedly Free Spoken.
Practitioners of Open Source, Open Science and Open Company philosophies.
Experienced founders, senior team members and domain-expert advisors.
$10 million funding from Xavier Niel, Otium, Sunstone Capital and others.
Team: key people
Francesc Campoy
VP of Developer
Community
‒ Key Golang Developer
Advocate at Google for
the last 5 years
‒ International software
engineer with
extensive experience in
C++ developing at Google
and Amadeus
Vadim Markovtsev
Lead Machine Learning
Engineer
‒ Creator of Samsung’s Veles:
distributed machine learning
platform (responsible for
>90% of the code)
‒ Led Mail.ru anti-spam efforts
‒ Former associate professor at
the Moscow Institute of
Physics and Technology
Victor Coisne
Head of Growth and
Community
‒ 5+ years as Head of
Community at Docker
‒ Open Source contributor to
many developer advocacy,
community education &
engagement programs
‒ Experienced partner
enablement & relations
manager (Microsoft, IBM,
Digital Ocean, etc)
Eiso Kant
Co Founder & CEO
‒ Programming since the age of
12, these days in Haskell & Go
‒ Co-founder of Tyba
(2011-2015)
‒ Founder of Twollars
(2008-2009)
‒ Used to create software that
automatically generated
websites
Jorge Schnura
Co Founder & COO
‒ Responsible for operations,
“never drops the ball”
‒ Co-founder at Tyba
(2011-2015)
‒ Previously in finance
‒ Loves to automate boring
things in Python
Máximo Cuadros
CTO
‒ 5+ years as CTO, 15+ years
experience
‒ Self-taught, programming for
more than 20 years (polyglot)
‒ Open source contributor to
many projects (CoreOS,
Terraform)
‒ Active member of the Golang
community
sourced.tech
blog.sourced.tech
github.com/src-d

 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 

Introduction to the source{d} Stack

  • 1. Machine Learning for Large Scale Code Analysis October 2018
  • 2. Vision: to empower through code. Organizations accumulate a massive amount of source code over time: an actionable yet overlooked dataset that is ripe for advanced static analysis with Machine Learning. source{d} allows you to take developer tooling & engineering insights to the next level, modernizing how we evaluate software development and write & review code.
  • 5. From bricks to code. [Chart: most valuable public companies by market cap, 2006–2018.] Today, the most valuable companies come from tech while brick-and-mortar businesses were quickly left behind. What is behind this shift?
  • 6. The value shift. For decades already, value has been shifting from tangible to intangible assets, such as intellectual property from computer software. Why? Because "Every business will become a software business." –Satya Nadella, Microsoft CEO. [Chart: S&P 500 companies' market value composition, 1975–2015.]
  • 7. The challenge–opportunity window. Despite its rising importance, source code is still an underutilized asset. The inherent characteristics of source code – volume, variety, intricacy, versioning – make it a powerful asset holding a wealth of implicit knowledge. Why haven't companies leveraged source code as a competitive advantage?
  • 8. The challenge–opportunity window. Source code analysis is a very hard problem to tackle, even if you are a dominant tech company. There is no easy path towards the inevitable digital transformation & business intelligence on code. Those equipped with adequate tools will lead. It is hard to retrieve and store source code across scattered repositories in scalable ways. Maintenance and complexity leave no budget & resources for codebase modernization. Tooling has not improved at the pace of innovation. Engineering executives have poor visibility into their organization's actual source code. Companies are not getting any business insights from their codebase.
  • 9. The solution. "The future is already here – it's just not very evenly distributed." –William Gibson, sci-fi writer
  • 10. Become faster, better, smarter. Quality: Enhance the quality of your codebase with better data and insights, detecting potential defects and vulnerabilities while gaining a deeper understanding of your architecture. Agility: Faster retrieval and analysis of your source code improve the efficiency of your engineering organization so you can ship on time. Intelligence: All your codebases in a single place, coupled with powerful machine learning based analysis tools, to gain actionable insights for your teams and business.
  • 11. How does it work? We made it simple(r): CODE AS DATA and MACHINE LEARNING ON CODE.
  • 12. Code as Data + ML on Code. CODE AS DATA: Accessible, language-agnostic, large-scale source code analysis. Value proposition: Code becomes a first-class analyzable asset, be it across thousands or tens of millions of repositories, through our powerful source{d} Engine. MACHINE LEARNING ON CODE: Sophisticated machine learning tools, algorithms and applications on code. Value proposition: Code as Data feeds machine learning algorithms which power cutting-edge applications, such as source{d} Lookout for assisted code review.
  • 13. How it stacks together. Code as Data: source{d} Engine. Retrieve and store source code from all of the world's public code, or all code on premise that is stored in version control systems. SQL interface & distributed computing using Apache Spark to generate large datasets of Universal ASTs, analyzed directly or as input to ML models. ML on Code: source{d} ML. Tools & libraries to train ML models in public or private codebases as well as models pre-trained on big code. source{d} applications: Next-generation developer tooling taking advantage of machine learning & big code, such as source{d} Lookout for assisted code review.
  • 14. In summary, the one-stop-shop for your source code analysis needs: I. Retrieve & store your company code history, or the world's, as a dataset. II. Parse code as language-agnostic syntax trees & semantic concepts. III. Query your code base using SQL & analyze it via a Spark API. IV. Analyze features from code, train and apply machine learning models. V. Empower developers & managers with code-driven tooling & intelligence. All designed to perform in a flexible, distributed and scalable manner. Powered by source{d}.
  • 16. source{d} Engine. Code Retrieval: Retrieve and store the code history of your organization (including your open-source repositories) as a dataset. Analysis in/for any Language: Automatically identify languages, parse source code, and extract the pieces that matter in a distributed and language-agnostic way. History Analysis: Extract information from the evolution, commits, and metadata of your codebase and generate detailed reports and insights. Familiar APIs: Analyze your code through powerful friendly APIs, such as SQL, gRPC, REST, and various client libraries. Use tools you're familiar with to create reports and dashboards.
  • 17. Key challenges to use Code as Data (challenge → solution). Retrieving and storing code at scale → Distributed retrieval and storage. Detecting languages & parsing code → Language-agnostic Universal ASTs. Turning code and history into insights → SQL queries for repositories & UASTs. Running analysis pipelines at scale → Apache Spark for code analysis.
  • 18. Retrieving and storing code at scale. ? Code bases can be extremely large. ? Code repositories are often scattered over teams & servers. ? Extracting knowledge from version control history requires special tooling. ? Code repositories often contain duplicates from the same root. ? Code repositories are frequently updated. ? Implementation & maintenance require massive data engineering effort.
  • 19. Distributed retrieval and storage. [Retrieval architecture diagram: Public Code, Code Repositories, Discovery, Fetcher, Workers, Storage Layer.] ✔ Distributed & scalable ✔ Repository discovery ✔ Keeps VCS history ✔ Efficient storage ✔ Efficient updating ✔ Saves large data engineering effort
  • 20. Detecting languages & parsing code. ? Identifying programming languages accurately & quickly. ? Analyzing code as plain text is limiting; compilers and interpreters use ASTs. ? Language-specific parsers are needed to tap the power of ASTs. ? Resulting ASTs differ wildly across languages, so AST standardization is needed. ? Semantic concepts (e.g. functions) have no standard across languages. ? Speed and scalability.
  • 21. Language-agnostic Universal ASTs. [Diagram: source code → Universal ASTs.] ✔ Perform language-agnostic analysis ✔ Use standardized semantic concepts ✔ Do complex analyses powered by Universal ASTs & XPath filters ✔ Add new language support easily ✔ Performance at scale
  • 22. Turning code and history into insights. ? Querying large amounts of code at different levels for insights. ? Typical tooling works only on the current version of your code. ? Access to code bases requires distributed, large-scale data engineering. ? New data sources require teams to learn new languages and tools. ? Reports, dashboards, updates require custom work from developers.
  • 23. SQL queries for repositories & UASTs. ✔ Save on large data engineering effort ✔ Answer from code history via SQL ✔ Query repositories, files, UASTs over history ✔ Use MySQL tools your team knows ✔ Any team member can query for reports, dashboards, charts. [Architecture and data-flow diagram: Storage Layer, Git interface, Query engine, MySQL Server, Distributed Layer.]
  • 24. Running analysis pipelines at scale. ? Analyzing large amounts of code data and history for insights. ? Building performant custom data pipelines for ML on code over thousands or millions of code repositories requires distributed computing. ? The complexity of the pipelines and of integrating the technical components in such projects makes deployment in production demanding. ? Different systems and requirements for research/prototyping and production environments prompt extra development work.
  • 25. Apache Spark for code analysis. ✔ MapReduce for source code ✔ Friendly API extends Apache Spark™ ✔ Integrate tech stack pieces seamlessly ✔ Build data pipelines over code, VCS history, UASTs ✔ Run locally or on large scale distributed clusters using containers. [Architecture and data-flow diagram: Apache Spark API, engine workers (Engine instances), gitbase instances (Query instances), Storage Layer, Code parsing instances.]
  • 26. Analyses via CLI, GUI, APIs: Graphical Web Client, Command Line Interface, or your own tools via APIs and client libraries.
  • 27. Have questions? Ask your codebase. "What are our top 10 projects with the most developers working on them?" "How many new projects do we start per year in our organization?"
  • 28. Have questions? Ask your codebase. "How many repositories does our codebase have per programming language?" "Are our security keys for certificates or applications exposed in our code?"
  • 29. Have questions? Ask your codebase. "How many repositories does our codebase have per programming language?"
  • 30. Have questions? UASTs have answers. "How many repositories does our codebase have per programming language?" "Is our code vulnerable to SQL injection types of attacks?"

      import bblfsh

      def check(uast):
          """Flag string concatenations whose left operand starts with an SQL keyword."""
          findings = []
          sql_commands = {"SELECT", "UPDATE", "DELETE", "INSERT", "CREATE", "ALTER", "DROP"}
          # Binary "+" expressions, selected by their UAST roles.
          infixes = bblfsh.filter(uast, "//InfixExpression[@roleAdd and @roleBinary and @roleOperator]")
          for i in infixes:
              # String literals used as the left operand of the concatenation.
              strs = bblfsh.filter(i, "//String[@internalRole='leftOperand']")
              for s in strs:
                  words = s.properties["Value"].split()
                  # Skip empty or whitespace-only literals, which have no first word.
                  if words and words[0] in sql_commands:
                      findings.append({"msg": "Potential SQL injection vulnerability",
                                       "pos": s.start_position})
          return findings
  • 31. Code As Data Roadmap (timeline 2015–2019). First data pipeline for the world's open source code: Index 10+ million git repositories and process git & source code by extending Apache Spark. Creation of go-git: Starting to build what would become one of 3 reference implementations of Git. Universal ASTs are created: We now have the ability to analyze source code agnostic to languages as Universal Abstract Syntax Trees. Distributed SQL for source code: Implement a distributed layer over MySQL that allows users to query all the history of repos, code & UASTs. Cross-Reference Resolution Provided: Full integration of cross-references to enable more powerful dependency-aware static analysis.
  • 32. source{d} Engine - Enterprise Edition: SOURCE{D} ENGINE FOR THE ENTERPRISE. Distributed Analysis: Multi-node / Hybrid Cluster (Public Clouds and On Prem) analysis for large scale distributed codebases. Security & Governance: Controlled Code Deployment, Forensic code history, RBAC, 3rd party integrations, Audit Logs, Deployment Options, etc. Support & Certification: Enterprise grade support and SLA, Certified plugins & infrastructure.
  • 34. Areas of interest for ML on Code. Security & Compliance: Malicious Actor Detection, Vulnerability Detection, Malicious Code Detection. Assisted Code Review: Style Conventions, Idiomatic Code, Naming Suggestions, Architecture Suggestions. Bug Detection & Prediction: Predicting & detecting bugs by learning from code history and issues. Performance: Memory, CPU & battery optimizations. Assisted QA & Testing: Test Suggestions, Test Generation. Duplicate Code Detection: Language-agnostic duplicate & similar code detection from project to function level.
  • 35. ML on Code applications (axis: near term → the future): assisted code review, naming suggestions, style transfer on code, vulnerability detection, defect detection, code suggestions, automatic bug fixing, inductive programming, automatic refactoring, AI-based pair programming, natural language to code, math to code, neural compilers, transpilation.
  • 36. ML on Code Roadmap (timeline 2016–2019). First ML on Code models: First models trained on code: topic modeling of 18M repositories and clustering of duplicate code over 10M repos. sourced.ml is created: Fundamental for ML on Code R&D: feature extraction, model training. ML on Code research repository: The largest reference of machine learning on code with 3k+ readers. Public Git Archive dataset: Published the largest dataset of code to date at 3TB and 180k OSS projects. Gemini for Duplicate Code Detection: Duplicate & similar code detection at scale, down to function level. Assisted Code Review: Automate parts of the code review process with multiple analyzers, such as code style analyzers.
  • 38. Takeaways. CODE AS DATA: Code as Data is inevitable; those equipped to benefit will be ahead. source{d} Engine is the one-stop-shop for large-scale code analysis needs. Distill the knowledge from your code and benefit from more agility, quality and intelligence over your SDLC. MACHINE LEARNING ON CODE: ML on Code-powered applications will revolutionize software development. source{d} ML is the go-to place for ML on Code tools, models & research. Empower devs & managers with apps such as source{d} Lookout for code review & source{d} Gemini to detect similar code.
  • 39. Roadmap (timeline 2015–2019). Code as Data: First data pipeline for the world's open source code; Creation of go-git; Universal ASTs are created; Distributed SQL for Git; Cross-Reference Resolution Provided. ML on Code: First ML on Code models; sourced.ml is created; ML on Code research repository; Public Git Archive dataset; Gemini for Duplicate Code Detection; Assisted Code Review.
  • 41. An international Open Core company. 30+ employees worldwide; remote first with offices in San Francisco, Madrid and soon Seattle. Three pillars: For developers; By developers; Opinionatedly Free Spoken. Practitioners of Open Source, Open Science and Open Company philosophies. Experienced founders, senior team members and domain-expert advisors. $10 million funding from Xavier Niel, Otium, Sunstone Capital and others.
  • 42. Team: key people. Eiso Kant, Co-Founder & CEO: Programming since the age of 12, these days in Haskell & Go; Co-founder of Tyba (2011-2015); Founder of Twollars (2008-2009); Used to create software that automatically generated websites. Jorge Schnura, Co-Founder & COO: Responsible for operations, "never drops the ball"; Co-founder at Tyba (2011-2015); Previously in finance; Loves to automate boring things in Python. Máximo Cuadros, CTO: 5+ years as CTO, 15+ years experience; Self-taught, programming for more than 20 years (polyglot); Open source contributor to many projects (CoreOS, Terraform); Active member of the Golang community. Francesc Campoy, VP of Developer Community: Key Golang Developer Advocate at Google for the last 5 years; International software engineer with extensive experience in C++, developing at Google and Amadeus. Vadim Markovtsev, Lead Machine Learning Engineer: Creator of Samsung's Veles, a distributed machine learning platform (responsible for >90% of the code); Led Mail.ru anti-spam efforts; Former associate professor at the Moscow Institute of Physics and Technology. Victor Coisne, Head of Growth and Community: 5+ years as Head of Community at Docker; Open Source contributor to many developer advocacy, community education & engagement programs; Experienced partner enablement & relations manager (Microsoft, IBM, Digital Ocean, etc).
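
As a rough illustration of the "Ask your codebase" idea from slide 27, the question "What are our top 10 projects with the most developers working on them?" might be expressed as a single SQL aggregation. The sketch below is only an approximation under stated assumptions: the `commits` table, its `repository_id` and `commit_author_email` columns, and the sample rows are invented for illustration, and SQLite from the Python standard library stands in for the MySQL-compatible interface the slides describe.

```python
import sqlite3

# Hypothetical sample data: (repository, author email) pairs, one per commit.
# The table layout is an assumption for this sketch, not a confirmed schema.
ROWS = [
    ("engine", "ana@example.com"),
    ("engine", "bo@example.com"),
    ("engine", "ana@example.com"),
    ("lookout", "cy@example.com"),
    ("go-git", "ana@example.com"),
    ("go-git", "bo@example.com"),
    ("go-git", "cy@example.com"),
]

# Count distinct authors per repository and keep the ten largest.
TOP_PROJECTS_SQL = """
SELECT repository_id,
       COUNT(DISTINCT commit_author_email) AS developers
FROM commits
GROUP BY repository_id
ORDER BY developers DESC, repository_id ASC
LIMIT 10
"""


def top_projects(rows):
    """Load the rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE commits (repository_id TEXT, commit_author_email TEXT)"
    )
    conn.executemany("INSERT INTO commits VALUES (?, ?)", rows)
    result = conn.execute(TOP_PROJECTS_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for repo, developers in top_projects(ROWS):
        print(repo, developers)
```

Counting distinct e-mail addresses is a simplification: one developer committing under several addresses would be counted more than once, so a real report would probably need some identity matching first.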
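
The UAST check on slide 30 needs a running Babelfish server, so it is hard to try in isolation. A self-contained approximation of the same idea is sketched below, swapping in Python's built-in `ast` module for Universal ASTs. This is not the source{d} implementation: it only understands Python source, and the `SAMPLE` snippet and the `line` field in each finding are assumptions made for this example.

```python
import ast

SQL_COMMANDS = {"SELECT", "UPDATE", "DELETE", "INSERT", "CREATE", "ALTER", "DROP"}

# Hypothetical input: one suspicious concatenation, one harmless one.
SAMPLE = (
    'query = "SELECT * FROM users WHERE id = " + user_id\n'
    'label = "total: " + str(count)\n'
)


def check(source):
    """Flag `"<SQL keyword> ..." + something` concatenations in Python source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Only binary "+" expressions are of interest.
        if not (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add)):
            continue
        left = node.left
        # The left operand must be a string literal.
        if isinstance(left, ast.Constant) and isinstance(left.value, str):
            words = left.value.split()
            if words and words[0].upper() in SQL_COMMANDS:
                findings.append({
                    "msg": "Potential SQL injection vulnerability",
                    "line": left.lineno,
                })
    return findings


if __name__ == "__main__":
    for finding in check(SAMPLE):
        print(finding["line"], finding["msg"])
```

Like the slide's version, this is a heuristic: it misses queries built with string formatting or f-strings, and it flags concatenations whose right-hand side is actually a trusted constant.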