GitHub Copilot is an AI assistant created by GitHub to provide code completions and suggestions to developers. It was trained on publicly available source code from repositories on GitHub. The suggestions are uniquely generated and belong to the developer, not to GitHub Copilot or its training data. GitHub Copilot collects usage data to improve the service but does not access or store users' private code. Future versions aim to prevent insecure or offensive outputs through improved filters and a more carefully curated training set.
21. Recommendations
How do I get the most out of GitHub Copilot?
It works best when you divide your code into small functions, use
meaningful names for functions and parameters, and write good
docstrings and comments as you go. It also seems to do best when it's
helping you navigate unfamiliar libraries or frameworks.
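For instance, a small, well-named function with a clear docstring gives the model far more to condition on than a bare one-liner. The function below is a hand-written illustration of that style, not actual Copilot output:

```python
import statistics

# Illustrative example of the recommended style: a short function whose
# name and docstring spell out the intent, giving an assistant like
# Copilot rich context to complete the body from.
def median_absolute_deviation(values):
    """Return the median absolute deviation (MAD) of a list of numbers.

    The MAD is the median of the absolute differences between each
    value and the median of all the values.
    """
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)
```
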
22. Training Set
What data has GitHub Copilot been trained on?
It has been trained on a selection of English language and source code from publicly
available sources, including code in public repositories on GitHub.
Why was GitHub Copilot trained on data from publicly available sources?
Training machine learning models on publicly available data is considered fair use
across the machine learning community. The models gain insight and accuracy from the
public collective intelligence.
Will GitHub Copilot help me write code for a new platform?
When a new platform or API is launched for the first time, developers are the least
familiar with it. There is also very little public code available that uses that API, and a
machine learning model is unlikely to generate the code without fine-tuning. In the
future, we will provide ways to highlight newer APIs and samples to raise their
relevance in GitHub Copilot’s suggestions.
23. Does GitHub Copilot recite code from the training set?
GitHub Copilot is a code synthesizer, not a search engine: the vast
majority of the code that it suggests is uniquely generated and has
never been seen before. We found that about 0.1% of the time, the
suggestion may contain some snippets that are verbatim from the
training set. Many of these cases happen when you don’t provide
sufficient context (in particular, when editing an empty file), or when
there is a common, perhaps even universal, solution to the problem.
We are building an origin tracker to help detect the rare instances of
code that is repeated from the training set, to help you make good real-
time decisions about GitHub Copilot’s suggestions.
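The origin tracker's internals have not been published, but the underlying idea (flagging long verbatim overlaps between a suggestion and training documents) can be sketched in a few lines. Everything below, including the length threshold, is a hypothetical illustration, not GitHub's implementation:

```python
# Toy sketch of verbatim-overlap detection, in the spirit of the origin
# tracker described above: flag a suggestion if it shares a long exact
# substring with any training document.  The 30-character threshold is
# an arbitrary illustrative choice.
def shares_long_substring(suggestion, corpus, min_len=30):
    """Return True if any length-min_len window of `suggestion`
    appears verbatim in one of the `corpus` strings."""
    for i in range(len(suggestion) - min_len + 1):
        window = suggestion[i : i + min_len]
        if any(window in doc for doc in corpus):
            return True
    return False
```

A production system would use indexing (e.g. suffix arrays or n-gram hashing) rather than this quadratic scan, but the decision it makes is the same.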
24. Who owns the code GitHub Copilot helps me write?
GitHub Copilot is a tool, like a compiler or a pen. The suggestions
GitHub Copilot generates, and the code you write with its help, belong
to you, and you are responsible for it. We recommend that you
carefully test, review, and vet the code, as you would with any code you
write yourself.
The code you create with GitHub Copilot’s help belongs to you. While
every friendly robot likes the occasional word of thanks, you are in no
way obligated to credit GitHub Copilot. Just like with a compiler, the
output of your use of GitHub Copilot belongs to you.
25. Privacy
Does GitHub Copilot ever output personal data?
Because GitHub Copilot was trained on publicly available code, its
training set included public personal data included in that code. From
our internal testing, we found it to be extremely rare that GitHub
Copilot suggestions included personal data verbatim from the training
set. In some cases, the model will suggest what appears to be personal
data – email addresses, phone numbers, access keys, etc. – but is
actually made-up information synthesized from patterns in training
data. For the technical preview, we have implemented a rudimentary
filter that blocks emails when shown in standard formats, but it’s still
possible to get the model to suggest this sort of content if you try hard
enough.
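A filter that "blocks emails when shown in standard formats" amounts to pattern matching on the model's output before it is displayed. The actual filter is not public; the sketch below is one plausible, deliberately rudimentary version:

```python
import re

# Hypothetical sketch of a rudimentary email filter; the real GitHub
# Copilot filter is not public.  Matches common "user@domain.tld"
# shapes and replaces them before a suggestion is shown.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

def redact_emails(suggestion: str) -> str:
    """Replace anything that looks like a standard-format email address."""
    return EMAIL_RE.sub("[redacted]", suggestion)
```

As the FAQ notes, a pattern-based filter like this is easy to defeat with non-standard formats ("alice at example dot com"), which is why it is described as rudimentary.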
26. What data is collected
• The GitHub Copilot extension collects activity from the user's Visual Studio Code editor, tied to a timestamp and
metadata. This metadata consists of the extension settings and the standard metadata collected by
the Visual Studio Code extension telemetry package:
• Visual Studio Code machine ID (pseudonymized identifier)
• Visual Studio Code session ID (pseudonymized identifier)
• Visual Studio Code version
• Geolocation from IP address (country, state/province and city, but not the IP address itself)
• Operating system and version
• Extension version
• The VS Code UI (web or desktop)
The activity collected consists of events that are triggered when:
• An error occurs (it records the error kind and relevant background; e.g. if it’s an authentication error the key
expiry date is recorded)
• Our models are accessed to ask for code suggestions (it records editor state like position of cursor and
snippets of code)—this includes cases when the user takes an action to request code suggestions
• Code suggestions are received or displayed (it records the suggestions, post-processing, and metadata like
model certainty and latency)
• Code suggestions are redacted due to filters that ensure AI safety
• The user acts on code suggestions (e.g. to accept or reject them)
• After the user has acted on code suggestions, whether and how those suggestions persisted in the code
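Putting the pieces above together, each telemetry event combines an event kind, a timestamp, pseudonymized identifiers, and metadata. The sketch below is purely illustrative; the real extension's schema and field names are not public:

```python
import json
import time
import uuid

def make_completion_event(kind, machine_id, session_id, extra):
    """Assemble a telemetry event of the general shape described above.

    All field names are illustrative; the actual Visual Studio Code
    extension telemetry schema is not public.
    """
    return {
        "event": kind,                # e.g. "suggestion_shown", "suggestion_accepted"
        "timestamp": time.time(),     # every event is tied to a timestamp
        "machine_id": machine_id,     # pseudonymized editor identifier
        "session_id": session_id,     # pseudonymized session identifier
        "metadata": extra,            # extension version, OS, latency, etc.
    }

event = make_completion_event(
    "suggestion_accepted",
    machine_id=str(uuid.uuid4()),
    session_id=str(uuid.uuid4()),
    extra={"extension_version": "1.0.0", "latency_ms": 120},
)
print(json.dumps(event, indent=2))
```
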
27. Responsible AI
Can GitHub Copilot introduce insecure code or offensive outputs in its suggestions?
There’s a lot of public code in the world with insecure coding patterns, bugs, or references
to outdated APIs or idioms. When GitHub Copilot synthesizes code suggestions based on
this data, it can also synthesize code that contains these undesirable patterns. This is
something we care a lot about at GitHub, and in recent years we’ve provided tools such as
Actions, Dependabot, and CodeQL to open source projects to help improve code quality.
Similarly, as GitHub Copilot improves, we will work to exclude insecure or low-quality code
from the training set. Of course, you should always use GitHub Copilot together with
testing practices and security tools, as well as your own judgment.
The technical preview includes filters to block offensive words and avoid synthesizing
suggestions in sensitive contexts. Due to the pre-release nature of the underlying
technology, GitHub Copilot may sometimes produce undesired outputs, including biased,
discriminatory, abusive, or offensive outputs. If you see offensive outputs, please report
them so that we can improve our safeguards. GitHub takes this challenge very seriously
and we are committed to addressing it with GitHub Copilot.
28. Responsible AI
How will advanced code generation tools like GitHub Copilot affect
developer jobs?
Bringing in more intelligent systems has the potential to bring
enormous change to the developer experience. We expect this
technology will enable existing engineers to be more productive,
reducing manual tasks and helping them focus on interesting work. We
also believe that GitHub Copilot has the potential to lower barriers to
entry, enabling more people to explore software development and join
the next generation of developers.
29. Reinforcement Learning
How is the data that GitHub Copilot collects used?
In order to generate suggestions, GitHub Copilot transmits part of the file you are
editing to the service. This context is used to synthesize suggestions for you. GitHub
Copilot also records whether the suggestions are accepted or rejected. This
telemetry is used to improve future versions of the AI system, so that GitHub
Copilot can make better suggestions for all users in the future.
We do not reference your private code when generating code for other users.
All data is transmitted and stored securely. Access to the telemetry is strictly limited
to individuals on a need-to-know basis. Inspection of the gathered source code will
be predominantly automatic, and when humans read it, it is specifically with the
aim of improving the model or detecting abuse.
30. Will there be a paid version?
If the technical preview is successful, our plan is to build a commercial
version of GitHub Copilot in the future. We want to use the preview to
learn how people use GitHub Copilot and what it takes to operate it at
scale.
For now, we’re focused on delivering the best experience in Visual
Studio Code only.
Editor's Notes
A new pair-programming tool powered by Artificial Intelligence:
GitHub Copilot is an AI pair programmer that helps you write code faster and with less work. GitHub Copilot draws context from comments and code, and suggests individual lines and whole functions instantly. GitHub Copilot is powered by OpenAI Codex, a new language-generating AI system created by OpenAI. It is the descendant of GPT-3 but narrowly focused on code generation.
It is the first taste of Microsoft’s $1 billion investment into OpenAI — a research and development company specializing in AGI (Artificial General Intelligence).
The GitHub Copilot technical preview is available as a Visual Studio Code extension.
Copilot isn't the first of its kind; Tabnine and Kite are two popular tools that cover the same functionality. However, Copilot stands out because it's powered by Codex, a descendant of GPT-3 that provides a deeper understanding of context than other assistants. Codex is similar to GPT-3, with an important distinction: it has been trained on huge amounts of publicly available code, from GitHub repositories and other sites.
The power of language models
GPT-2, GPT-3, the Switch Transformer, LaMDA, MUM, Wu Dao 2.0… Pre-trained language models are the hottest thing in AI right now, and every major player in the field is fighting to get the biggest portion. The reason is that these models work extremely well. GPT-3 is the most popular, and rightly so. When OpenAI released the GPT-3 API, it allowed the world to take a peek at its power, and the effect was immediate. GPT-3 was so powerful and performed so well that people even dared to call it AGI — which it isn't.
However, because GPT-3 is a general-purpose language model, it’s fair to assume it could improve its performance in specific tasks if prepared adequately to do so. Baseline GPT-3 could be portrayed as a jack of all trades, whereas its specialized versions would be masters of their task. DALL·E was the first hint of this possibility. OpenAI trained the system on text-image pairs, which allowed DALL·E to generate images from captions and much more, excelling at visual creativity. GPT-J, a GPT-3-like system 30x smaller, could generate better code because it was trained heavily on GitHub and StackExchange data. PALMS, a method to reduce biases on GPT-3, further reinforced this hypothesis. Researchers improved GPT-3 behavior significantly by fine-tuning it on a small curated dataset.
Now Copilot, built on Codex, has definitively proved this idea: language models can be specialized to become masters of their trade. Codex is a programming master; what other areas could be mastered by a super-powerful specialized language model? Could they master other formal languages such as logic or math? Could they master each and every capability found in GPT-3? Could a language model write fiction at the level of Shakespeare, Cervantes, or Tolstoy? Could it compose the best hits of the summer? What's possible isn't clear. What is clear is that we haven't yet found their limits.
Skip the docs and stop searching for examples. GitHub Copilot helps you stay focused right in your editor.
It works best with Python, JavaScript, TypeScript, Ruby, and Go, but it "understands dozens of languages." However, because it sometimes fails, it may be most useful when exploring new libraries or writing in an unfamiliar programming language. Otherwise, it could take more time to review Copilot's code than to write it yourself.
It has been developed in collaboration with OpenAI and leverages Codex, a new artificial intelligence system trained on publicly available source code and natural language, with the goal of interpreting the comments and code written by a user and generating associated code fragments.
It is coming soon to Visual Studio Code. It basically draws context from the code you’re working on, suggesting whole lines or entire functions.
It does not yet use other files in your project as inputs for synthesis. This means that, for example, copy/pasting a type declaration into the file you’re working on may improve suggestions from GitHub Copilot. This is something we will improve in the future.
"It helps you quickly discover alternative ways to solve problems, write tests, and explore new APIs without having to tediously tailor a search for answers on the internet," said Nat Friedman, CEO of GitHub.
One thing GitHub Copilot promises is to make coding, and software in general, more accessible.
OpenAI Codex has a broad knowledge of how people use code and is significantly more capable than GPT-3 in code generation
The quality of the suggestions also depends on the existing code. Copilot can only use the current file as context. It cannot test its own code, meaning the code may not even run or compile.
The tool can generate up to 10 alternatives for a single suggestion. The algorithm continuously improves by recording whether each suggestion is accepted or rejected.
We recently benchmarked against a set of Python functions that have good test coverage in open source repos. We blanked out the function bodies and asked GitHub Copilot to fill them in. The model got this right 43% of the time on the first try, and 57% of the time when allowed 10 attempts. And it’s getting smarter all the time.
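The evaluation above is a pass@k-style measurement: a problem counts as solved if any of the first k sampled completions passes its tests. The sketch below assumes per-problem lists of pass/fail results; it is a simplified illustration, not GitHub's actual benchmark harness:

```python
def pass_at_k(attempts_per_problem, k):
    """Fraction of problems solved within the first k attempts.

    attempts_per_problem: one list of booleans per benchmark function,
    True where a sampled completion passed that function's tests.
    Simplified sketch of the evaluation described above.
    """
    solved = sum(1 for attempts in attempts_per_problem if any(attempts[:k]))
    return solved / len(attempts_per_problem)
```

Under this framing, the quoted numbers correspond to pass@1 = 0.43 and pass@10 = 0.57 over the benchmarked Python functions.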
GitHub Copilot doesn’t actually test the code it suggests, so the code may not even compile or run. GitHub Copilot can only hold a very limited context, so even single source files longer than a few hundred lines are clipped and only the immediately preceding context is used. And GitHub Copilot may suggest old or deprecated uses of libraries and languages. You can use the code anywhere, but you do so at your own risk.
Does GitHub Copilot write perfect code?
No. GitHub Copilot tries to understand your intent and to generate the best code it can, but the code it suggests may not always work, or even make sense. While we are working hard to make GitHub Copilot better, code suggested by GitHub Copilot should be carefully tested, reviewed, and vetted, like any other code. As the developer, you are always in charge.
It can suggest complete lines of code or entire functions by analyzing how you code. GitHub Copilot can assemble code from user comments and predict your code just from the function name you have declared. It allows you to cycle through alternative suggestions and manually edit the suggested code. It autofills repetitive code and can create unit tests for your methods.
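In practice, the prompt that drives this is simply a descriptive comment plus a function name. The body below is a hand-written illustration of the kind of completion one might accept, not actual Copilot output:

```python
# Prompt style that tends to work well with Copilot: a descriptive
# comment followed by a meaningful function name.  The body here is a
# hand-written example of a plausible completion, not real model output.

# Parse "key=value" pairs separated by semicolons into a dict.
def parse_pairs(text):
    result = {}
    for chunk in text.split(";"):
        if "=" in chunk:
            key, value = chunk.split("=", 1)
            result[key.strip()] = value.strip()
    return result
```
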
The GitHub Copilot editor extension sends your comments and code to the GitHub Copilot service, which then uses OpenAI Codex to synthesize and suggest code. Having been trained on open-source code from public GitHub repositories worldwide, it tries to produce the best possible code for the given context. It is said to work great with repetitive code patterns, so users can let it generate the rest of the code. The AI assistant can also help you learn a new programming language.
OpenAI Codex was trained on publicly available source code and natural language, so it understands both programming and human languages. The GitHub Copilot editor extension sends your comments and code to the GitHub Copilot service, which then uses OpenAI Codex to synthesize and suggest individual lines and whole functions.
The best way to contribute is to sign up for the technical preview. By using GitHub Copilot and sharing your feedback, you help to improve the models that power GitHub Copilot.
GitHub states that the current preview phase is restricted because of the “state-of-the-art AI hardware” required for the project. Once the free preview phase is over, the company plans to build a commercial version, which should be “available as broadly as possible”.
For now, GitHub Copilot will only be made available to use in Visual Studio Code and the access is limited to a small group of testers.
The preview version is available for free and has already been installed by 88,783 coders.
The technical preview works well for Python, JavaScript, TypeScript, Ruby and Go but the final version will work with a broader set of frameworks and languages.
It will be used to learn how people use GitHub Copilot and what it takes to operate it at scale.
GitHub Copilot tries to understand your intent and generate the best code it can, but the code it suggests may not always work, or even make sense.
GitHub Copilot is now available as an extension for Microsoft's cross-platform code editor, Visual Studio Code, both locally on the machine and in the cloud in GitHub Codespaces.
This is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.
A good analogy I can think of is Google Translate. It’s been there for years but it has not replaced the need for an actual translator. You could translate an article from English to Japanese in a few seconds. However, you would still need somebody fluent in both languages to ensure that the translation is grammatically correct and it delivers the same message as the original article.
The only way is ethics
Since Copilot will be a paid-for product, there is an ethical as well as a legal debate. That said, open source advocate Simon Phipps has said: "I'm hearing alarmed tech folk but no particularly alarmed lawyers. Consensus seems to be that training the model and using the model are to be analysed separately, that training is Just Fine and that using is unlikely to involve a copyright controlled act so licensing is moot."
Michael Wolfe, a copyright lawyer at the firm Rosen, Wolfe & Hwang, told El Reg: "It looks like [GitHub Copilot] is likely to fall under fair use and it's unlikely software licenses can get around that." US courts are generally favorable toward defendants if the copyrighted work has been used in a way that is considered "transformative."
Wolfe said GitHub Copilot would probably qualify as being transformative since it repurposed people’s code and used it for a different application compared to its original intent, whether it’s a program to create a game or a complex encryption algorithm.
"GitHub hasn’t used the code in a way to make it subject to software licenses," he said. "I think it’s very likely that its purpose is different from the applications that they’re using. It’s doing something distinct."
Was able to auto-complete my comment as well. Initially, I never intended to save it to a JSON file.
Added some in-line code comments
Used an external library (requests) to make the request. Used the json library to save the data.
Chose a decent file name on its own
Was able to find a data source, surprisingly someone’s GitHub repo
Was able to import all the necessary libraries. Although it did import shutil and never used it.
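A reconstruction of the kind of script described above (fetch a JSON dataset over HTTP and save it under a sensible file name). The URL and file name are placeholders, not the ones Copilot actually chose:

```python
import json

# Reconstruction of the experiment described above; names are
# placeholders, not actual Copilot output.

def save_json(data, filename="pokemon_data.json"):
    """Write the fetched data to disk as formatted JSON."""
    with open(filename, "w") as f:
        json.dump(data, f, indent=2)
    return filename

def fetch_json(url):
    """Download a JSON document (needs the external `requests` library)."""
    import requests
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()
```
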
Used zipfile library to unzip/zip
For the second function, it didn’t import the zipfile library.
Was able to use the right parameters
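The zip/unzip helpers described above might look like the sketch below (hand-written, with the `zipfile` import that the second Copilot-generated function reportedly left out):

```python
import zipfile

# Sketch of the zip/unzip helpers described above.  Note that the
# original Copilot output reportedly forgot this import in the second
# function; here it is shared at module level.

def zip_file(src, archive):
    """Compress a single file into a zip archive."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(src)
    return archive

def unzip_file(archive, dest="."):
    """Extract every member of a zip archive into dest."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)

# Quick demonstration: round-trip a small text file.
with open("hello.txt", "w") as f:
    f.write("hi")
zip_file("hello.txt", "demo.zip")
unzip_file("demo.zip", "extracted")
```
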
Generated 64 lines of code
Was able to write functions for various purposes
Knew the winning combinations of a tictactoe board
Added Error Handling
Added print statements and ability to take input from the user
Logic to check the result of the game
Although it wrote all the sub-functions, it never invoked them to actually build a playable game
Added a parameter
Used a crypto api to get the data
Was able to return the correct column for the price
For this, I had to write multiple comments and it actually felt like I was pair-programming with the Copilot. However, most of the code was generated by Copilot.
Was able to get data from github api
Since I mentioned popular, it sorted the repos based on ‘stars’. It is mind-boggling how it was able to relate ‘stars’ to popularity.
It was able to use an external library streamlit (streamlit is used to build web apps)
It also added a title and text to be displayed in the web app
It re-used the previously created function
For most parts, it was auto-completing my comments as well
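The "popular means stars" mapping corresponds to a real parameter of the GitHub search API (`sort=stars`). A minimal reconstruction of that step, with illustrative function and parameter names:

```python
# Sketch of the "popular repos" step described above.  The GitHub
# search API really does accept sort=stars; the function name and
# defaults here are illustrative.
def popular_repos_url(query, per_page=10):
    """Build a GitHub search-API URL that sorts results by star count."""
    return (
        "https://api.github.com/search/repositories"
        f"?q={query}&sort=stars&order=desc&per_page={per_page}"
    )
```

Fetching this URL (e.g. with `requests`) and feeding the top results into a streamlit table reproduces the small web app described above.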
General Observations
The variable and function names are pretty explanatory
Added relevant in-line code comments
Was able to use external libraries
Was able to get data from various data sources
The format of the code was clean with proper indentation and line breaks
It took me quite a few tries (trying out different comments) to get it to actually use streamlit and build a simple app. Eventually I imported the library myself, and it started generating code using it. However, sometimes it was able to generate code on its own as well.
When trying to get the Pokémon/crypto data, it often made suggestions that used Beautiful Soup to scrape the data. Web scraping is not always the best option and, in some cases, you might even end up breaking some laws.