2. Agenda
Copilot – what Microsoft is building, what it looks like to build Copilot at Microsoft, and what is done in the Serbia dev center. Somewhat unrelated to the rest of the talk.
Types of user interfaces – a look at how people interact with software today.
How products will be impacted – and how you can add value to your company or business, with examples of hypothetical future products.
3. Copilot
ChatGPT covers general inquiries drawing on general knowledge, but it lacks proprietary context.
Copilot aims to provide that necessary context to LLMs – at least the context that exists in the Microsoft ecosystem: your and your company's documents, e-mails, databases, and anything else you have access to.
6. Copilot in Word – Made in Serbia
Transforms the writing process to make you more creative and efficient. Now you can:
• Create a summary of any document to share as a recap or to quickly get up to speed.
• Rewrite a paragraph, or save time on formatting by asking Copilot to generate a table from your copy.
• Create custom graphics right in the document with Microsoft Designer, which will pull from stock images or your own uploads in the chat.
• And much more (video on the next slide).
7. Microsoft 365 Copilot basic architecture
[Diagram: a six-step data flow inside the Microsoft 365 Service Boundary and the Customer Microsoft 365 Tenant, connecting Microsoft 365 Apps, the Semantic Index, and Azure OpenAI, with RAI checks.]
Data flow (all requests are encrypted via HTTPS and wss://). Step 1: user prompts from Microsoft 365 Apps are sent to Copilot.
The Azure OpenAI instance is maintained by Microsoft; OpenAI has no access to the data or the model.
RAI is performed on the input prompt and the output results.
Prompts, responses, and data accessed through Microsoft Graph aren't used to train foundation models.
9. What is it like to work on Copilot
Prompt engineering
• Super-complex prompts using state-of-the-art prompting techniques. The main issue from a quality perspective is hallucination.
• Building systems for automatic evaluation of prompts (something like regression tests for prompt changes; a sketch follows below).
• Manual evaluation of outputs.
AI engineering
• Building and improving agents with iterative planning.
• Fine-tuning smaller models (e.g. gpt-3.5-turbo, open-source models).
Safety
• Responsible AI – LLMs can cause serious damage. We need to make sure people are not able to abuse the vast knowledge behind these models, while keeping the block rate low.
• Privacy, Compliance, Legal – this always comes first. It slows development quite a bit, but it is necessary for Microsoft's business model.
• Prompt injection – could be part of either RAI or Privacy, but it is such a huge effort that it deserves its own bullet point. As the scope of LLM connectors to various data sources grows, prompt injection becomes a major security issue.
10. What is it like to work on Copilot
(Same slide as above, with "Safety" relabeled "Bureaucracy.")
WE’RE HIRING
(aka.ms/careers)
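One way to picture the "regression tests for prompt changes" bullet is the minimal Python sketch below. The template, the eval cases, and the phrase-matching check are all hypothetical illustrations; a real evaluation pipeline is far more elaborate (LLM-based graders, large labeled sets, dedicated hallucination checks).

PROMPT_TEMPLATE = "Summarize the following e-mail in one sentence:\n{doc}"

# (input document, phrases the output is expected to contain)
EVAL_CASES = [
    ("Q3 revenue grew 12%, driven by cloud sales.", ["revenue", "cloud"]),
    ("The review meeting moved from Monday to Wednesday.", ["Wednesday"]),
]

def evaluate_prompt(call_llm) -> float:
    """Run a fixed suite through the current prompt and return the pass rate.
    call_llm is any function str -> str (e.g. a thin Azure OpenAI client)."""
    passed = 0
    for doc, expected in EVAL_CASES:
        output = call_llm(PROMPT_TEMPLATE.format(doc=doc)).lower()
        passed += all(p.lower() in output for p in expected)
    return passed / len(EVAL_CASES)

# Gate prompt changes like any other regression suite:
# assert evaluate_prompt(my_llm_client) >= 0.95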
11. Current types of user experiences
On to the main topic of the talk.
To understand how generative AI will change the products we are building, we first need to understand how products are built today.
12. Current types of user experiences
(one of the many ways to skin a cat)
Simple Task-Based Applications – intuitive, simple, limited UIs. The likes of Instagram, TikTok, FaceApp, etc.
Search-and-Select Interfaces – highly visual by nature. The likes of Amazon, AliExpress, and other e-commerce platforms.
Complex System-Operation Interfaces – complex interfaces for complex software solutions: Word, Photoshop, SAP, etc.
13. Search-and-Select Interfaces
Still mostly consumer products – but they solve the specific problem of shopping, where a large stock is an advantage, so they can afford to be more complex.
Intuitiveness and relevance of search results are crucial in these UIs. Good filtering is a huge competitive advantage, and so are good visuals.
Complex online documentation (e.g. API docs) and web presentations are also part of this group.
14. Simple Task-Based Applications
TikTok, Instagram, FaceApp, Twitter – consumer products.
Outside of work, people try to minimize cognitive load. People don't want options: they are ready to trade flexibility for simplicity.
Hence modern app UIs – simple, highly repeatable interactions with almost no customization options.
15. Complex System-Operation Interfaces
Professional software requires heavy customization capabilities. That means a LOT of different functionalities need to be built in, which in turn means very complex interfaces. Examples: ERP systems, Excel, Photoshop.
Any intent (e.g. "remove the bird from a photo") implies a set of complex actions to fulfill it.
Expertise in these UIs is a market commodity.
16. New types of interactions
Chat (for Search-and-Select Interfaces) – an old UI with revolutionary new capabilities.
Voice (for Simple Task-Based Applications) – the new generations and the fall of typing.
Adaptive UIs (for Complex System-Operation Interfaces) – democratization of expertise.
Vision – what software can do when it has a sense of sight.
17. Chat
Most useful for search-and-select interfaces, as a replacement for complex search or live support.
Standard RAG: today, you can encode the whole content of your documentation/website (as well as internal documentation that isn't publicly visible), put an LLM on top of it, and voilà – you have an automated chat covering >90% of search and support inquiries at a fraction of the cost.
(The system doesn't have to cover everything. It should just know enough to handle the majority of user inquiries, and it needs to know when it doesn't know the answer, so it can direct the user to other sources, e.g. support.)
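A minimal sketch of the "standard RAG" setup described above: embed the documentation once, then answer each question from the top-matching chunks. The names here (Doc, embed, llm) are hypothetical placeholders; a real system would use an embedding model and a vector store rather than a brute-force cosine search.

from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    embedding: list[float]

def embed(text: str) -> list[float]:
    """Placeholder for a real embedding model call."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def answer(question: str, docs: list[Doc], llm) -> str:
    # 1. Retrieve: rank documentation chunks by similarity to the question.
    q = embed(question)
    top = sorted(docs, key=lambda d: cosine(q, d.embedding), reverse=True)[:3]
    context = "\n---\n".join(d.text for d in top)
    # 2. Generate: instruct the model to refuse rather than guess, so that
    #    unknown questions can be routed to human support.
    prompt = (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, reply exactly: ESCALATE_TO_SUPPORT.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)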
18. The rise of voice and the decline of typing
[Chart: frequency of sending voice messages among mobile users by age group (UK, May 2023).]
Consumers are changing their preferred input modality, increasingly favoring voice over typing. Seven billion voice messages are sent daily on WhatsApp alone.
Whisper by OpenAI makes it easy to transcribe any verbal request in >90 languages – though it still requires a human check.
Most useful for mobile apps, e.g. for simple task-based applications to expand their flexibility.
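As a sketch of the voice-to-text step such an app would run on a voice message, here is transcription with OpenAI's open-source Whisper package (requires pip install openai-whisper and ffmpeg on the system path; the file name is illustrative).

import whisper

model = whisper.load_model("base")               # small multilingual checkpoint
result = model.transcribe("voice_message.ogg")   # language is auto-detected
print(result["language"], result["text"])
# As noted above, the transcript may still need a human check before it is
# used as input to any downstream action.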
19. Adaptive UIs
How do we significantly lower the level of expertise needed for complex system-operation software (like Excel), while enhancing its capabilities? Using agents.
Let's rebuild Photoshop using this approach.
[Diagram: a very, very high-level representation of agents.]
20–28. Adaptive UIs – dog-removal walkthrough
(The same slide, built up step by step.)
User: "Remove the dog from the photo."
The agent drafts a plan:
1. Run object detection for "dog" (Grounding DINO).
2. Run semantic segmentation within the detected object (SAM).
3. Create a mask based on the segment and expand it by 5%.
4. Run an inpainting mechanism using Stable Diffusion v1.5.
Agent: "Selected the dog. Please verify the selection." → User: Apply. → Agent: Done.
Agent: "Segmented the dog. Please verify the segment." → User: Done.
Agent: "I have generated the final picture without the dog. Hope you like it."
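A very high-level Python sketch of the agent behind this walkthrough: an LLM turns the user's intent into the plan above, each step wraps a vision model, and the UI pauses for human verification between steps. Every function here is a hypothetical placeholder, not a real API.

def detect_object(image, label): ...    # e.g. Grounding DINO -> bounding box
def segment_object(image, box): ...     # e.g. SAM -> pixel mask of the object
def expand_mask(mask, pct=0.05): ...    # grow the mask ~5% for clean edges
def inpaint(image, mask): ...           # e.g. Stable Diffusion v1.5 inpainting

def verify(step: str, preview) -> bool:
    """UI hook: show the intermediate result and wait for the user's
    'Apply'/'Done' confirmation."""
    ...

def remove_object(image, label):        # intent: "Remove the dog from the photo"
    box = detect_object(image, label)
    if not verify("selection", box):    # human in the loop at every step
        return image
    mask = expand_mask(segment_object(image, box))
    if not verify("segment", mask):
        return image
    return inpaint(image, mask)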
29. Adaptive UIs
This approach can be used for any complex software with a number of hidden and/or complex capabilities. It is also a way to reduce the cost of UI "real estate" – you can show only the capabilities relevant to the user at that specific moment.
Or the software could simply perform the tasks automatically (though this is not advised – it's best to always keep a human in the loop).
32. Vision
GPT-4V and other multi-modal generative models (like LLaVA) are going to change the way people interact with software.
As more and more products adopt visual input (screenshots, doodles, or just style references), users' expectations are going to change:
• Why would I type in one product if I can just paste a screenshot into that other product?
• Why would I retype content into a company template when I can just post an image of the reference document and text?
And then there's AR/VR in combination with these models – we have yet to see where that takes us.
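A sketch of the "paste a screenshot instead of typing" interaction using a vision-capable chat model, shown with the OpenAI Python SDK. The model name and the file are assumptions; substitute whatever vision-capable deployment you have.

import base64
from openai import OpenAI

client = OpenAI()

# Encode the screenshot as a base64 data URL for the API.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the table from this screenshot as CSV."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)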
Microsoft 365 Copilot is your AI assistant at work.
The most important thing about Copilot is that you’re always in control.
You decide what to keep, modify, or discard.
Let’s take a look at what Copilot can do for you.
<< Click to play video >>
Copilot in Word transforms every part of the creative process. Coming soon with Designer integration in Word, you can effortlessly incorporate custom graphics into your document: Copilot uses the context of your content to propose stock visuals in the chat, and you can upload and customize your own images for a more personal touch.
Designer is the latest addition to our family of Microsoft 365 Consumer apps. Designer is in preview and available in English only.
Copilot receives an input prompt from a user in an app, like Word or PowerPoint.
Copilot then pre-processes the prompt through an approach called grounding, which improves the specificity of the prompt, ensuring that you get answers that are relevant and actionable to your specific task. It does this, in part, by making a call to Microsoft Graph and accessing your organization’s data. Data used by Copilot for an authenticated user is scoped to the documents and data that are already visible to them through existing Microsoft 365 role-based access controls.
This retrieval of information is referred to as retrieval-augmented generation. It allows Copilot to provide exactly the right type of information as input to an LLM, combining this user data with other inputs such as information retrieved from knowledge base articles to improve the prompt. Copilot takes the response from the LLM and post-processes it. This post-processing includes additional grounding calls to Microsoft Graph, responsible AI checks, security, compliance and privacy reviews, and command generation.
Copilot returns a recommended response to the user, and commands back to the apps where a user can review and assess the suggested response. Copilot iteratively processes and orchestrates these sophisticated services to produce results that are relevant to your business because they are contextually based on your organization’s data.
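The steps the notes above describe can be condensed into a conceptual Python sketch of the orchestration loop. All function names here are hypothetical stand-ins for internal Copilot services, not a real API.

def ground(text, user_token):
    """Grounding: retrieve org data via Microsoft Graph / the semantic index,
    scoped to what this user can already see (role-based access controls)."""
    ...

def rai_and_compliance_checks(text):
    """Responsible AI, security, compliance, and privacy reviews."""
    ...

def call_llm(prompt):
    """The Microsoft-maintained Azure OpenAI instance."""
    ...

def copilot_turn(user_prompt, user_token):
    # 1. The prompt arrives from a Microsoft 365 app (Word, PowerPoint, ...).
    rai_and_compliance_checks(user_prompt)           # RAI on the input
    context = ground(user_prompt, user_token)        # 2. pre-process / ground
    draft = call_llm(f"{context}\n\n{user_prompt}")  # 3. retrieval-augmented call
    rai_and_compliance_checks(draft)                 # 4. post-process, incl.
    extra = ground(draft, user_token)                #    additional grounding
    return draft                                     # 5. recommended response and
                                                     #    commands go back to the
                                                     #    app for the user's review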