Visual ChatGPT is an AI assistant that can understand text and images, using tools to process images indirectly and reasoning to complete tasks like replacing objects in an image and applying image filters, while being strict about file names and tool outputs. It aims to have natural conversations and help with a wide range of visual tasks through invoking external tools in a thoughtful manner.
2023/06/01 IoT ALGYAN ChatGPT Study Group, 9th Session: Slides
5. Evolution of Language Models
Source: “A Survey of Large Language Models” (Zhao et al., 2023) https://arxiv.org/abs/2303.18223
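6. Emergent Abilities of LLMs
Emergence = something that appears suddenly
Unpredictable abilities that a language model suddenly begins to show once it exceeds a certain number of parameters
Source: “Emergent Abilities of Large Language Models” (Wei et al., 2022)
https://arxiv.org/abs/2206.07682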
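7. Language model (OpenAI GPT-3) and external tools
Thought: I need to check the invoice amount for company C
Action: GetInvoice[C]
Tool (invoice DB): look up the invoice amount for C: 20000
Thought: I need to check the invoice amount for company F
Action: GetInvoice[F]
Tool (invoice DB): look up the invoice amount for F: 4100
Thought: I need to compute the total of C and F
Action: Total[20000, 4100]
Tool (calculator): 20000 + 4100 = 24100
Thought: I need to check the invoice amount for company A
Action: GetInvoice[A]
Tool (invoice DB): look up the invoice amount for A: 2000
Thought: I need to check the invoice amount for company E
Action: GetInvoice[E]
Tool (invoice DB): look up the invoice amount for E: 1000
Thought: I need to compute the total of A and E
Action: Total[2000, 1000]
Tool (calculator): 2000 + 1000 = 3000
Thought: I need to compute the difference between the total for companies C and F and the total for companies A and E
Action: Diff[24100, 3000]
Tool (calculator): | 24100 - 3000 | = 21100
Thought: Therefore, the answer is 21100
Action: Finish[21100]
OpenAI GPT handles the thinking (it proposes what to do next) = Reasoning
The external tools (the function part) do the actual work = Acting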
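The trace above can be replayed with a small loop. The following is only a sketch under stated assumptions: `SCRIPT` is a hypothetical stand-in for the language model (it returns the same actions as the slide instead of calling GPT-3), and `INVOICE_DB`, `TOOLS`, and `react_loop` are illustrative names that do not come from the slides. The invoice amounts are the ones shown above.

```python
import re

# Invoice amounts from the slide; this dict stands in for the invoice DB.
INVOICE_DB = {"A": 2000, "C": 20000, "E": 1000, "F": 4100}

def get_invoice(company):
    return INVOICE_DB[company]

def total(a, b):
    return int(a) + int(b)

def diff(a, b):
    return abs(int(a) - int(b))

# Acting side: plain functions, looked up by the name used in the action.
TOOLS = {"GetInvoice": get_invoice, "Total": total, "Diff": diff}

# Reasoning side (hypothetical stand-in for the LLM): each entry receives the
# observations so far and returns the next action string.
SCRIPT = [
    lambda obs: "GetInvoice[C]",
    lambda obs: "GetInvoice[F]",
    lambda obs: f"Total[{obs[0]}, {obs[1]}]",
    lambda obs: "GetInvoice[A]",
    lambda obs: "GetInvoice[E]",
    lambda obs: f"Total[{obs[3]}, {obs[4]}]",
    lambda obs: f"Diff[{obs[2]}, {obs[5]}]",
    lambda obs: f"Finish[{obs[6]}]",
]

ACTION_RE = re.compile(r"^(\w+)\[(.*)\]$")

def react_loop(max_steps=10):
    observations = []
    for step in range(max_steps):
        action = SCRIPT[step](observations)
        name, raw_args = ACTION_RE.match(action).groups()
        args = [a.strip() for a in raw_args.split(",")]
        if name == "Finish":
            return int(args[0])
        # Run the tool and feed its result back as the next observation.
        observations.append(TOOLS[name](*args))
    raise RuntimeError("no Finish action within max_steps")

print(react_loop())
```

In a real system the scripted list would be replaced by a model call that sees the accumulated thoughts, actions, and observations in its prompt.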
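8. ReAct / MRKL Systems
• Uses the following tools
  • Search
  • Lookup
• Large improvement in answer accuracy (benchmark results)
• More advanced approaches also combine it with reinforcement learning, imitation learning, and similar methods
Source: “ReAct: Synergizing Reasoning and Acting in Language Models” (Yao et al., 2022)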
9. Prior knowledge vs Label mapping
(ICL experiments conducted by Google)
            | unrelated labels | flipped labels
With IF     | 1                | 2
Without IF  | 3                | 4
10. Prior knowledge vs Label mapping
(ICL experiments conducted by Google)
            | unrelated labels | flipped labels
With IF     | 1                | 2
Without IF  | 3                | 4
Source: “Larger language models do in-context learning differently” (Wei et al., 2023)
https://arxiv.org/abs/2303.03846
11. Prior knowledge vs Label mapping
(ICL experiments conducted by Google)
            | unrelated labels | flipped labels
With IF     | 1                | 2
Without IF  | 3                | 4
Source: “Larger language models do in-context learning differently” (Wei et al., 2023)
https://arxiv.org/abs/2303.03846
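To make the two label settings concrete, here is a minimal sketch of how few-shot prompts with flipped and unrelated labels could be built. The example sentences, the `foo`/`bar` labels, and the `build_prompt` helper are hypothetical illustrations, not material from the slides or the paper.

```python
# Hypothetical sentiment examples with their true labels.
EXAMPLES = [
    ("I loved this film", "positive"),
    ("The plot was dull", "negative"),
]

# Flipped labels contradict prior knowledge of what the label words mean.
FLIPPED = {"positive": "negative", "negative": "positive"}
# Unrelated labels carry no semantic hint about the task at all.
UNRELATED = {"positive": "foo", "negative": "bar"}

def build_prompt(examples, query, label_map):
    """Format in-context examples with remapped labels, then the query."""
    shots = [f"Input: {text}\nLabel: {label_map[label]}" for text, label in examples]
    shots.append(f"Input: {query}\nLabel:")
    return "\n\n".join(shots)

flipped_prompt = build_prompt(EXAMPLES, "A wonderful story", FLIPPED)
unrelated_prompt = build_prompt(EXAMPLES, "A wonderful story", UNRELATED)
print(flipped_prompt)
```

With the flipped prompt, a model that follows the in-context label mapping would answer "negative" for the query, while a model that leans on prior knowledge would answer "positive". With the unrelated prompt, only the in-context mapping can help.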
12. Symbol Tuning
Source: “Symbol tuning improves in-context learning in language models” (Wei et al., 2023)
https://arxiv.org/abs/2305.08298
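13. Emergent Abilities of LLMs
Emergence = something that appears suddenly
Unpredictable abilities that a language model suddenly begins to show once it exceeds a certain number of parameters
Source: “Emergent Abilities of Large Language Models” (Wei et al., 2022)
https://arxiv.org/abs/2206.07682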
14. Expectations for Multimodality
Source: “ImageBind: One Embedding Space To Bind Them All” (Girdhar et al., 2023)
https://arxiv.org/abs/2305.05665
15. TaskMatrix (Visual ChatGPT)
• Image processing is handled mainly by Visual Foundation Models on Hugging Face
• The language model (ChatGPT) is used for reasoning
16. Execution example
• Thought: Do I need to use a tool? Yes
• Action: Replace Something From The Photo
• Action Input: image/9bb5e03b.png, couch, desk
• Observation: image/5737_replace-something_9bb5e03b_9bb5e03b.png
• Thought: Do I need to use a tool? Yes
• Action: Instruct Image Using Text
• Action Input: image/5737_replace-something_9bb5e03b_9bb5e03b.png, make it like a water-color painting
• Observation: image/770e_pix2pix_5737_9bb5e03b.png
• Thought: Do I need to use a tool? No
• AI: Here is the image you requested.
• ![image/770e_pix2pix_5737_9bb5e03b.png](image/770e_pix2pix_5737_9bb5e03b.png)
“replace the sofa in this image with a desk and then make it like a water-color painting”
Action 1:
Replace Something From The Photo
“couch, desk”
Action 2:
Instruct Image Using Text
“make it like a water-color painting”
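The observation file names in this example appear to follow the pattern `<new id>_<tool tag>_<previous id>_<original id>.png`. The sketch below reproduces that pattern. It is inferred only from the names shown here and is not taken from the project's code. `next_image_name` and its parameters are hypothetical, and `new_id` is passed in explicitly so the example is reproducible.

```python
import posixpath
import uuid

def next_image_name(prev_path, tool_tag, new_id=None):
    """Derive the output file name for a tool applied to prev_path."""
    folder, filename = posixpath.split(prev_path)
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("_")
    prev_id = parts[0]       # id of the image the tool was applied to
    original_id = parts[-1]  # id of the very first image in the chain
    if new_id is None:
        new_id = uuid.uuid4().hex[:4]  # assumption: a short random id
    return posixpath.join(folder, f"{new_id}_{tool_tag}_{prev_id}_{original_id}.png")

step1 = next_image_name("image/9bb5e03b.png", "replace-something", "5737")
step2 = next_image_name(step1, "pix2pix", "770e")
print(step1)
print(step2)
```

Because the name carries both the previous and the original id, the chain of edits stays traceable from the file name alone.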
17.
18. Visual ChatGPT is designed to be able to assist with a wide range of text and visual related tasks,
from answering simple questions to providing in-depth explanations and discussions on a wide range of
topics. Visual ChatGPT is able to generate human-like text based on the input it receives, allowing it
to engage in natural-sounding conversations and provide responses that are coherent and relevant to
the topic at hand.
Visual ChatGPT is able to process and understand large amounts of text and images. As a language
model, Visual ChatGPT can not directly read images, but it has a list of tools to finish different
visual tasks. Each image will have a file name formed as "image/xxx.png", and Visual ChatGPT can
invoke different tools to indirectly understand pictures. When talking about images, Visual ChatGPT is
very strict to the file name and will never fabricate nonexistent files. When using tools to generate
new image files, Visual ChatGPT is also known that the image may not be the same as the user's demand,
and will use other visual question answering tools or description tools to observe the real image.
Visual ChatGPT is able to use tools in a sequence, and is loyal to the tool observation outputs rather
than faking the image content and image file name. It will remember to provide the file name from the
last tool observation, if a new image is generated.
Human may provide new figures to Visual ChatGPT with a description. The description helps Visual
ChatGPT to understand this image, but Visual ChatGPT should use tools to finish following tasks,
rather than directly imagine from the description.
Overall, Visual ChatGPT is a powerful visual dialogue assistant tool that can help with a wide range
of tasks and provide valuable insights and information on a wide range of topics.
TOOLS:
------
19. Visual ChatGPT has access to the following tools:
> Get Photo Description: useful when you want to know what is inside the photo. receives image_path as
input. The input to this tool should be a string, representing the image_path.
> Remove Something From The Photo: useful when you want to remove and object or something from the
photo from its description or location. The input to this tool should be a comma seperated string of
two, representing the image_path and the object need to be removed.
> Replace Something From The Photo: useful when you want to replace an object from the object
description or location with another object from its description. The input to this tool should be a
comma seperated string of three, representing the image_path, the object to be replaced, the object to
be replaced with
> Instruct Image Using Text: useful when you want to the style of the image to be like the text. like:
make it look like a painting. or make it like a robot. The input to this tool should be a comma
seperated string of two, representing the image_path and the text.
. . . . .
To use a tool, please use the following format:
```
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [Get Photo Description, Remove Something From The Photo,
Replace Something From The Photo, Instruct Image Using Text]
Action Input: the input to the action
Observation: the result of the action
```
When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the
format:
20. ```
Thought: Do I need to use a tool? No
AI: [your response here]
```
You are very strict to the filename correctness and will never fake a file name if it does not exist.
You will remember to provide the image file name loyally if it's provided in the last tool
observation.
Begin!
Previous conversation history:
Human: provide a figure named image/9bb5e03b.png. The description is: a living room with a couch and a
couch in the corner. This information helps you to understand this image, but you should use tools to
finish following tasks, rather than directly imagine from my description. If you understand, say
"Received".
AI: Received.
New input: replace the sofa in this image with a desk and then make it like a water-color painting
Since Visual ChatGPT is a text language model, Visual ChatGPT must use tools to observe images rather
than imagination.
The thoughts and observations are only visible for Visual ChatGPT, Visual ChatGPT should remember to
repeat important information in the final response for Human.
Thought: Do I need to use a tool? Yes
Action: Replace Something From The Photo
Action Input: image/9bb5e03b.png, couch, desk
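The output format required by the prompt is regular enough to parse with a few regular expressions. The following is a minimal sketch, assuming the model reply follows the format exactly. `parse_step` and `TOOL_ARITY` are hypothetical names, and the argument counts come from the tool descriptions in the prompt ("a comma seperated string of two" or "three"). The split is limited to the expected number of arguments so that a free-text instruction containing commas stays in one piece.

```python
import re

# Number of comma-separated inputs each tool expects, per the prompt text.
TOOL_ARITY = {
    "Get Photo Description": 1,
    "Remove Something From The Photo": 2,
    "Replace Something From The Photo": 3,
    "Instruct Image Using Text": 2,
}

def parse_step(reply):
    """Return ("tool", name, args) or ("final", text) for one model reply."""
    action = re.search(r"^Action:\s*(.+)$", reply, re.MULTILINE)
    action_input = re.search(r"^Action Input:\s*(.+)$", reply, re.MULTILINE)
    if action and action_input:
        name = action.group(1).strip()
        if name not in TOOL_ARITY:
            raise ValueError(f"unknown tool: {name}")
        raw = action_input.group(1)
        # Split at most arity - 1 times; the last argument keeps its commas.
        args = [a.strip() for a in raw.split(",", TOOL_ARITY[name] - 1)]
        return ("tool", name, args)
    final = re.search(r"^AI:\s*(.*)", reply, re.MULTILINE | re.DOTALL)
    if final:
        return ("final", final.group(1).strip())
    raise ValueError("reply does not follow the expected format")

tool_reply = (
    "Thought: Do I need to use a tool? Yes\n"
    "Action: Replace Something From The Photo\n"
    "Action Input: image/9bb5e03b.png, couch, desk"
)
final_reply = "Thought: Do I need to use a tool? No\nAI: Here is the image you requested."
print(parse_step(tool_reply))
print(parse_step(final_reply))
```

An agent loop would call the named tool with `args`, append the result as an `Observation:` line, and query the model again until a final `AI:` reply is parsed.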
21. could you generate a cat for me ?
could you replace a cat to a dog and then remove the book ?
could you generate a canny edge of this image ?
generate a yellow dog based on ......png
22. Visual Foundation Models
Get Photo Description
Generate Image From User Input Text
Remove Something From The Photo
Replace Something From The Photo
Instruct Image Using Text
Answer Question About The Image
Edge Detection On Image
Generate Image Condition On Canny Image
Line Detection On Image
Generate Image Condition On Line Image
Hed Detection On Image
Generate Image Condition On Soft Hed Boundary Image
Segmentation On Image
Generate Image Condition On Segmentations
Predict Depth On Image
Generate Image Condition On Depth
Predict Normal Map On Image
Generate Image Condition On Normal Map
Sketch Detection On Image
Generate Image Condition On Sketch Image
Pose Detection On Image
Generate Image Condition On Pose Image
24. Related Projects
• JARVIS (HuggingGPT) - Microsoft
https://github.com/microsoft/JARVIS
• LLM-Augmenter system - Microsoft
https://github.com/pengbaolin/LLM-Augmenter
• Transformers Agent - Hugging Face
https://huggingface.co/docs/transformers/transformers_agents
• GPT-4 + Stable-Diffusion - Berkeley AI Research
https://llm-grounded-diffusion.github.io/