This document describes research into using pen and voice input for drawing system configurations. It presents an approach called "TalkingDraw" that allows users to draw diagrams and insert elements while speaking to describe configurations. Two experiments tested different techniques for switching between drawing freehand and having inputs be recognized as commands. The "pigtail" technique, where users draw a curl at the end of a stroke, performed best as it allowed natural drawing and talking without disrupting either. The document discusses opportunities to improve gesture recognition and address concurrent voice and pen inputs.
1. Drawing in Talking:
Using Pen and Voice for Drawing System Configuration
Figures in Talking
Research and Technology Department
Xingya Xu (xingya.xu@fujixerox.co.jp)
December 8, 2017
IDW/AD’ 17
December 6-8
Sendai, Japan
INP7/UXC6 - 2
Fuji Xerox Co., Ltd.
2. Drawing in talking vs Making in advance
(Figure: a hand-drawn configuration with Shop, Server, Database, and Cloud)
Drawing by hand
• Quick and easy
• Interact with listeners actively
Making in advance
• Neat and precise
• Well-designed icons and graphs
3. Purpose
How to draw system configuration figures easily and quickly?
• Support drawing quickly and easily
How to draw system configurations while talking in real time?
• Support drawing in talking
(Figure: the example configuration with Shop, Server, Database, and Cloud)
4. To draw quickly and easily
Approach 1: Multimodal input
Make use of different input modalities such as touch, pen, and speech in an integrated manner.
The strength of pen
• Talking or thinking during drawing
• Expresses the position and shape of objects
The strength of voice
• Expresses linguistic information
(Figure: examples. a) a circle plus "PC" inserts an icon; b) a line plus "smartphone" inserts text)
5. Previous Research
Put-That-There
A user sitting in a chair can move an object by pointing to it and saying "move that there".
Bolt, R.A. Put-that-there: Voice and gesture at the graphics interface. ACM Computer Graphics 14, 3 (1980), 262–270.
6. Previous Research
The problem of Put-That-There
In the talking-and-drawing case, voice has two roles:
• to convey messages to the listeners
• to issue commands to the system
This causes unintentional system behaviors when the speaker talks to the listeners.
(Diagram: the speaker's voice carries messages to the listeners and commands to the system)
7. To draw in talking smoothly and naturally
Approach 2: Free mode and command mode
• In the free mode, pen or speech input is not considered a command.
• In the command mode, inputs are considered part of a command.
Smooth mode switching: switch between the free mode and the command mode smoothly, without disturbing talking.
8. Approach 2: Mode switch techniques
• Button: a basic technique. Press a button before and after drawing (start: click; end: click).
• Tap: tap the panel before drawing (start: tap; end: automatic). No need to specify the end, but the hand-holding posture must change.
• Pen-holding: hold the pen still for a while before drawing (start: hold the pen; end: automatic). No need to change the hand-holding posture.
• Pigtail: draw a pigtail at the end of the stroke (start: automatic; end: pigtail).
(Figure: pigtail gesture examples)
9. System implementation
Design
TalkingDraw: a prototype system written in C# on a Surface Pro 3 with a Surface Pen.
Speech recognition
• Recognizes users' speech during the command mode
Pen stroke recognition
• Recognizes the shape of users' pen strokes when the command ends
• $P Point-Cloud Recognizer (R.D. Vatavu et al., 2012)
(Timeline figure: talking and drawing overlap; only voice during the command is recognized; a 0.5 s delay marks the end)
The command is automatically ended if no pen or voice input is detected within a 0.5 s break.
10. Elements of system configurations
Design
Shape of a stroke plus text of voice determines the behavior of TalkingDraw:
• Circle + "PC": input an icon whose name is "PC"
• Rectangle + "cloud": show "cloud" in a text box
• Line + "smartphone": show "smartphone" as simple text
• Line connecting two objects, no voice: make a link between the two objects
12. Experiment 1
Participants: 16 people (12 males and 4 females, average age 48.1)
Scenario: TalkingDraw used as a drawing tool while talking.
Task: participants must speak a given sentence and insert icons while speaking (a talking-in-drawing task).
Example task sentence (Practice 1): "Let's download the 'document' from the 'net'."
(Figure: the task sentence and the icons to be inserted)
13. Experiment 1: Results
Task completion time
• One-way ANOVA: the main effect of technique was significant (F(3,45)=6.39, p<.01).
• Tukey's method: Pigtail = Tap << Pen-holding = Button
Interview
• Pigtail was comfortable, even though the accuracy of its gesture recognition drew complaints.
• Pressing the button twice was a pain.
• It is hard to hold the pen still on the screen.
(Chart (a): task completion time in seconds for Button, Tap, Pen-holding, and Pigtail)
14. Experiment 2
Participants: 16 people (12 males and 4 females, average age 48.1)
Scenario: TalkingDraw used as a drawing tool for system configuration figures.
Task: participants must draw a given figure (a making-in-advance task).
(Example figure: a "mobile phone" uploads "photos" to a "database")
15. Experiment 2: Results
Task completion time
• One-way ANOVA: the main effect of technique was significant (F(3,45)=5.22, p<.004).
• Tukey's method: Tap = Pigtail = Pen-holding < Button
Interview
• There is no big difference between techniques.
• Button was more comfortable than in Experiment 1.
(Chart (b): task completion time in seconds for Button, Tap, Pen-holding, and Pigtail)
16. Discussion
Pigtail performed best in Experiment 1
• It specifies the command mode after actions.
• The accuracy of pen gesture recognition needs improvement.
No big difference in Experiment 2
• Participants don't need to think while drawing a figure.
• Techniques that specify the command mode before actions perform better than in Experiment 1.
17. Future work
The accuracy of Pigtail recognition
• More samples
• Normalization
The accuracy of speech recognition
• Google Cloud speech recognition
Context sensitivity
• The voice input and the drawing are not concurrent
• Timestamps
• Semantic analysis
(Timeline figure: voice and pen input over the command duration, with key content and noise)
The left one is a figure I drew by hand in five seconds to explain a fictional cloud service. We usually draw many such rough figures in discussions, brainstorming, and so on. The advantages of drawing by hand are…. The right one is a figure I made in PowerPoint. Compared to the left one, it is neat and precise. I can also use well-designed icons and graphs.
So how can we draw quickly and easily? Our approach is to use multimodal input, which means that…. The strength of pen is that we can talk or think during drawing, and pen is good at expressing the position and shape of objects. The strength of voice is expressing linguistic information; we can talk much faster than we can draw or write.
For example, I draw a circle and say something like "Here is a PC". Then the system can get the position and shape of the object from the circle, and get the type of the icon, which is a 'PC'. Actually, if PowerPoint were clever enough, it could input a PC icon here. Similarly, if I draw a line and say 'smartphone', a text is inserted here.
There is some previous research on multimodal input. Put-That-There is a voice-and-gesture system for inputting and modifying objects. As this picture shows, …
In the talking-and-drawing case…
The problem with Put-That-There is that, when users are just freely talking and drawing, it may cause unintentional behaviors such as inserting wrong objects or sending unintended commands.
For example, if I am saying something about a PC while drawing a circle, the system may insert a PC icon accidentally. So we introduced two modes. The first is the free mode; in the free mode…. The other is the command mode; in the command mode….
Then how to switch between…
We explored four mode switch techniques.
A basic technique is pressing a button to specify the start and the end of the command mode.
However, pressing a button twice may be a pain for users. Therefore, we introduced the Tap technique, where users tap a panel before drawing to start the command mode. Users do not have to specify the end of the command mode; the system judges the end automatically by recognizing a break in drawing and talking.
However, in the Tap technique, users must change the holding posture of their hand to draw with the pen after tapping with a finger. To lessen this problem, we introduced the Pen-holding technique, where users keep the pen still for a short period of time to start the command mode. In this technique, users do not have to change their hand posture.
These three techniques require users to specify the mode switch before entering the command mode. However, specifying the mode before acting might be difficult for users and might disturb natural talking, because users must judge which mode to choose before drawing or speaking.
Therefore, we prepared another technique, the Pigtail technique. Users do not have to specify anything at the start of their actions; instead, they specify the command mode at the end of drawing with a special gesture, a crossed curve called a pigtail. In this technique, users need not care about the mode while talking and drawing; they specify whether an action was a command after performing it.
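The pigtail is a crossed curve at the tail of a stroke, so it can be detected as a self-intersection near the stroke's end. The following is a minimal sketch of that idea, not the prototype's actual C# code; the function names, the tail window of 10 segments, and the point format are all assumptions.

```python
# Hypothetical sketch: detect a "pigtail" ending by checking whether one of
# the last few stroke segments crosses an earlier, non-adjacent segment.

def _segments_intersect(p1, p2, p3, p4):
    """Return True if segment p1-p2 properly crosses segment p3-p4."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1 = cross(p3, p4, p1)
    d2 = cross(p3, p4, p2)
    d3 = cross(p1, p2, p3)
    d4 = cross(p1, p2, p4)
    # The two endpoints of each segment lie on opposite sides of the other.
    return ((d1 > 0) != (d2 > 0)) and ((d3 > 0) != (d4 > 0))

def ends_with_pigtail(points, tail=10):
    """True if one of the last `tail` segments crosses an earlier segment."""
    n = len(points)
    for i in range(max(1, n - tail), n - 1):   # segments near the stroke end
        for j in range(0, i - 1):              # earlier, non-adjacent segments
            if _segments_intersect(points[i], points[i + 1],
                                   points[j], points[j + 1]):
                return True
    return False
```

A stroke that ends in a loop crossing its own path returns True; a straight stroke returns False. A production version would also want to limit the loop's size so large self-crossing drawings are not misread as pigtails.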
We built a prototype system using C# on a Surface Pro 3 with a Surface Pen. The speech recognition engine recognizes users' speech during the command mode, and we used an open-source pen gesture recognizer, the $P Point-Cloud Recognizer, to recognize the shape of users' pen strokes once the command is completed.
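The core of the $P recognizer is treating a stroke as an unordered point cloud, normalizing it, and greedily matching its points against template clouds. This is a simplified sketch of that matching step (it assumes stroke and templates already have the same number of points and skips $P's resampling and bidirectional matching), not the recognizer's reference implementation:

```python
# Hypothetical, simplified sketch of $P-style point-cloud matching
# (after Vatavu et al. 2012). Assumes equal-sized point clouds.
import math

def _normalize(points):
    """Scale the cloud to a unit bounding box and center it on its centroid."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    scale = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    cx = sum(xs) / len(xs)
    cy = sum(ys) / len(ys)
    return [((x - cx) / scale, (y - cy) / scale) for x, y in points]

def _cloud_distance(a, b):
    """Greedily match each point of a to the nearest unmatched point of b,
    weighting earlier matches more heavily."""
    n = len(a)
    matched = [False] * n
    total = 0.0
    for i, p in enumerate(a):
        best_j, best_d = -1, float("inf")
        for j, q in enumerate(b):
            if not matched[j]:
                d = math.dist(p, q)
                if d < best_d:
                    best_j, best_d = j, d
        matched[best_j] = True
        total += best_d * (1 - i / n)
    return total

def recognize(stroke, templates):
    """Return the name of the template cloud closest to the stroke."""
    s = _normalize(stroke)
    return min(templates,
               key=lambda name: _cloud_distance(s, _normalize(templates[name])))
```

With circle, rectangle, and line templates, `recognize` would pick the closest shape for a finished command stroke; the real $P also resamples strokes to a fixed point count and matches in both directions.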
This figure shows how it works. The command starts when the user starts to draw, and it ends when no pen or voice input is detected within a 0.5 s break.
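That end-of-command rule can be sketched as a small segmenter that collects pen and voice events and closes the command after a silent gap. The class name, event format, and `feed` API are assumptions for illustration; the paper only specifies the 0.5 s threshold.

```python
# Hypothetical sketch: segment pen/voice events into commands using a
# 0.5 s no-input break, as TalkingDraw does.
import time

class CommandSegmenter:
    """Groups pen and voice events into one command; the command ends when
    no input of either kind arrives for `timeout` seconds."""

    def __init__(self, timeout=0.5):
        self.timeout = timeout
        self.events = []        # events of the command in progress
        self.last_input = None  # timestamp of the most recent input

    def feed(self, kind, payload, t=None):
        """Register an input event; return the finished command's events
        if the gap since the last input exceeded the timeout, else None."""
        t = time.monotonic() if t is None else t
        finished = None
        if self.last_input is not None and t - self.last_input > self.timeout:
            finished = self.events  # the previous command ended at the break
            self.events = []
        self.events.append((kind, payload, t))
        self.last_input = t
        return finished
```

In a real system a timer would also fire during true silence (rather than waiting for the next event), but the gap logic is the same.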
In the current system, we recognize three kinds of stroke shapes. The first is a circle: when I draw a circle here and say "PC", a PC icon appears here. The second is …. The third is …. In particular, if a line connects two objects, it becomes a link.
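The shape-plus-voice rules from slide 10 amount to a small dispatch table. A minimal sketch, with function and parameter names that are assumptions rather than TalkingDraw's API:

```python
# Hypothetical sketch of TalkingDraw's stroke + voice dispatch, following
# the rules on slide 10 (circle -> icon, rectangle -> text box,
# line -> text, line between objects -> link).
def dispatch(shape, text=None, connects_two_objects=False):
    """Map a recognized stroke shape and the recognized speech to a behavior."""
    if shape == "circle" and text:
        return f"insert icon named '{text}'"
    if shape == "rectangle" and text:
        return f"show '{text}' in a text box"
    if shape == "line":
        if connects_two_objects:
            return "make a link between the two objects"
        if text:
            return f"show '{text}' as simple text"
    return "no action"
```

For example, a circle drawn while saying "PC" yields the icon action, while a line whose endpoints land on two existing objects yields a link even with no speech.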
This is a demonstration video showing how TalkingDraw can be used in an elementary school class. We used Pigtail in this video.
In the first experiment, we evaluated the four techniques in a talking-in-drawing task. Participants had to speak a given sentence and insert icons while speaking.
This graph shows the task completion time for each technique. We found that Pigtail and Tap are much faster than Pen-holding and Button. Participants also reported that Pigtail was comfortable, even though they complained about the accuracy of its gesture recognition. In contrast, they found that pressing the button twice was a pain, and that it is hard to hold the pen still on the screen.
In the second experiment, we evaluated the four techniques in a making-in-advance task. Participants had to draw a given figure like this one.
We found that Button is still the slowest, but there is no big difference among the other techniques. Participants also reported that Button was more comfortable than in Experiment 1.
We found that Pigtail performed best in Experiment 1. We think the reason is that it specifies the command mode after actions, so users do not need to think while drawing. We also need to improve the accuracy of pen gesture recognition for Pigtail, which affected its performance in the experiment.
No big difference was found in Experiment 2. Because participants do not need to think while drawing a figure, the techniques that specify the command mode before actions performed better than in Experiment 1.
Finally, the future work. The accuracy of Pigtail recognition and the accuracy of Japanese speech recognition can still be improved. Furthermore, we found that the voice input and the drawing are not concurrent. For example, if I want to insert a PC icon, I may draw the circle before I say "PC" in the sentence. This is a problem we need to address in the next experiment.