Deep Learning and Intelligent Applications
Dr Xuedong Huang from Microsoft discusses deep learning and intelligent applications. He explains that big data and GPUs enable deep learning to perform tasks like speech recognition and computer vision. CNTK is introduced as Microsoft's deep learning toolkit, which balances efficiency, performance, and flexibility; it allows models to be described with code, description languages, or scripts, and supports training on CPUs and GPUs. Project Oxford's APIs are summarized, including APIs for vision, speech, language, and spelling. These APIs make it easy for developers to incorporate intelligent services into their applications.
1. Deep Learning and Intelligent Applications
Dr Xuedong Huang
Distinguished Engineer & Head of Advanced Technology Group
Microsoft Technology and Research
xdh@microsoft.com
4. What drives speech technology progress?
• Stage: our computing infrastructure, including GPUs, is a stage for the performers
• Oxygen: big usage data is the oxygen that beautifies our performance
• Performers: deep learning, as our top performer, is changing everything
5. 2015 System
Human error rate: 4%
Speech recognition could reach human parity in the next 3 years
10. Design Goal of CNTK
• A deep learning tool that balances:
  • Efficiency: can train production systems as fast as possible
  • Performance: can achieve state-of-the-art performance on benchmark tasks and production systems
  • Flexibility: can support various tasks such as speech, image, and text, and can try out new ideas quickly
• Inspiration: Legos
  • Each brick is very simple and performs a specific function
  • Arbitrary objects are created by combining many bricks
  • CNTK enables the creation of existing and novel models by combining simple functions in arbitrary ways
5/25/2016 10
11. Functionality
• Supports:
  • CPU and GPU, with a focus on GPU clusters
  • Windows and Linux
  • Automatic numerical differentiation
  • Efficient static and recurrent network training through batching
  • Data parallelization within and across machines with 1-bit quantized SGD
  • Memory sharing during execution planning
• Modularized: separation of
  • computational networks
  • execution engine
  • learning algorithms
  • model description
  • data readers
• Models can be described and modified with:
  • C++ code
  • Network Definition Language (NDL) and Model Editing Language (MEL)
  • BrainScript (beta)
  • Python and C# (planned)
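The 1-bit quantized SGD mentioned above compresses each gradient value to a single bit before communication, carrying the quantization error into the next minibatch. A minimal sketch of that idea (error feedback with a mean-magnitude reconstruction scale, not CNTK's actual implementation):

```python
def one_bit_quantize(grad, residual):
    """Quantize a gradient vector to one bit per value, carrying the
    quantization error forward as a residual (error feedback).
    A toy sketch of the idea behind 1-bit SGD, not CNTK's code."""
    corrected = [g + r for g, r in zip(grad, residual)]
    # Reconstruction scale: mean magnitude of the error-corrected gradient
    scale = sum(abs(c) for c in corrected) / len(corrected)
    quantized = [scale if c >= 0 else -scale for c in corrected]  # 1 bit each
    new_residual = [c - q for c, q in zip(corrected, quantized)]  # kept locally
    return quantized, new_residual
```

Each worker then only exchanges the sign bits plus one scale per vector; the residual kept locally makes the long-run communicated gradient track the true one.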
13. At the Heart: Computational Networks
• A generalization of machine learning models that can be described as a series of computational steps
  • E.g., DNN, CNN, RNN, LSTM, DSSM, log-linear model
• Representation:
  • A list of computational nodes, denoted n = {node name : operation name}
  • A parent-children relationship describing the operands: {n : c_1, ..., c_{K_n}}, where K_n is the number of children of node n (for leaf nodes, K_n = 0)
  • The order of the children matters: e.g., XY is different from YX
  • Given the inputs (operands), the value of a node can be computed
• Can flexibly describe deep learning models
• Adopted by many other popular tools as well
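The node-plus-ordered-children abstraction above can be made concrete in a few lines. A hypothetical sketch (not CNTK code) of a computational network evaluated by recursing over children:

```python
# Toy computational network: each node has an operation name and an
# ordered list of children (operands), as described above.
OPS = {
    "Plus":  lambda a, b: a + b,
    "Times": lambda a, b: a * b,  # order of children matters: XY != YX
}

def evaluate(network, name, inputs, cache=None):
    """Compute the value of node `name`. Leaf ("Input") nodes read
    their value from `inputs`; interior nodes apply their operation
    to the values of their children."""
    cache = {} if cache is None else cache
    if name not in cache:
        op, children = network[name]
        if op == "Input":
            cache[name] = inputs[name]
        else:
            cache[name] = OPS[op](*(evaluate(network, c, inputs, cache)
                                    for c in children))
    return cache[name]

# y = x * w + b expressed as {node name: (operation name, children)}
net = {
    "x":  ("Input", []), "w": ("Input", []), "b": ("Input", []),
    "xw": ("Times", ["x", "w"]),
    "y":  ("Plus",  ["xw", "b"]),
}
```

Because each node is just an operation over its children's values, new model families (DNN, RNN, etc.) reduce to adding new operations and wiring, which is the Lego-style composition the design goals describe.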
14. CNTK Summary
• CNTK is a powerful tool that supports CPU/GPU and runs under Windows/Linux
• CNTK is extensible thanks to its low-coupling modular design: adding new readers and new computation nodes is easy
• The network definition language, macros, and model editing language (as well as BrainScript and Python bindings in the future) make network design and modification easy
• Compared to other tools, CNTK strikes a great balance between efficiency, performance, and flexibility
15. CNTK Computational Performance
[Chart: speed comparison in frames/second (higher is better) for CNTK, Theano, TensorFlow, Torch 7, and Caffe, each with 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs)]
Theano only supports 1 GPU.
We report 8 GPUs (2 machines) for CNTK only, as it is the only public toolkit that can scale beyond a single machine. Our system can scale beyond 8 GPUs across multiple machines with superior distributed-system performance.
17. Project Oxford – Adding “smart” to your applications
A portfolio of APIs, SDKs, and apps that enable developers to easily add intelligent services, such as vision or speech capabilities, to their solutions.
22. Computer Vision APIs
Analyze an Image
Understand content within an image
OCR
Detect and recognize words within an image
Generate Thumbnail
Scale and crop images while retaining key content
23. Analyze Image
Type of Image:
• Clip Art Type: 0 (non-clipart)
• Line Drawing Type: 0 (non-line drawing)
• Black & White Image: False
Content of Image:
• Categories: [{ “name”: “people_swimming”, “score”: 0.099609375 }]
• Adult Content: False
• Adult Score: 0.18533889949321747
• Faces: [{ “age”: 27, “gender”: “Male”, “faceRectangle”: { “left”: 472, “top”: 258, “width”: 199, “height”: 199 } }]
Image Colors:
• Dominant Color Background: White
• Dominant Color Foreground: Grey
• Dominant Colors: White
• Accent Color:
24. OCR
LIFE IS LIKE
RIDING A BICYCLE
TO KEEP YOUR BALANCE
YOU MUST KEEP MOVING
JSON:
{
"language": "en",
"orientation": "Up",
"regions": [
{
"boundingBox": "41,77,918,440",
"lines": [
{
"boundingBox": "41,77,723,89",
"words": [
{
"boundingBox": "41,102,225,64",
"text": "LIFE"
},
{
"boundingBox": "356,89,94,62",
"text": "IS"
},
{
"boundingBox": "539,77,225,64",
"text": "LIKE"
}
. . .
Good At:
• Scanned Documents
• Photos with Text
• Fine-Grained Location Information
Needs Improvement:
• Vehicle License Plates
• Handwritten Text
• Characters with Large Sizes
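The nested regions → lines → words structure in the JSON above can be flattened back into plain text. A small sketch, assuming a result dict shaped like the response shown (the sample here repeats the first line of that response):

```python
def ocr_to_text(result):
    """Reassemble recognized text from an OCR result shaped like the
    JSON above: regions contain lines, lines contain words. Words are
    joined with spaces and lines with newlines."""
    lines = []
    for region in result.get("regions", []):
        for line in region.get("lines", []):
            lines.append(" ".join(w["text"] for w in line["words"]))
    return "\n".join(lines)

# First line of the sample response above
sample = {
    "language": "en",
    "orientation": "Up",
    "regions": [{
        "boundingBox": "41,77,918,440",
        "lines": [{
            "boundingBox": "41,77,723,89",
            "words": [
                {"boundingBox": "41,102,225,64", "text": "LIFE"},
                {"boundingBox": "356,89,94,62", "text": "IS"},
                {"boundingBox": "539,77,225,64", "text": "LIKE"},
            ],
        }],
    }],
}
```

The per-word bounding boxes are what the slide means by fine-grained location information: callers can recover not just the text but where each word sits in the image.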
26. Face APIs
Face Detection
Detect faces and their attributes within an image
Face Verification
Check if two faces belong to the same person
Similar Face Searching
Find similar faces within a set of images
Face Grouping
Organize many faces into groups
Face Identification
Search which person a face belongs to
30. Video APIs
Stabilization
Smooth and stabilize shaky video
Face Detection and Tracking
Detect and track faces in videos
Motion Detection
Detect when motion occurs
31. Stabilization
The Stabilization API provides automatic video stabilization and smoothing for shaky videos. This API uses many of the same technologies found in Microsoft Hyperlapse.
Best for: small camera motions, with or without rolling-shutter effects (e.g., holding a static camera, or walking at a slow pace).
32. Face Detection and Tracking
High-precision face location detection and tracking.
Can detect up to 64 human faces in a video (faces no smaller than 24x24 pixels).
Detected and tracked faces are returned with coordinates and a Face ID to track them throughout the video.
Time (sec) Face ID x, y Width, Height
0 0 0.59, 0.23 0.09, 0.16
0 1 0.38, 0.15 0.07, 0.12
1 0 0.54, 0.25 0.09, 0.15
1 1 0.23, 0.18 0.07, 0.12
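The coordinates in the table above are fractions of the frame size, so consumers typically rescale them and group rows by Face ID. A sketch of both steps (the 1920x1080 frame size is a hypothetical example, not from the slide):

```python
def to_pixel_box(x, y, w, h, frame_w, frame_h):
    """Convert a normalized face rectangle (fractions of the frame,
    as in the table above) to integer pixel coordinates."""
    return (round(x * frame_w), round(y * frame_h),
            round(w * frame_w), round(h * frame_h))

def group_by_face_id(rows):
    """Group per-second detections into per-face tracks keyed by the
    Face ID column, preserving time order within each track."""
    tracks = {}
    for t, face_id, box in rows:
        tracks.setdefault(face_id, []).append((t, box))
    return tracks
```

With the first two rows of the table, Face ID 0 and Face ID 1 each get their own track, which is what lets a caller follow one person across the whole video.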
33. Motion Detection
Indicates when motion occurs against a fixed background (e.g. surveillance video)
Trained to reduce false alarms, such as lighting and shadow changes.
Current limitations:
• No support for night-vision videos
• Semi-transparent and small objects are not detected well
Start Time End Time In Region
1.9 3.6 0
5.2 15.1 0
34. Speech APIs
Voice Recognition (Speech to Text)
Converts spoken audio to text
Voice Output (Text to Speech)
Synthesize audio from text
Speaker ID & Diarisation
Coming soon
36. Voice Recognition
                    Short Form      Long Form
Duration of Audio   < 15 seconds    < 2 minutes
Final Result        n-best choices  Best choice, delivered at sentence pauses
Partial Results     Yes             Yes
37. Voice Output
Synthesize audio from text via a POST request
Maximum audio return of 15 seconds
17 languages supported
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="http://www.w3.org/2001/mstts"
xml:lang="en-US">
<voice name="Microsoft Server Speech Text to Speech
Voice (en-US, ZiraRUS)">
Synthesize audio from text, to speak to your users.
</voice></speak>
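A request body like the SSML above is usually assembled programmatically, escaping whatever text the user supplies. A minimal sketch that reproduces the document shown (the helper name is illustrative; the namespaces and voice name are taken from the snippet above):

```python
from xml.sax.saxutils import escape

SSML = ('<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        'xml:lang="{lang}">'
        '<voice name="{voice}">{text}</voice></speak>')

def build_ssml(text, lang="en-US",
               voice="Microsoft Server Speech Text to Speech Voice "
                     "(en-US, ZiraRUS)"):
    """Assemble an SSML request body like the one above, escaping XML
    special characters in the text to be spoken."""
    return SSML.format(lang=lang, voice=voice, text=escape(text))
```

The resulting string is what gets POSTed to the service; escaping matters because user text containing `&` or `<` would otherwise produce invalid XML.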
38. Speaker Recognition APIs
Speaker Verification
Check if two voices are the same
Speaker Identification
Identify who is speaking
39. Speaker Recognition APIs
Enrollment
Create a unique voiceprint for a profile
Recognition
After enrolling one or more voices, identify who is speaking from an audio clip
Verification
Confirm if a voice belongs to a previously enrolled profile
[Diagram: Verification – “Is this Anna’s voice?” against Anna’s enrolled profile; Identification – “Whose voice is this?” among enrolled profiles for Anna, Mike, and Mary]
41. Custom Recognition Intelligent Service
Create custom language models for the vocabulary of the application
Adapt acoustic models to better match the expected environment of the application’s users
Deploy to a custom endpoint and access from any device
42. Spell Check APIs
State-of-the-art cloud-based spelling algorithms
Recognizes a wide variety of spelling errors
Recognizes name errors and homonyms in context
Catches difficult-to-spot errors that depend on the context of the surrounding words
Updates over time
Supports new brands and coined expressions as they emerge
43. Spell Check APIs
Check a single word or a whole sentence
“Our engineers developed this four you!”
Corrected Text: “four” → “for”
Identify errors and get suggestions
"spellingErrors": [
  {
    "offset": 5,
    "token": "gona",
    "type": "UnknownToken",
    "suggestions": [
      { "token": "gonna" }
    ]
  }
]
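Each error carries a character offset and the offending token, which is enough to splice the top suggestion back into the original string. A sketch (the helper name and the sample sentence are illustrative, not from the API docs):

```python
def apply_corrections(text, spelling_errors):
    """Apply Spell Check suggestions to the original string. Each error
    gives a character offset and token; errors are applied right to
    left so earlier offsets stay valid as the string changes length."""
    for err in sorted(spelling_errors, key=lambda e: e["offset"], reverse=True):
        if not err["suggestions"]:
            continue  # nothing to substitute for this token
        best = err["suggestions"][0]["token"]
        start = err["offset"]
        end = start + len(err["token"])
        text = text[:start] + best + text[end:]
    return text
```

Applying right to left is the key detail: replacing "gona" with the longer "gonna" shifts every later character, so left-to-right application would corrupt subsequent offsets.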
45. Language Understanding Intelligent Service
Reduce labeling effort with interactive featuring
Use visualizations to gauge performance and improvements
Leverage speech recognition with seamless integration
Deploy using just a few examples with active learning
Key messages:
We believe our research in service of these three ambitions will lead to what we think of as an “invisible revolution”, where increases in our capabilities are powered by technology that moves further out of sight.
The innovation will come from the shift to the cloud: device power that used to sit right in front of you is magnified and moved to the cloud (invisible processing power, storage, intelligence, etc.).
The innovation will come from what’s happening inside and around the device, rather than from the object you can see: capabilities that come from machine learning, powerful algorithms, the cloud, intelligence, and so on.
The innovation will come from an ecosystem of computing that surrounds you: pervasive computing that will become so natural, it disappears into the background.
When technology becomes more powerful but less intrusive, it can fit into more parts of our world and solve an even wider range of problems.
Complexity factors:
• Each device brings different memory, CPU, and power constraints
• Networks bring latency
• New languages bring new data sources
• New scenarios bring novel acoustics
Strategies:
• Re-architect the runtime to reduce device footprint and latency
• Universal models, semi-supervised and unsupervised learning
• Personalization
• Modular acoustic models (AMs)
• Crowdsourced field collection programs
Project Oxford is broken into three areas of understanding. Each of these areas offers a range of APIs that developers can use within their applications.
The first area is Vision: the ability to understand the content within photos and videos.
Then we have Speech: the ability to understand and generate spoken words given a segment of audio.
And finally, Language: the ability to understand language and the context in which it applies.
All of these APIs are available for public use today.
Pre-built models from Bing and Cortana: LUIS provides access to select Cortana models and lets you include them in your app. These are deep models that know, for example, that “Ronald Reagan” is a US President, actor, author, and person, and that can convert “5:45 PM tomorrow” into a machine-readable form.
Interactive featuring to reduce labeling effort: other platforms let developers improve models by providing more labels; LUIS goes further by allowing developers to provide features. This lets a developer tell LUIS that “car, truck, motorcycle, and SUV” are all types of vehicles. These classes help LUIS generalize faster, so if it sees “I need insurance for my SUV”, it generalizes to all types of vehicles immediately. This reduces labeling effort.
Powerful visualizations to gauge performance and pinpoint improvements: A suite of visualizations enable developers to gauge the performance of their models, and pinpoint if and where improvements are needed
Active learning for continual improvement: once a model is deployed and utterances begin rolling in, active learning prioritizes them so only the utterances which will yield the biggest improvement need to be labeled, again reducing labeling effort.
Built on statistical models, not rules