Azure Weekend 2020 –
A Journey with Microsoft
Azure Cognitive Service I
Marvin Heng | @hmheng
www.techconnect.io
Microsoft AI
Azure
Cognitive
Services
From faces to feelings, allow your
apps to understand images and video
Hear and speak to your users by filtering noise, identifying
speakers, and understanding intent
Process text and learn how to recognize what
users want
Tap into rich knowledge amassed from
the web, academia, or your own data
Access billions of web pages, images, videos, and news with
the power of Bing APIs
Why Azure Cognitive Services?
Cognitive Services
Emotion
Computer Vision
Face
Video Indexer
Form Recognizer
Speech To Text
Text To Speech
Speech Translation
Speaker Recognition
Immersive Reader
Language
Understanding
QnA Maker
Text Analytics
Translator
Anomaly Detector
Content Moderator
Metrics Advisor
Personalizer
Bing Autosuggest
Bing Custom Search
Bing Entity Search
Bing Image Search
Bing News Search
Bing Spell Check
Bing Video Search
Bing Visual Search
Bing Web Search
Custom Vision
Bing Search
Bing Search
• Lets developers add a search function to their apps so that
users can find webpages, images, news, locations, and more,
without advertisements
• For knowledge mining
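A minimal sketch of how an app might prepare a Bing Web Search call. The endpoint URL, the `count` parameter and the placeholder key are assumptions for illustration; check the endpoint shown on your own resource. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed v7 endpoint; confirm it against your own Bing Search resource.
BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"


def build_bing_search_request(query: str, key: str, count: int = 10) -> Request:
    """Build (but do not send) a GET request for a web search."""
    params = urlencode({"q": query, "count": count})
    return Request(
        f"{BING_ENDPOINT}?{params}",
        # The subscription key travels in this header.
        headers={"Ocp-Apim-Subscription-Key": key},
    )


# "YOUR_KEY" is a placeholder, not a real key.
req = build_bing_search_request("azure cognitive services", key="YOUR_KEY")
print(req.get_method(), req.full_url)
```

To actually run the search, pass `req` to `urllib.request.urlopen` and decode the JSON body of the response.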
Bing Search
Autosuggest
Entity Search
Custom Search
Image Search
News Search
Video Search
Visual Search
Spell Check
Local Business
Vision
Computer Vision
• Computer vision is an area of artificial intelligence (AI) in which
software systems are designed to perceive the world visually,
through cameras, images, and video.
• Computer vision is one of the core areas of artificial intelligence
(AI), and focuses on creating solutions that enable AI-enabled
applications to "see" the world and make sense of it.
Use Cases of Computer Vision
• Analyze an image and suggest an appropriate caption.
• Suggest relevant tags that could be used to index an image.
• Categorize an image.
• Identify objects in an image.
• Detect faces and people in an image.
• Recognize celebrities and landmarks in an image.
• Read text in an image.
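Several of these use cases can be requested in one Analyze Image call. The sketch below only builds the request; the API version (`v3.2`), the example endpoint, the image URL and the key are assumptions, and reading text normally goes through a separate OCR/Read operation that is not shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_analyze_request(endpoint: str, key: str, image_url: str) -> Request:
    """Build (but do not send) an Analyze Image request for a public image URL."""
    query = urlencode({
        # Caption, tags, categories, objects and faces in one call.
        "visualFeatures": "Description,Tags,Categories,Objects,Faces",
        # Domain-specific details: celebrities and landmarks.
        "details": "Celebrities,Landmarks",
    })
    url = f"{endpoint.rstrip('/')}/vision/v3.2/analyze?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
    )


# Hypothetical resource endpoint, key and image URL.
cv_req = build_analyze_request(
    "https://example.cognitiveservices.azure.com/",
    "YOUR_KEY",
    "https://example.com/city.jpg",
)
print(cv_req.get_method(), cv_req.full_url)
```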
What can CV tell us?
• A black and white photo of a city
• A black and white photo of a large city
• A large white building in a city
Not only that! It tags too!
• Tagging
• Type of identified object
• Bounding Box
• Set of coordinates (Top, left, width and height)
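A small sketch of working with a bounding box given as top, left, width and height. The `BoundingBox` class and the `iou` helper are hypothetical names for illustration, not part of any SDK; pixel coordinates with the origin at the top-left corner are assumed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoundingBox:
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height

    def corners(self) -> Tuple[int, int, int, int]:
        """Return (left, top, right, bottom), the form most image libraries crop with."""
        return (self.left, self.top, self.right, self.bottom)

    def area(self) -> int:
        return self.width * self.height


def iou(a: BoundingBox, b: BoundingBox) -> float:
    """Intersection over union: 0.0 for no overlap, 1.0 for identical boxes."""
    overlap_w = min(a.right, b.right) - max(a.left, b.left)
    overlap_h = min(a.bottom, b.bottom) - max(a.top, b.top)
    if overlap_w <= 0 or overlap_h <= 0:
        return 0.0
    intersection = overlap_w * overlap_h
    union = a.area() + b.area() - intersection
    return intersection / union


box = BoundingBox(left=10, top=20, width=100, height=50)
print(box.corners())  # (10, 20, 110, 70)
```

`iou` is handy when comparing a detected object against a hand-labelled box for the same image.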
Image Categorization
Categorization in 86-category taxonomy
abstract_ animal_horse building_street food_grilled others_ outdoor_road people_hand plant_tree text_menu
abstract_net animal_panda dark_ food_pizza outdoor_ outdoor_sportsfield people_many object_screen text_sign
abstract_nonphoto building_ drink_ indoor_ outdoor_city outdoor_stonerock people_portrait object_sculpture trans_bicycle
abstract_rect building_arch drink_can indoor_churchwindow outdoor_field outdoor_street people_show sky_cloud trans_bus
abstract_shape building_brickwall dark_fire indoor_court outdoor_grass outdoor_water people_tattoo sky_sun trans_car
abstract_texture building_church dark_fireworks indoor_doorwindows outdoor_house outdoor_waterside people_young people_swimming trans_trainstation
animal_ building_corner sky_object indoor_marketstore outdoor_mountain people_ plant_ outdoor_pool
animal_bird building_doorwindows food_ indoor_room outdoor_oceanbeach people_baby plant_branch text_
animal_cat building_pillar food_bread indoor_venue outdoor_playground people_crowd plant_flower text_mag
animal_dog building_stair food_fastfood dark_light outdoor_railway people_group plant_leaves text_map
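The category names read as parent_child pairs, with a bare trailing underscore (for example animal_) standing for the parent category itself. A minimal sketch of grouping names that way; the helper names are hypothetical and the parent/child reading of the underscore is an assumption based on the naming pattern.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def split_category(name: str) -> Tuple[str, Optional[str]]:
    """Split 'animal_dog' into ('animal_', 'dog'); 'animal_' gives ('animal_', None)."""
    parent, _, child = name.partition("_")
    return parent + "_", (child or None)


def group_by_parent(names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each parent category to the list of its child names."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        parent, child = split_category(name)
        children = groups[parent]  # creates the parent entry even with no child
        if child is not None:
            children.append(child)
    return dict(groups)


sample = ["animal_", "animal_dog", "animal_cat", "text_menu", "outdoor_"]
print(group_by_parent(sample))
```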
Domain-specific content
Optical character recognition
Faith
CAN MOVE
MOUNTAINS
Some Additional Capabilities
• Detect image types (for example, clip art or line drawings)
• Detect image color schemes
• Generate thumbnails
• Moderate content
Custom Vision
• Azure Custom Vision is an image recognition service that lets you
build, deploy, and improve your own image identifiers.
• An image identifier applies labels (which represent classes or
objects) to images, according to their visual characteristics.
• The Custom Vision service uses a machine learning algorithm to
analyze images.
What can Custom Vision do?
• Classification
• Object Detection
• Export a trained model as a
standalone model that runs
offline in your app.
Face Detection
Face Verification
Verification result: The two faces belong to the same
person. Confidence is 0.93468.
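A verification result like the one above boils down to a confidence score compared against a threshold. The sketch below shows that decision; the function name is hypothetical and the 0.5 default threshold is an assumption, so tune it to how strict your scenario needs to be.

```python
def verification_message(confidence: float, threshold: float = 0.5) -> str:
    """Turn a face-verification confidence score into a readable verdict."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Assumed rule: scores at or above the threshold count as the same person.
    same_person = confidence >= threshold
    verdict = (
        "belong to the same person" if same_person else "belong to different people"
    )
    return (
        f"Verification result: The two faces {verdict}. "
        f"Confidence is {confidence:.5f}."
    )


print(verification_message(0.93468))
```

A stricter threshold (for example 0.8) trades more false rejections for fewer false matches, which usually suits sign-in scenarios better.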
Perceived emotion recognition
Video Indexer
• Video Indexer extracts deep insights from video
(with no need for data analysis or coding skills) using
machine learning models that work across multiple channels
(voice, vocals, visual).
• The service enables deep search, reduces operational
costs, enables new monetization opportunities, and
creates new user experiences on large archives of
videos (with low entry barriers).
Video Indexer
• Keywords extraction
• Named entities extraction
• Topic inference
• Artifacts
• Sentiment analysis: Identifies positive, negative, and
neutral sentiments from speech and visual text.
Video Indexer
Use Cases of Video Indexer
• Deep search
• Content creation
• Accessibility
• Monetization
• Content moderation
• Recommendations
Video Indexer
Face detection
Celebrity identification
Account-based face identification
Visual text recognition
Visual content moderation
Labels identification
Scene segmentation
Shot detection
Black frame detection
Keyframe extraction
Rolling credits
Animated characters detection
Editorial shot type detection
Audio transcription
Automatic language detection
Multi-language speech identification and transcription
Two channel processing
Closed captioning
Noise reduction
Transcript customization (CRIS)
Speaker enumeration
Speaker statistics
Textual content moderation
Audio effects
Emotion detection
Translation
Form Recognizer
• Extract text and data from business forms and documents.
• Easily extract text and structure with a simple REST API
• Pre-trained models:
• Receipt
• Business Card
• Layout
• Custom-trained models
• Supports printed and handwritten forms, PDFs and images.
• Container support
What can you do with Form Recognizer?
• Automate conversion of written text to digital text
• Automate capturing receipt data
• Automate converting business cards into digital contacts
Speech
Speech-to-Text
• Speech-to-text service
• Improves meeting efficiency by transcribing conversations in real-time
• Helps safeguard data with industry-leading security and compliance
certifications.
• Integrates with a variety of meeting conference solutions including
Microsoft Teams and other third-party meeting software.
• SDK is available.
Speech-to-Text
Speaker Recognition
“who is speaking?”
Speaker Verification
• Text-dependent verification means
speakers need to choose the same
passphrase to use during both
enrollment and verification phases.
• Text-independent verification means
speakers can speak in everyday
language in the enrollment and
verification phases.
Text-to-Speech
• Convert text into human-like synthesized speech.
• Offers 75+ standard voices in more than 45 languages and locales,
and 5 neural voices
• Tune voice output by easily adjusting rate, pitch, pronunciation,
pauses, and more.
• Speech synthesis
• Asynchronous synthesis of long audio
• Speech Synthesis Markup Language (SSML)
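Rate, pitch and pauses are tuned through SSML. A minimal sketch that builds an SSML document with the standard library; the function name, the default voice name and the default values are assumptions for illustration, so substitute a voice available in your region.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(
    text: str,
    voice: str = "en-US-AriaNeural",  # assumed voice name
    rate: str = "-10%",
    pitch: str = "+5%",
    pause_ms: int = 300,
    lang: str = "en-US",
) -> str:
    """Return an SSML string that speaks `text`, then pauses for `pause_ms`."""
    ET.register_namespace("", SSML_NS)  # emit SSML as the default namespace
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    # prosody carries the rate and pitch adjustments.
    prosody = ET.SubElement(
        voice_el, f"{{{SSML_NS}}}prosody", {"rate": rate, "pitch": pitch}
    )
    prosody.text = text
    # break inserts a pause after the spoken text.
    ET.SubElement(prosody, f"{{{SSML_NS}}}break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Welcome to Azure Weekend.")
print(ssml)
```

Building the markup with an XML library rather than string formatting keeps characters such as `&` and `<` in the text properly escaped.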
Speech Translation
Microsoft’s Translation Engine
Statistical machine translation (SMT)
Neural machine translation (NMT)
Speech Translation
• Speech-to-text translation with recognition results.
• Speech-to-speech translation.
• Support for translation to multiple target languages.
• Interim recognition and translation results.
Use case of Speech Service
