AI in RTC - RTC Korea 2018

Chad Hart examines the use of AI and Machine Learning (ML) in Real Time Communications (RTC) applications including speech analytics, voicebots, computer vision, and ML optimization of RTC components. Chad includes examples from his AI in RTC research report, webrtcHacks, and cogint.ai.



AI in RTC - RTC Korea 2018

  1. cwh.consulting Artificial Intelligence in Real Time Communications (AI in RTC) RTC Korea 1 November 2018
  2. cwh.consulting About Me: Chad Hart, Analyst & Product Consultant – https://cwh.consulting @chadwallacehart chad@cwh.consulting • A blog for WebRTC developers: webrtcHacks.com @webrtcHacks • AI & RTC blog: cogint.ai @cogintai • WebRTC and ML for Developer Event, November 16, 2018 in San Francisco: krankygeek.com
  3. cwh.consulting AI in RTC Research Study • Authors • Chad Hart – cwh.consulting • Tsahi Levent-Levi – BlogGeek.me • Methodology • 40+ 1-on-1 vendor interviews • ~100 respondent web survey • Analysis of 126 companies & all major products • Output: 147-page report
  4. cwh.consulting What is AI in RTC? [Diagram: AI + RTC] Image source: pixabay.com/en/a-i-ai-anatomy-2729782
  5. cwh.consulting AI in RTC use case categories: speech analytics, voicebots, RTC optimization, computer vision Image source: pixabay.com/en/a-i-ai-anatomy-2729782
  6. cwh.consulting Speech Analytics • Call center agent monitoring • Transcription • Translation • Agent coaching • Customer engagement
  7. cwh.consulting Promise: machine transcription at human levels Source: Google I/O 2017 keynote
  8. cwh.consulting Reality: transcription quality is often not so great. Machine Transcription: “My name is a chat heart of you might be familiar with Dave from a brand or if you are, a web or to see people I've done about five years, I'm or so a of an independent analyst. So I'm mostly do park management strategy type. For a product, marketing.” Actual Transcription: “My name is Chad Hart. You might be familiar with me from a brand -- if you are WebRTC people; I've done webrtcHacks now for about five years or so. Outside of webrtcHacks, I have been an independent analyst. I mostly do product management and strategy type work and product marketing.” https://www.nojitter.com/post/240173958/when-speech-analytics-makes-gibberish-useful
  9. cwh.consulting Reality: transcription quality is often not so great. Actual Transcription: “My name is Chad Hart. You might be familiar with me from a brand -- if you are WebRTC people; I've done webrtcHacks now for about five years or so. Outside of webrtcHacks, I have been an independent analyst. I mostly do product management and strategy type work and product marketing.” Machine Transcription: “My name is a chat heart of you might be familiar with Dave from a brand or if you are, a web or to see people I've done about five years, I'm or so a of an independent analyst. So I'm mostly do park management strategy type. For a product, marketing.” [Callouts on the machine transcription: non-standard spelling, industry jargon, speech disfluencies, US-English language assumption] https://www.nojitter.com/post/240173958/when-speech-analytics-makes-gibberish-useful
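
To put a number on “not so great,” the standard metric is word error rate (WER): the word-level insertions, deletions, and substitutions needed to turn the machine output into the reference, divided by the reference length. A minimal sketch, using a plain edit distance and shortened versions of the two transcripts above:

```python
# Quantifying transcription quality with word error rate (WER):
# (insertions + deletions + substitutions) / reference word count.
# Plain word-level edit distance; the strings are shortened versions
# of the transcripts on the slide.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

actual = "my name is chad hart you might be familiar with me from a brand"
machine = "my name is a chat heart of you might be familiar with dave from a brand"
print(f"WER: {wer(actual, machine):.0%}")  # roughly a third of the words are wrong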
  10. cwh.consulting Higher-level speech analytics • Perfect transcription is not needed to provide useful analysis. • Higher-level speech analytics systems look for patterns in speech. • These patterns can be matched to business outcomes, such as did a caller end up purchasing or did they give a good customer satisfaction score. • There are often meaningful patterns beyond the words that were spoken – like how fast each party was speaking, or how often the agent talked compared to the customer. • There is also a lot of work going into looking at caller emotion and sentiment. Source: CallMiner
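
As a rough sketch of how such non-lexical patterns can be computed, consider talk ratio and speaking rate derived from a diarized transcript. The segment structure and field names below are assumptions for illustration, not any particular vendor's API:

```python
# Minimal sketch of "beyond the words" speech analytics on a diarized
# transcript. The Segment format (speaker label plus start/end times
# and text) is a hypothetical shape, not a specific vendor's output.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # e.g. "agent" or "customer"
    start: float   # seconds
    end: float     # seconds
    text: str

def call_metrics(segments):
    talk_time, word_count = {}, {}
    for seg in segments:
        talk_time[seg.speaker] = talk_time.get(seg.speaker, 0.0) + (seg.end - seg.start)
        word_count[seg.speaker] = word_count.get(seg.speaker, 0) + len(seg.text.split())
    total = sum(talk_time.values()) or 1.0
    return {
        # share of the conversation each party held
        "talk_ratio": {spk: t / total for spk, t in talk_time.items()},
        # speaking rate in words per minute, a common pacing signal
        "words_per_minute": {
            spk: word_count[spk] / (talk_time[spk] / 60.0) for spk in talk_time
        },
    }

segments = [
    Segment("agent", 0.0, 6.0, "Thanks for calling how can I help you today"),
    Segment("customer", 6.5, 9.0, "I want to cancel my order"),
    Segment("agent", 9.5, 20.0, "I can certainly help with that let me pull up your account"),
]
print(call_metrics(segments))
```

Signals like these feed the outcome-matching described above without any dependence on word-perfect transcription.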
  11. cwh.consulting Voicebots – Smart Speakers & Assistants • IVR replacement • Starting meetings • In-call assistance
  12. cwh.consulting Voicebots – Smart Speakers & Assistants • Another area we examined was voicebots. • These are smart speakers like the Google Home, which was recently made available in South Korea, and AI assistants like Bixby or Siri. • Building a voicebot is complex. You not only need to transcribe the speech and run some natural language understanding on it like in speech analytics, but you also need to generate speech and deal with interactivity with the customer in real time. • There is very broad interest in using these voicebots. • Every telephony device maker is interested in adding a voice user interface to their products – and this is a natural fit since people “talk” to these devices already. • Typical conference room equipment is already set up to capture good-quality audio with minimal noise from a variety of locations throughout the room using microphone arrays. • However, most companies are just starting to figure out how to use them in their products.
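
The extra moving parts are easier to see in code. Below is a toy sketch of a single voicebot turn; every stage is a stub standing in for real STT, NLU, and TTS services, and all function names are hypothetical rather than any specific platform's API:

```python
# Sketch of the pipeline a voicebot needs beyond speech analytics:
# STT and NLU alone are not enough, you must also generate a reply
# and speak it back in real time. All stages are stubbed placeholders.

def transcribe(audio: bytes) -> str:
    return "what time do you open"          # stand-in for a real STT call

def detect_intent(text: str):
    if "open" in text or "hours" in text:   # stand-in for a real NLU engine
        return "store_hours", {}
    return "fallback", {}

def fulfill(intent: str, params: dict) -> str:
    replies = {"store_hours": "We are open nine to six, Monday through Saturday.",
               "fallback": "Sorry, could you say that again?"}
    return replies[intent]

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")             # stand-in for a real TTS call

def handle_turn(mic_audio: bytes) -> bytes:
    text = transcribe(mic_audio)            # speech-to-text, as in speech analytics
    intent, params = detect_intent(text)    # natural language understanding
    reply = fulfill(intent, params)         # business logic / backend lookup
    return synthesize(reply)                # text-to-speech: unique to voicebots

print(handle_turn(b"\x00\x01"))
```

The caller is waiting on the line while each stage runs, so unlike offline speech analytics, every stage has a latency budget.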
  13. cwh.consulting Flattening the IVR: humans don’t speak in menus https://cogint.ai/dialogflow-phone-bot/ [Diagram: over time, a traditional IVR walks callers through layers of DTMF menus to reach a response, while a voicebot maps each utterance directly to an intent and response – 10 potential responses in an IVR menu hierarchy vs. a voicebot]
  14. cwh.consulting Flattening the IVR: humans don’t speak in menus • One major area where voicebots will have an impact is in IVRs. • Traditional IVRs were designed for DTMF input and are usually set up with multiple levels of menus. • Because people cannot remember more than a few menu options at a time, you cannot put too many options in each menu. • As a result, to fit many options, you need a complex menu with many layers. • Users hate this because these menus are difficult to navigate and take too long. • Voicebots help to flatten the IVR into just a few layers. • Rather than navigating a complex menu, users can just say what they want and use natural language to get the information they need. • This is good for call centers too because users are more likely to stay in the IVR instead of immediately dropping out to an operator. https://cogint.ai/dialogflow-phone-bot/
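
A toy comparison of the two models makes the flattening concrete. The menus, intents, and keyword lists here are made up for illustration; a real deployment would use an NLU engine such as Dialogflow rather than keyword matching:

```python
# Sketch contrasting a DTMF menu tree with a flat, intent-based voicebot.

# Traditional IVR: the caller walks a tree one keypress at a time.
MENU_TREE = {
    "1": {"1": "billing-balance", "2": "billing-dispute"},
    "2": {"1": "support-internet", "2": "support-tv"},
}

def ivr_navigate(keypresses):
    node = MENU_TREE
    for key in keypresses:
        node = node[key]  # each layer is another prompt the caller must sit through
    return node

# Voicebot: one utterance maps straight to an intent, no menu layers.
INTENT_KEYWORDS = {
    "billing-balance": ["balance", "how much do i owe"],
    "billing-dispute": ["wrong charge", "dispute"],
    "support-internet": ["internet", "wifi", "slow"],
    "support-tv": ["tv", "channel"],
}

def match_intent(utterance):
    utterance = utterance.lower()
    for intent, phrases in INTENT_KEYWORDS.items():
        if any(p in utterance for p in phrases):
            return intent
    return "handoff-to-agent"

print(ivr_navigate(["2", "1"]))                     # two prompts deep
print(match_intent("my wifi has been really slow")) # one utterance, same destination
```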
  15. cwh.consulting New voicebots: consumer ⇨ business [Chart: notable consumer voicebot market milestones] krankygeek.com/research KRANKY GEEK RESEARCH
  16. cwh.consulting New voicebot technology threatens IVRs [Chart: ability to offload human tasks over time, marking today]
  17. cwh.consulting Computer vision • Funny hats • Face detection • Gestures • Object detection • Emotion analysis
  18. cwh.consulting Object detection over WebRTC with TensorFlow Blog post: https://webrtchacks.com/webrtc-cv-tensorflow/ Demo video: https://youtu.be/vzTXW0hGINM • Using open source libraries and existing work, it is relatively simple to set up your own server and process real-time video without having a PhD in computer vision. • Here is an example of a server I set up to do real-time analysis of a WebRTC stream.
  19. cwh.consulting Object detection over WebRTC with TensorFlow – example architecture https://webrtchacks.com/webrtc-cv-tensorflow/ [Architecture diagram: the browser (index.html, local.js, objDetect.js) GETs web assets from a Flask server and POSTs images to it; the Flask server passes each image to TensorFlow Object Detection and returns object details] • This is just a very basic example that uses an HTTP POST to send several images per second to a cloud-based server for processing. • As you saw in the video, there can be a little bit of lag. • Using a GPU-accelerated server, or even something like Google’s TPUs, which were specifically designed to accelerate heavy machine learning graphs, would have helped. • But ultimately, streaming a high-quality image will always have its limits. • Wouldn’t it be nice if you could do the heavy processing locally with hardware acceleration, just like you can hardware-accelerate codecs like H.264?
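
For flavor, here is a stripped-down sketch of the server half of this architecture: a Flask endpoint that accepts the images objDetect.js POSTs several times a second and returns object details as JSON. The route name and response fields are simplified assumptions, and detect_objects() is a stub standing in for the real TensorFlow Object Detection model from the blog post (assumes Flask and Pillow are installed):

```python
# Minimal sketch of the Flask side of the diagram above.
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

def detect_objects(image):
    # A real implementation would run the image through a TensorFlow
    # object detection graph; here we return a fixed fake detection.
    return [{"class": "person", "score": 0.92,
             "box": [0.1, 0.2, 0.6, 0.8]}]  # normalized [ymin, xmin, ymax, xmax]

@app.route("/detect", methods=["POST"])
def detect():
    # objDetect.js grabs frames from the WebRTC <video> element onto a
    # canvas and POSTs them as multipart form data.
    image = Image.open(request.files["image"].stream)
    return jsonify(objects=detect_objects(image))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```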
  20. cwh.consulting ML processing moving to the edge, with faster, local processing • That’s exactly what you can do with some new chipsets from vendors like Intel. • This is an example of a kit from Google called the AIY Vision Kit that includes the Intel Movidius processor. • The Movidius is designed to run deep neural networks locally and is especially well-suited to low-power computer vision applications. • This kit runs on a tiny, single-core Raspberry Pi Zero with only 512MB of RAM. • Google used to sell just the Vision Bonnet add-on board for $45. Now you can buy the complete kit with the Raspberry Pi for $90 in the US. • Note that Amazon also has a computer vision kit it calls DeepLens. That runs on something more like an Intel NUC mini-PC and costs $250.
  21. cwh.consulting ML processing moving to the edge, with faster, local processing https://webrtchacks.com/aiy-vision-kit-uv4l-web-server/
  22. cwh.consulting Improvements with edge hardware (demonstration) • Let’s look at this in action. • This all runs locally on the Pi. • So in this case, I am doing the computer vision processing locally while sending the stream and annotations remotely. Blog post: https://webrtchacks.com/aiy-vision-kit-uv4l-web-server/ Video: https://youtu.be/h0O18R1rI9U
  23. cwh.consulting Fun use cases with native mobile libraries • With new native mobile libraries like Apple’s Core ML and Google’s ML Kit, this kind of processing is relatively simple to add to mobile apps. • Some of the engineers at Houseparty wrote a blog post demonstrating how to do smile detection. • Similar libraries are available that detect facial boundaries and let you put hats, sunglasses, beards, and other silly masks on people – I am sure you have seen some of these! • Similar techniques can be used in a business context to blur out backgrounds for remote workers who call into a video conference. https://webrtchacks.com/ml-kit-smile-detection/
  24. cwh.consulting ML Kit CPU consumption: high framerates are not practical (without special hardware) [Chart: CPU usage % for different framerates processed by ML Kit] https://webrtchacks.com/ml-kit-smile-detection/
  25. cwh.consulting Resource consumption: ML Kit is small compared to WebRTC https://webrtchacks.com/ml-kit-smile-detection/
  26. cwh.consulting WebRTC CV is coming to the browser https://w3c.github.io/webrtc-nv-use-cases/#funnyhats * This is from a W3C document examining use cases for the next version of WebRTC
  27. cwh.consulting RTC optimization • Noise suppression • Echo cancellation • Error correction • Route optimization
  28. cwh.consulting Mozilla RNNoise – real-time, low-power noise suppression with deep learning • One example is a research project from Mozilla that uses deep learning to provide better real-time noise suppression. • This is designed for lower-power devices and does not require any specialized hardware. • We do not have time now, but you can go to that link and try some demos. • Unfortunately this was just a research project, but it gives you some idea of what could be done in this and other areas. https://people.xiph.org/~jm/demo/rnnoise/
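
If you want to experiment beyond the browser demo, the project ships as a small C library. A minimal ctypes sketch of calling it from Python follows; it assumes the API of the original research release (rnnoise_create with no arguments, in-place processing of 480-sample float frames at 48 kHz) and a library path of librnnoise.so, both of which may differ on your system or in later versions:

```python
# Sketch: driving librnnoise from Python via ctypes. API details are
# assumptions based on the original release; check rnnoise.h for yours.
import ctypes

FRAME_SIZE = 480  # 10 ms at 48 kHz

lib = ctypes.CDLL("librnnoise.so")          # path assumption
lib.rnnoise_create.restype = ctypes.c_void_p
lib.rnnoise_process_frame.restype = ctypes.c_float

state = lib.rnnoise_create()

def denoise_frame(samples):
    """Denoise one 480-sample frame in place; returns (vad_prob, samples)."""
    buf = (ctypes.c_float * FRAME_SIZE)(*samples)
    vad = lib.rnnoise_process_frame(ctypes.c_void_p(state), buf, buf)
    return vad, list(buf)

vad, clean = denoise_frame([0.0] * FRAME_SIZE)
print("voice probability:", vad)

lib.rnnoise_destroy(ctypes.c_void_p(state))
```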
  29. cwh.consulting Special discount for RTC Korea Use code RTC-KOREA until November 7 for $1000.00 off at krankygeek.com/research, or email me to purchase.
  30. cwh.consulting Questions?
  31. cwh.consulting About Me: Chad Hart, Analyst & Product Consultant – https://cwh.consulting @chadwallacehart chad@cwh.consulting • A blog for WebRTC developers: webrtcHacks.com @webrtcHacks • AI & RTC blog: cogint.ai @cogintai • WebRTC and ML for Developer Event, November 16, 2018 in San Francisco: krankygeek.com

Editor's Notes

  • As a quick background, my name is Chad Hart.
    I am an analyst and consultant focused on real time communications products and services
    Some of you may be familiar with webrtcHacks – a blog I have run since 2013 that aims to provide useful content for WebRTC developers
    I also recently launched a blog to specifically explore topics related to AI, Machine Learning and RTC. You can check that out at cogint.ai
    Lastly, I also help to run the Kranky Geek series of events with the help of Google and other sponsors like Intel, Nexmo and Agora.
    We hold an event every year in San Francisco.
    This year we will also be focusing on the AI in RTC topics with many great talks from companies like Facebook, Microsoft, IBM and many more.
  • The AI in RTC topic has been a major focus of mine.
    I recently came off a long-term project where I ran a new product incubator group that launched a speech analytics service inside a telco.
    I could see speech analytics and other machine-learning based technologies were starting to intersect with real time communications.
    To understand this better I teamed up with Tsahi Levent-Levi of BlogGeek.me, another WebRTC analyst many of you know, to write a research report on this topic.
    We covered more than 125 vendors, ran an industry survey, and had 1-on-1 conversations with 40 vendors.
  • So what is AI in RTC?
    I am not talking about science fiction robots making phone calls
    I am going to talk about how modern machine learning techniques can be used to improve and expand real time communications.
  • We saw 4 major categories of use cases
    Speech analytics
    voice bots
    computer vision,
    And using Machine Learning (ML) to optimize lower-level RTC protocols and networks
  • By far the most common use case was speech analytics
    There is a broad range of use cases that range from providing transcription on conference calls to providing real time agent coaching based on what the customer is saying in the call center.
  • Speech transcription – also known as ASR or Speech-to-text (STT)
    Has made a lot of improvements over the past couple of years thanks to deep learning techniques.
    Many vendors now claim they are at human-levels of accuracy.
  • The reality is that transcription still has a number of challenges.
    The example here shows a transcription where I was introducing myself.
    As you can see – the machine transcription did not do such a great job.
  • This specific example is probably worse than average, but not uncommon.
    The first major challenge is getting languages and dialects correct.
    I am sure that this is a big struggle for this audience as you deal with STT technologies made outside of Korea.
    I am lucky that English, and particularly American English, is by far the best supported language.
    Many vendors also have support for many dialects of English, such as British, Australian, and Indian accents.
    You will find much more limited support for Korean.
    I do not think I have seen any major international vendor support specific Korean dialects.
    Fortunately this is improving and newer algorithms require less training data, so it is becoming easier to build support for new languages.
    Non-standard spellings and specific industry jargon that does not appear in the dictionary like “WebRTC” is also a challenge.
    Most systems now have techniques that let you specify a custom vocabulary to correct these.
  • It is also important to note that perfect transcription is not needed to provide useful analysis.
    Higher-level speech analytics systems look for patterns in speech.
    These patterns can be matched to business outcomes, such as did a caller end up purchasing or did they give a good customer satisfaction score.
    There are often meaningful patterns beyond the words that were spoken – like how fast each party was speaking, or how often the agent talked compared to the customer.
    There is also a lot of work going into looking at caller emotion and sentiment.
  • Another area we examined was voicebots.
    These are smart speakers like the Google Home, which was recently made available in South Korea (https://voicebot.ai/2018/09/11/google-home-arriving-in-south-korean-on-september-18-pre-orders-start-today/)
    And AI assistants like Bixby or Siri.
    Building a voicebot is complex. You not only need to transcribe the speech and run some natural language understanding on it like in speech analytics, but you also need to generate speech and deal with interactivity with the customer in real time.
    There is very broad interest in using these voicebots.
    Every telephony device maker is interested in adding a voice user interface to their products – and this is a natural fit since people “talk” to these devices already.
    Typical conference room equipment is already set up to capture good-quality audio with minimal noise from a variety of locations throughout the room using microphone arrays.
    However, most companies are just starting to figure out how to use them in their products.
  • One major area where voicebots will have an impact is in IVRs.
    Traditional IVRs were designed for DTMF input and are usually set up with multiple levels of menus.
    Because people cannot remember more than a few menu options at a time, you cannot put too many options in each menu.
    As a result, to fit many options, you need a complex menu with many layers.
    Users hate this because these menus are difficult to navigate and take too long.
    Voicebots help to flatten the IVR into just a few layers.
    Rather than navigating a complex menu, users can just say what they want and use natural language to get the information they need.
    This is good for call centers too because users are more likely to stay in the IVR instead of immediately dropping out to an operator.
  • Actually, many advanced IVR systems like those sold by companies like Nuance, Aspect, and Genesys already have natural language inputs and responses.
    One big change here is the growth of the consumer voicebot market.
    As this technology has matured, these solutions are now being targeted at business telephony use cases, not just consumers.
    For example, IBM launched a voice gateway option for its Watson assistant.
    Amazon is integrating its natural language engine called Lex into Amazon Connect, its contact center solution.
    Microsoft’s language processing platform is called LUIS and it has a bot-builder framework that can use this to integrate into the consumer Skype and Skype for business.
    Just this summer, Google launched its contact center AI initiative where it has partnered with many major communications providers and vendors.
    As part of Google’s solution, they are looking to penetrate call centers by using Dialogflow, their natural language understanding engine, and are using other tools to help agents answer questions more quickly.
  • Existing IVR technology that incorporates natural language tends to be very expensive.
    Big vendors like Amazon, Google, and Microsoft are adapting technologies they built for the much larger consumer market and applying that to business use cases at much lower costs, often with better performance.
    One of Google’s customers, Marks and Spencer, commented that they were able to save the equivalent of 100 full-time employees using this technology across their call center.
  • The last area I would like to discuss is computer vision.
    This domain already had a lot of usage in consumer applications and is just starting to find some business use cases.
    There are many application areas, including counting people, identifying faces, using gestures for controls, and even augmented reality.
  • Using open source libraries and existing work, without having a PhD in computer vision it is relatively simple to setup your own server and process real time video.
    Here is an example of a server I setup to do real time analysis of a WebRTC stream.
  • This is just a very basic example that uses an HTTP post to send several images per second to a cloud-based server for processing.
    As you saw in the video, there can be a little bit of lag.
    Using a GPU-accelerated server or even something like Google’s TPU that were specifically designed to accelerate heavy machine learning graphs would have helped
    But ultimately streaming a high-quality image can always have its limits.
    Wouldn’t it be nice if you do the heavy processing locally with hardware acceleration, just like you can hardware accelerate codecs like H.264?
  • That’s exactly what you can do with some new chipsets from vendors like Intel.
    This is an example of a kit from Google called the AIY Vision Kit that includes the Intel Movidius processor.
    The Movidius is designed to run deep neural networks locally and is especially well-suited to low-power computer vision applications.
    This kit runs on a tiny, single-core Raspberry Pi Zero with only 512MB of RAM.
    Google used to sell just the Vision Bonnet add-on board for $45. Now you can buy the complete kit with the Raspberry Pi for $90 in the US.
    Note that Amazon also has a computer vision kit it calls DeepLens. That runs on something more like an Intel NUC mini-PC and costs $250.
  • Let’s look at this in action
    This all runs locally on the Pi.
    So in this case, I am doing the computer vision processing locally while sending the stream and annotations remotely
  • With new native mobile libraries like Apple’s CoreML and Google’s ML Kit, it is relatively simple.
    Some of the engineers at Houseparty wrote a blog post demonstrating how to do smile detection
    Similar libraries are available that detect facial boundaries and let you put hats, sunglasses, beards, and other silly masks on people – I am sure you have seen some of these!
    Similar techniques can be used in a business context to blur out backgrounds for remote workers who call into a video conference.
  • The last area is RTC optimization.
    There are many opportunities to use machine learning to improve bandwidth estimation, echo cancellation, and perform better error correction.
    We were very surprised that there has been relatively little investment made here.
  • One example is a research project from Mozilla that uses Deep Learning to provide better real-time noise suppression.
    This is designed for lower power devices and does not require any specialized hardware.
    We do not have time now, but you can go to that link and try some demos.
    It is pretty neat.
    Unfortunately this was just a research project, but it gives you some idea of what could be done in this and other areas.
  • Before I take questions, I did want to mention we have a special discount code for RTC Korea attendees.
    If you are interested in seeing our full 147-page report, you can use that code for a big discount.