Join Brian Morin, CMO, and Phillip Fisher, CX Consultant, as they discuss how IVAs (intelligent virtual assistants) powered by advanced speech recognition and natural language processing are replacing the traditional IVR, greeting callers with “How can I help you today?” instead of “Press 1” and helping customers self-serve common requests.
We deliver AI-powered virtual agents as a service. That means we deliver the full conversational AI technology stack. It’s turnkey. It’s omnichannel. All of our clients use our voice self-service module. About half rely on us for more than voice and include their chat and SMS channels as well.
But what makes us a little different is that we’re not just trying to sell a software tool set and wish you good luck on your journey. Conversations with machines are complex. They need experts. So we bundle end-to-end CX services with our technology. And when I say end-to-end, that means everything – the design, the build, and even the ongoing operation after go-live because it requires care and feeding. So at the end of the day, we’re really stepping in more as a partner instead of just a technology provider. That makes us responsible for delivering the CX that was promised and the ROI that was promised.
We’d like to think that approach is working for us. We operate the AI-powered CX for more than 100 brands and are currently the top-rated conversational solution on Gartner Peer Insights. So if you’re interested in what others have to say about us, those reviews are a good place to start.
Onscreen is the experience we want you to imagine: an experience that is personalized and predictive, where everything happens in natural language with AI that sounds like a human, can read and record data like a human, and can take cognitive action like a human.
This means you can go beyond automating simple call types like status or balance and automate complex conversations that have multiple back-and-forth exchanges. We do a lot of scheduling, reservations, and complex things like emergency roadside assistance, warranties, returns, and claims. As long as the interaction is somewhat linear in progression, it can be automated.
When a customer calls in, they are starting from a place of tension. You want them to self-serve, but they’re going to immediately zero out to wait on hold for an agent unless you’re able to showcase enough intelligence right away to win their confidence to attempt self-service. This means being personalized and predictive, sounding like a human, giving before taking, and accurately understanding their intent from the first open-ended question.
Brian
Mark is in the final stages of delivering this application to some auto dealerships. I think it’s one everyone can identify with: calling into a dealership to schedule an appointment for some kind of service.
Mark, is there any setup you need to give here before kickstarting the demo?
Brian
Dan
Dan
Brian
OK, so let’s step beyond the technology conversation and into the humans required to actually run it, and open up this black box a little bit. If you were to attempt voice automation on your own, particularly natural language automation, that outer ring shows all the jobs or functions where you need an expert in that field filling the role. Ultimately, this is why we deliver our technology as a service: delivering a great voice experience is complex.
The big thing to understand is that Conversational AI is not an off-the-shelf product – it’s merely a tool set. Contrast that with touchtone IVR. In a matter of a few short days, you design, build, and POOF - you're done. You may not even touch it again. Conversational AI is very different. It's a solution that requires ongoing care and feeding. In fact, once you "go live," you've only started.
There is nothing easy about conversations with machines. To do voice well, you need real human bodies who are experts in AI-powered CX and committed to the ongoing process of perpetual improvement. This means obsessing over the CX and scrutinizing containment across every interaction to identify points of friction to iterate and improve week by week.
[OPTIONAL IF TIME ALLOWS] So what does that actually mean? It means a team of both developers and trained CX professionals who are experts at tuning and customizing the application and underlying technologies, like speech recognition and natural language processing, to your customer-specific criteria. This means daily monitoring and reporting to troubleshoot for friction and containment, expanding grammars, widening guardrails, finding new data sources, listening to call recordings, tweaking language acoustic models, and QA’ing any change. We are constantly looking for opportunities to tune the application or language models and improve the experience, and you need real human bodies committed to that process.
Brian
We’re talking about conversational AI that’s purpose-built for telephony and purpose-built for limited grammar use cases. I’m going to explain in a minute why that’s so important to having a good voice experience.
Speech rec over telephony is really hard to do well. That might not surprise anyone considering the poor experiences we’ve all had. But you may not have understood why and what the bleeding edge of AI is doing to solve for it.
It’s very different from speaking directly into your phone or home device, which is a high-def experience. That’s why the speech rec on your phone is so good: you are capturing all the highs and lows at the device level, which makes it easy to distinguish utterances and relate those to letters, syllables, and words. But the moment you call into a customer service line and those sound waves travel over outdated telephony infrastructure, the resolution is reduced to 8 kHz in most cases, cutting out the highs and lows by more than half. And if that wasn’t bad enough, it adds noise. This is why conversational AI over telephony is a really difficult challenge. It’s also why transcription-based engines like Google’s or Amazon’s don’t deliver a good enough customer experience at the contact center level, because they are now 50% less accurate. So you need AI that is purpose-built for this kind of challenge.
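To put rough numbers on that, here is a small sketch. It assumes the figure above refers to an 8 kHz sampling rate, and it assumes a 16 kHz rate for on-device capture purely for comparison (the real device rate may be higher). A sampled signal can only carry frequencies up to half its sampling rate, so the phone line keeps roughly half the frequency range of that assumed device capture.

```python
def nyquist_limit_hz(sample_rate_hz: int) -> float:
    """Highest frequency a sampled signal can represent (half the sample rate)."""
    return sample_rate_hz / 2


TELEPHONY_RATE_HZ = 8_000   # assumed meaning of the "8K" telephony figure
DEVICE_RATE_HZ = 16_000     # assumed on-device capture rate, for comparison only

for label, rate in [("telephony", TELEPHONY_RATE_HZ), ("device", DEVICE_RATE_HZ)]:
    limit = nyquist_limit_hz(rate)
    print(f"{label}: {rate} Hz sampling keeps audio up to {limit:.0f} Hz")
```

Everything above the telephony limit never reaches the speech engine, which is one plausible reason soft consonants get lost on a phone call.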
I should note we have found that Google will perform the best in certain customer service use cases, and I’ll explain where and why in a minute.
Due to these challenges, the transcription from speech-to-text engines is wrong more often than you think. Here’s an example from AAA. Our virtual agents handle their emergency roadside assistance calls, which often come in under the worst conditions possible: callers are outside their vehicle, in the wind, with traffic noise, on speakerphone. There’s no way we could deliver the accuracy that we do for them if we relied only on speech recognition, because it is wrong so often.
I’ll play this call, then we’ll come back to the play-by-play on what the speech rec heard and what our NLU engine did to make sure it was a successful call. [play call]
If you were watching closely, you saw that the speech-to-text transcription was wrong. It transcribed very literally what it heard. If you recall, when he tried to say the word “Ford,” the “F” was not even audible. What we heard, and what the speech-to-text heard, was “Ord.” Since “Ord” isn’t a word, it transcribed the closest thing it could find, which was “Aboard.” Since “F” isn’t a word, it transcribed that as “have,” and the “250” was transcribed as “to” and the word “fifty.”
If the speech to text was so far off, how were we able to get it right?
Here’s where our secret sauce comes in, the thing we do uniquely to really raise accuracy in voice. In customer service you can predict what the caller might say in response to a question. In this case, we knew we were listening for vehicle names, so we were able to program our NLU engine to only listen for vehicle names. And here’s what’s even more important: it also listens for anything that sounds even remotely similar to a vehicle name, because speech rec is never 100% right.
So even though the speech rec engine was wrong, the NLU engine was able to flag that the transcription didn’t match any vehicle names. And by pattern matching the language acoustic models from what it heard against what we were listening for, the NLU was able to determine Ford F250 was the closest match.
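For anyone who wants to see the idea in miniature, here is a rough sketch of matching a wrong transcript against a short list of expected answers. It is not our NLU engine: plain character similarity from Python's standard `difflib` stands in for acoustic matching, and the `VEHICLE_GRAMMAR` table, its spoken forms, and the 0.6 threshold are all made up for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical grammar: display name -> how a caller would likely say it.
VEHICLE_GRAMMAR = {
    "Ford F150": "ford f one fifty",
    "Ford F250": "ford f two fifty",
    "Honda Accord": "honda accord",
    "Toyota Camry": "toyota camry",
}


def match_vehicle(transcript: str, threshold: float = 0.6) -> Optional[str]:
    """Return the grammar entry most similar to the transcript, or None."""
    heard = transcript.lower().strip()
    best_name, best_score = None, 0.0
    for name, spoken in VEHICLE_GRAMMAR.items():
        # Character-level similarity, a stand-in for acoustic similarity.
        score = SequenceMatcher(None, heard, spoken).ratio()
        if score > best_score:
            best_name, best_score = name, score
    # Reject weak matches so unrelated speech is not forced onto a vehicle.
    return best_name if best_score >= threshold else None


print(match_vehicle("aboard have to fifty"))
```

The point of the sketch is the narrowing: because the only candidates are vehicle names, a transcript that is wrong word for word can still land on the right answer, and anything that resembles none of them is rejected instead of guessed.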
A lot of voice platforms are relying on the accuracy of their speech rec and that’s it. And the reason isn’t necessarily a lack of know-how but rather that they are trying to be a software platform or voice API that has to be all things to all people. Conversely, our position is that speech rec by itself isn’t good enough. That’s why we bundle services with the technology, so we can tailor the AI and tailor the experience to your business question by question across every interaction. In our opinion, the experience over voice just isn’t good enough unless it’s augmented by this level of customization.
The best way to think of this is as conversational AI that is purpose-built for the contact center, because only customer service asks questions that have a specific range of answers you know you’re going to get. So as long as you know what those grammars are, you can narrow the aperture of what you’re listening for and tune for only those grammars or anything that sounds even remotely similar to one of those grammars. And that’s what really drives up the accuracy.
So let’s talk about a use case where the expected response from a customer is much wider than yes/no. Address Capture is a really good example. We do Address Capture for a lot of clients (like Designer Shoe Warehouse and Choice Hotels), but the only reason we can do it (and we do it very well) is because we can match against street names as long as we know the ZIP code. As you can see onscreen, when we get the ZIP code from a customer, we’re able to do a data dip to pull up street names to match against. This is what gives that really high accuracy. If we didn’t have pattern matching ability, we would default to a transcription-based approach like Google’s for something like this.
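Here is a minimal sketch of that flow, under stated assumptions: the `STREETS_BY_ZIP` table is a made-up stand-in for the data dip, the ZIP code and street names are sample data, and `difflib.get_close_matches` stands in for our pattern matching.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical lookup table; a real system would query an address data source.
STREETS_BY_ZIP = {
    "43215": ["High Street", "Broad Street", "Front Street", "Long Street"],
}


def capture_street(zip_code: str, transcript: str, cutoff: float = 0.75) -> Optional[str]:
    """Match a transcribed street name against the streets known for a ZIP code."""
    streets = STREETS_BY_ZIP.get(zip_code)
    if not streets:
        return None  # no candidate list, so nothing to match against
    by_lower = {street.lower(): street for street in streets}
    hits = get_close_matches(transcript.lower().strip(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


print(capture_street("43215", "hi street"))
```

Knowing the ZIP code first is what shrinks the candidate list from every street in the country to a handful, which is why the order of the questions matters.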
We do alphanumeric capture for certain use cases, and we are the only company doing alphanumeric capture: model names and serial numbers for product registration, VINs, and policy numbers for insurance. We don’t like alphanumeric capture unless there’s a defined scope or pattern we can match against. The cases I mentioned have that, which is why we’re the only player doing it. In any given sequence, we’re not looking for all letters of the alphabet, only a select few, so we can weight whatever we hear against what we know we’re listening for.
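A small sketch of what weighting against a known pattern can look like. Everything here is hypothetical: the serial format (two letters from A or B, then a digit), the per-position candidate lists, and their confidence scores are invented to show the idea, not taken from a real engine.

```python
import string
from typing import List, Optional, Set, Tuple

# One position's candidates from the recognizer: (character, confidence).
Hypothesis = List[Tuple[str, float]]


def resolve_sequence(hypotheses: List[Hypothesis], allowed: List[Set[str]]) -> Optional[str]:
    """Pick, per position, the highest-scoring candidate the format allows."""
    if len(hypotheses) != len(allowed):
        return None
    resolved = []
    for candidates, allowed_chars in zip(hypotheses, allowed):
        viable = [(ch, score) for ch, score in candidates if ch in allowed_chars]
        if not viable:
            return None  # nothing heard fits this position; re-prompt the caller
        resolved.append(max(viable, key=lambda pair: pair[1])[0])
    return "".join(resolved)


LETTERS = set("AB")            # hypothetical: only A or B can appear here
DIGITS = set(string.digits)
SERIAL_FORMAT = [LETTERS, LETTERS, DIGITS]

heard = [
    [("8", 0.6), ("A", 0.4)],              # "A" misheard as "8"
    [("B", 0.5), ("D", 0.3), ("3", 0.2)],
    [("B", 0.55), ("3", 0.45)],            # "3" misheard as "B"
]
print(resolve_sequence(heard, SERIAL_FORMAT))
```

In the first and last positions the top-scoring guess is wrong, but it is not a legal character for that slot, so the lower-scoring legal one wins.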
We capture the make and model of vehicles for AAA emergency roadside assistance (we also do it for dealerships, which you just heard). That is not a narrow scope. The aperture of makes and models is so wide that you would typically need a transcription-based engine, which, frankly, wouldn’t work very well. But since we can pattern match against a database of makes and models, we’re able to do it far better than you can get from a Google or Nuance or similar engine.