Johan Schalkwyk, a principal engineer at Google, discusses "augmented humanity" through the use of a smartphone. Presented at the "Cutting-Edge Technology Showcase" organized by the New York Technology Council.
Good morning. My name is Johan Schalkwyk. I manage the mobile speech recognition efforts at Google. Our vision for mobile is very ambitious, and it says a lot about where Google thinks the future of computing lies. In several key areas, mobile is leading the innovation.
Behind this vision is a concept familiar to fans of science fiction, something people have been talking about for many years: artificial intelligence. When we think about the term artificial intelligence, we immediately think of human attributes that we try to imprint onto the machine. When I was about 10 years old I started programming computers. One of my first programs, written in BASIC, was called Baby. Ever since I can remember, I have wanted to teach computers how to talk and how to think.
But now at Google we are looking at artificial intelligence as a way to augment intelligence. Using a mobile phone, we can extend our senses and capabilities. It’s about giving you “super senses”: the ability to navigate the world using your phone and to access the world’s information at any time just by using your voice. It’s about giving you super-memory. Artificial intelligence now becomes the driver for augmenting our humanity.
As a search engine, we are very interested in doing this. At its heart, search is about helping people make sense of the world, to find things when they need them and to put more knowledge in their hands. Google’s motto has always been to organize the world’s information and make it universally accessible. But the world’s information is more than what’s on the web. When you’re in the physical world, there are many questions that need answering. How do I get to the airport? What’s the name of that monument? How do I translate this menu written in Japanese?
This is exactly where mobile becomes such an important part of this vision. Mobile is the best platform for bringing this vision to fruition. Sensors on the phone can enhance your senses. Our smartphones are connected to the cloud and have access to the data in the cloud.
To make this more concrete, here are several examples of technologies we are working on today that reach their full potential when accessed from your smartphone and quite literally enhance your senses. (Read examples out.)
Over the past few years, multiple technologies have come together to create this vision of augmented humanity. At Google we’re investing heavily in research that draws not only on the power of the cloud but also on what the search engine has taught us about the world’s information: web pages, images, videos, voice searches and maps, all represented in the different languages of the world. This is what sets Google’s approach apart from the past. When we bring these three things together (cloud computing, the scale of the world’s information, and research in areas like speech recognition, machine translation and computer vision), your phone is suddenly much more than a phone. It is a supercomputer in your pocket.
My background is in speech recognition, so let’s start with a couple of demonstrations of how we can use voice on your mobile phone to enhance your capabilities in a natural way. (Demos: Congrats to Japan; searching for a map; sending a message. Voice Actions has launched only in the US so far and is coming to other languages soon.) What just happened?
The human brain is uniquely, mysteriously able to parse spoken language. It’s an ability so natural that it’s invisible, but when it comes to teaching computers to understand speech, you start to realize how complicated these processes are in our own brains. A recognizer is built from statistical models of the basic components of human language. First we need a model of the basic sounds of the language. We call these the phonemes. My mother tongue is Afrikaans, and one of my favorite sounds is the trilled r, as in re^rig. Other interesting sounds are the clicks: for example, the word Xhosa is pronounced with a click of the tongue. Then we use the sounds to form words, like Tyrannosaurus rex. We call this the pronunciation model. Finally, words are combined into sentences according to the grammar of the language. We refer to this as the language model.
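As a rough illustration of how a pronunciation model connects phonemes to words, here is a minimal Python sketch. It is not Google's recognizer: the tiny lexicon, the ARPAbet-style phoneme symbols and the `segment` function are all assumptions made up for this example, and a real system would score many competing hypotheses statistically rather than require exact matches.

```python
from functools import lru_cache

# Hypothetical pronunciation model: each word maps to a phoneme sequence.
# The symbols are ARPAbet-like and purely illustrative.
LEXICON = {
    "voice": ("V", "OY", "S"),
    "search": ("S", "ER", "CH"),
    "map": ("M", "AE", "P"),
}


def segment(phonemes):
    """Return every word sequence whose pronunciations spell out `phonemes`."""
    phonemes = tuple(phonemes)

    @lru_cache(maxsize=None)
    def helper(start):
        # Reaching the end of the input means one complete segmentation.
        if start == len(phonemes):
            return ((),)
        results = []
        for word, pron in LEXICON.items():
            end = start + len(pron)
            if phonemes[start:end] == pron:
                for rest in helper(end):
                    results.append((word,) + rest)
        return tuple(results)

    return [list(words) for words in helper(0)]


if __name__ == "__main__":
    print(segment(["V", "OY", "S", "S", "ER", "CH"]))
```

A phoneme sequence that no combination of lexicon entries can spell out yields an empty list, which is the toy version of a word the recognizer has never been taught to pronounce.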
With the large amounts of data we have available, we can train large statistical models for each of the components required to teach a computer to listen. The voice searches people do every day are continually fed back into the system so that it learns how to model the sounds of the language. From the queries people type into google.com, we learn the grammar, or language model, of people searching on google.com. The reason we need data is that we need to predict what people are likely to say. For example, the query "slum dog millionaire" was very unlikely before the movie came out. If our models did not contain this data, we would never be able to recognize these words.
So I’ve just talked about how we build large-scale statistical models for teaching computers to understand speech. Now your phone has ears, and you can search google.com naturally just by using your voice. I’d like to invite Josh Estelle from the Translate team to show how translation is augmenting our humanity.
With cloud and mobile, we see that when these technologies are combined, the whole is far greater than the sum of its parts. Today we truly have a supercomputer that fits in your pocket.
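The point about needing data to predict what people will say can be sketched with a toy bigram language model. This is only an illustration under stated assumptions: the query logs are invented, the function names `train_bigram_model` and `query_probability` are hypothetical, and the model deliberately uses no smoothing so that an unseen word sequence gets probability zero.

```python
from collections import Counter


def train_bigram_model(queries):
    """Count word contexts and word pairs in a list of typed queries."""
    unigrams = Counter()
    bigrams = Counter()
    for query in queries:
        words = ["<s>"] + query.lower().split() + ["</s>"]
        unigrams.update(words[:-1])           # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def query_probability(query, model):
    """Unsmoothed bigram probability of a query; 0.0 if any step is unseen."""
    unigrams, bigrams = model
    words = ["<s>"] + query.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


if __name__ == "__main__":
    # Invented query logs standing in for "before" and "after" the movie.
    before = ["dog park", "millionaire quiz show"]
    after = before + ["slum dog millionaire"]
    print(query_probability("slum dog millionaire", train_bigram_model(before)))
    print(query_probability("slum dog millionaire", train_bigram_model(after)))
```

With the "before" log the query scores zero, so a recognizer relying on this model could never output it; once the query appears in the training data, it receives a nonzero probability. Production systems soften this with smoothing and far larger models, but the dependence on fresh data is the same.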
Mobile: A New Era of Augmented Humanity
<ul><li>Mobile: a new era of augmented humanity</li></ul><ul><li>Johan Schalkwyk, principal engineer, Google</li></ul>
Artificial intelligence…as imagined yesterday
Today: augmented humanity
Search your entire world. At its heart, search is about helping people make sense of the world, to find things when they need them and to put more knowledge in their hands.
The smartphone
Augmented humanity…on mobile <ul><li>Machine translation</li></ul><ul><li>Computer vision</li></ul><ul><li>Navigation</li></ul><ul><li>Speech recognition</li></ul>
Data and the cloud
<ul><li>Demos</li></ul>
Teaching computers how to listen: “What is a phoneme?” “How do you pronounce ty-ra-ni-so-rus?” “Do I really need to learn grammar?”
Teaching computers how to listen (data)
Augmented humanity…on mobile <ul><li>Machine translation</li></ul><ul><li>Computer vision</li></ul><ul><li>Speech recognition</li></ul>
<ul><li>Mobile Translate demo</li></ul>
Translation and machine learning
Machine translation on mobile
<ul><li>Computer Vision demo</li></ul>
Computer vision on mobile