AT&T 2012 DevLab Speech API Deep Dive



Talk given at the 2012 DevLab about AT&T's Speech API.


  1. 09.25.2012
  2. September 25, 2012
      AT&T SPEECH API DEEP DIVE
      Michael Owens (@mko on Twitter, mowens on GitHub)
      Jay Lieske (jayatyp on GitHub)
      AT&T Developer Program
      © 2012 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.
  3. WHAT IS THE AT&T SPEECH API?
  4. How the AT&T Speech API Works
  5. Powered by AT&T WATSON℠
      • More than 20 years of development
      • Optimized for different usage scenarios:
         • Web Search
         • Business Search
         • Question & Answer
         • Voicemail-to-Text
         • Short Message (SMS)
         • TV Search/Remote (U-Verse)
         • Generic Speech-to-Text
  6. Simple Speech-to-Text
      • One REST endpoint
      • Accepts audio in WAV or AMR format
      • Structured JSON response containing:
         • The text spoken by the user
         • Metrics to evaluate recognition quality
      • AT&T native SDKs for Android and iOS handle audio capture and streaming
  7. Apps in the Wild: AT&T Translator, Speak4it, U-Verse Easy Remote
  8. GETTING STARTED WITH THE AT&T SPEECH API
  9. Sign Up for API Access
      • Free API access for DevLab attendees
      • Detailed instructions in your attendee packet
      • Sign up with code "APILAB12"
      • AT&T staff are on hand to answer questions and help you get set up
  10. Before You Code
      • Get your API keys from the AT&T Developer Portal:
         • Client ID ("API Key" on the AT&T Developer Portal)
         • Client Secret ("Secret Key" on the AT&T Developer Portal)
      • OAuth 2.0 client_credentials grant type
      • OAuth 2.0 access_token
      • Audio file types:
         • AMR: narrowband, 12.2 kbit/s, 8 kHz sampling
         • WAV: 16-bit PCM, single channel, 8 kHz sampling
      • Audio file length:
         • Voicemail: 4 minutes or less
         • Other: 1 minute or less
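The WAV limits above can be checked on the server before anything is uploaded. The following is a minimal sketch, assuming the audio is a WAV file held in a Node.js Buffer; the helpers `inspectWav`, `checkWavForSpeechApi` and `makeSilentWav` are hypothetical (they are not part of Watson.js or the AT&T SDKs), AMR input is not covered, and the limits are simply the ones listed on the slide.

```javascript
// Hypothetical helpers; the limits are copied from the "Before You Code" slide.
const REQUIRED = { audioFormat: 1, channels: 1, sampleRate: 8000, bitsPerSample: 16 };
const MAX_SECONDS = { voicemail: 4 * 60, other: 60 };

// Walk the RIFF chunks of a WAV buffer and report its format and duration.
function inspectWav(buf) {
  if (buf.length < 12 || buf.toString("ascii", 0, 4) !== "RIFF" ||
      buf.toString("ascii", 8, 12) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  let fmt = null;
  let dataBytes = null;
  let offset = 12;
  // Each chunk: 4-byte ASCII id, 4-byte little-endian size, then the payload.
  while (offset + 8 <= buf.length) {
    const id = buf.toString("ascii", offset, offset + 4);
    const size = buf.readUInt32LE(offset + 4);
    const body = offset + 8;
    if (id === "fmt ") {
      fmt = {
        audioFormat: buf.readUInt16LE(body),      // 1 = uncompressed PCM
        channels: buf.readUInt16LE(body + 2),
        sampleRate: buf.readUInt32LE(body + 4),
        byteRate: buf.readUInt32LE(body + 8),     // bytes of audio per second
        bitsPerSample: buf.readUInt16LE(body + 14),
      };
    } else if (id === "data") {
      dataBytes = size;
    }
    offset = body + size + (size % 2); // chunk payloads are padded to an even length
  }
  if (!fmt || dataBytes === null) throw new Error("missing fmt or data chunk");
  return { ...fmt, seconds: dataBytes / fmt.byteRate };
}

// Return the names of the properties that violate the limits (empty array = OK).
function checkWavForSpeechApi(buf, { voicemail = false } = {}) {
  const info = inspectWav(buf);
  const problems = Object.keys(REQUIRED).filter((key) => info[key] !== REQUIRED[key]);
  if (info.seconds > (voicemail ? MAX_SECONDS.voicemail : MAX_SECONDS.other)) {
    problems.push("duration");
  }
  return problems;
}

// Build a silent 16-bit PCM WAV with a canonical 44-byte header (useful for testing).
function makeSilentWav(seconds, sampleRate = 8000, channels = 1) {
  const blockAlign = channels * 2; // 2 bytes per 16-bit sample
  const dataSize = Math.round(seconds * sampleRate) * blockAlign;
  const buf = Buffer.alloc(44 + dataSize); // zero-filled samples are silence
  buf.write("RIFF", 0);
  buf.writeUInt32LE(36 + dataSize, 4);
  buf.write("WAVE", 8);
  buf.write("fmt ", 12);
  buf.writeUInt32LE(16, 16);                       // fmt chunk size for PCM
  buf.writeUInt16LE(1, 20);                        // audio format: PCM
  buf.writeUInt16LE(channels, 22);
  buf.writeUInt32LE(sampleRate, 24);
  buf.writeUInt32LE(sampleRate * blockAlign, 28);  // byte rate
  buf.writeUInt16LE(blockAlign, 32);
  buf.writeUInt16LE(16, 34);                       // bits per sample
  buf.write("data", 36);
  buf.writeUInt32LE(dataSize, 40);
  return buf;
}
```

For a conforming recording, `checkWavForSpeechApi(fs.readFileSync("audio.wav"))` returns an empty array; otherwise it names the failing properties, for example `["sampleRate"]` for a 16 kHz file.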
  11. Step 1: Connect via OAuth
      Request Method: POST
      Request URL:
      Request Headers:
         Content-Type: application/x-www-form-urlencoded
      Request Body:
         client_id=ATT_API_CLIENT_ID&client_secret=ATT_API_CLIENT_SECRET&grant_type=client_credentials&scope=SPEECH
      Response Body:
         { "access_token": "xxyz123" }
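The token request above is an OAuth 2.0 client credentials grant. A minimal sketch, assuming Node.js 18 or later for the global `fetch`; `buildTokenRequestBody` and `requestAccessToken` are hypothetical names, and `tokenUrl` is a parameter because the request URL is not reproduced on the slide.

```javascript
// Build the application/x-www-form-urlencoded body shown on the slide.
function buildTokenRequestBody(clientId, clientSecret) {
  return new URLSearchParams({
    client_id: clientId,
    client_secret: clientSecret,
    grant_type: "client_credentials",
    scope: "SPEECH",
  }).toString();
}

// POST the credentials and resolve with the access_token field of the JSON reply.
// Assumes Node.js 18+ (global fetch); tokenUrl must be supplied by the caller.
async function requestAccessToken(tokenUrl, clientId, clientSecret) {
  const res = await fetch(tokenUrl, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: buildTokenRequestBody(clientId, clientSecret),
  });
  if (!res.ok) throw new Error(`token request failed: HTTP ${res.status}`);
  const { access_token: accessToken } = await res.json();
  if (!accessToken) throw new Error("response has no access_token");
  return accessToken;
}
```

`URLSearchParams` handles the percent-encoding, so a secret containing `&` or `=` cannot corrupt the form body.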
  12. Step 2: POST Audio to AT&T (Non-Streaming HTTP Request)
      Request Method: POST
      Request URL:
      Request Headers:
         Accept: application/json
         Authorization: Bearer xxyz123
         Content-Type: audio/wav
         Content-Length: 1534
         X-SpeechContext: BusinessSearch
      Request Body:
         AUDIO_BINARY_DATA
      Note: The audio binary data goes directly in the POST body, not in a MIME attachment.
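The non-streaming request above can be reproduced with Node's built-in `https` module. This is a minimal sketch, not the Watson.js implementation: `buildSpeechHeaders` and `postAudio` are hypothetical names, and `speechUrl` is passed in because the request URL is not reproduced on the slide. It shows the detail from the note: the raw audio bytes are the entire POST body, and Content-Length is their byte count.

```javascript
const https = require("https");

// Headers for a non-streaming request; contentType is "audio/wav" or "audio/amr".
function buildSpeechHeaders(accessToken, audio, contentType, speechContext) {
  return {
    Accept: "application/json",
    Authorization: `Bearer ${accessToken}`,
    "Content-Type": contentType,
    "Content-Length": audio.length,   // byte length of the raw audio buffer
    "X-SpeechContext": speechContext, // e.g. "BusinessSearch" or "Generic"
  };
}

// POST the whole buffer as the body (no multipart/MIME wrapper) and pass the
// parsed JSON reply to an error-first callback.
function postAudio(speechUrl, accessToken, audio, contentType, speechContext, callback) {
  const options = {
    method: "POST",
    headers: buildSpeechHeaders(accessToken, audio, contentType, speechContext),
  };
  const req = https.request(speechUrl, options, (res) => {
    let body = "";
    res.setEncoding("utf8");
    res.on("data", (chunk) => { body += chunk; });
    res.on("end", () => {
      if (res.statusCode < 200 || res.statusCode >= 300) {
        return callback(new Error(`HTTP ${res.statusCode}: ${body}`));
      }
      let parsed;
      try {
        parsed = JSON.parse(body);
      } catch (err) {
        return callback(err);
      }
      callback(null, parsed);
    });
  });
  req.on("error", callback);
  req.end(audio); // the audio bytes are written as the request body
}
```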
  13. Step 2: POST Audio to AT&T (Streaming HTTP Request)
      Request Method: POST
      Request URL:
      Request Headers:
         Accept: application/json
         Authorization: Bearer xxyz123
         Content-Type: audio/amr
         Transfer-Encoding: chunked
         X-SpeechContext: QuestionAndAnswer
      Request Body:
         200
         AUDIO_BINARY_DATA_CHUNK
         200
         AUDIO_BINARY_DATA_CHUNK
         0
      Note: The numbers are the recommended chunk size in hexadecimal format (0x200 = 512 bytes). The final 0 is the zero-length chunk that ends the body.
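To make the chunked body above concrete, here is a sketch of how HTTP/1.1 chunked transfer coding frames audio: each chunk is its size in hexadecimal, CRLF, the bytes, CRLF, and a zero-size chunk terminates the body. `frameChunked` is a hypothetical, illustration-only helper that uses the slide's recommended 0x200 (512-byte) chunk size.

```javascript
// Illustration only: frame a buffer the way HTTP/1.1 chunked transfer coding does.
// Each chunk is "<size in hex>\r\n<bytes>\r\n"; a zero-size chunk ends the body.
function frameChunked(audio, chunkSize = 0x200) { // 0x200 = 512 bytes, as on the slide
  const parts = [];
  for (let offset = 0; offset < audio.length; offset += chunkSize) {
    const chunk = audio.subarray(offset, offset + chunkSize);
    parts.push(Buffer.from(`${chunk.length.toString(16)}\r\n`), chunk, Buffer.from("\r\n"));
  }
  parts.push(Buffer.from("0\r\n\r\n")); // terminating zero-length chunk
  return Buffer.concat(parts);
}
```

In Node.js you normally do not build these frames yourself: a POST created with `https.request` that has no Content-Length header is sent with chunked transfer encoding, and each `req.write()` call typically becomes one chunk, so piping captured audio into the request is enough.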
  14. AT&T SPEECH API EXAMPLE APPLICATION
      Download the Source:
  15. Transcription in Three Steps
      1. Capture Audio Input
         Capturing audio input differs from platform to platform. In our Basic Example, we use a small Adobe Flex app to access the mic via Flash, capture the audio in one of the two accepted formats, then save the newly created audio file to disk on the server. In our Speech Labs, we will look at the methods by which you can capture and stream audio directly to the Speech API.
      2. POST Audio to AT&T
         Once the audio input has been captured, we send the compatible audio file from our server to the Speech API using a simple POST. In our Basic Example, we use a small Node.js module called "Watson.js" (NPM: "watson-js") to authenticate with the Speech API via OAuth and then POST the audio file. In our Speech Labs, we will do this on iOS, Android, and the Web.
      3. Use AT&T API Response
         The AT&T API sends back an easy-to-parse JSON object containing the interpreted text. In our Basic Example, we output it to the user's screen, pretty-printed and syntax-highlighted, but you could do much more. In our Speech Labs, we will look at other ways to use this data, such as searching for businesses on Foursquare.
  16. Watson.js: Node.js API Wrapper for the AT&T Speech API
      GitHub:
      NPM:
  17. Using Watson.js
      1. Require the API wrapper:
            var WatsonClient = require('watson-js');
      2. Set the API client options:
            var options = {
              client_id: ATT_API_CLIENT_ID,          // "API Key" on the Developer Portal
              client_secret: ATT_API_CLIENT_SECRET,  // "Secret Key" on the Developer Portal
              access_token: ACCESS_TOKEN,
              scope: "SPEECH",
              context: "Generic",                    // speech context, e.g. Generic or BusinessSearch
              access_token_url: "",                  // OAuth token endpoint URL
              api_domain: ""                         // Speech API domain
            };
      3. Instantiate a new API client:
            var Watson = new WatsonClient.Watson(options);
  18. The Methods of Watson.js
      • Watson.getAccessToken(callback): requests a new OAuth access token using the client credentials grant type and passes the returned access token to the callback function.
      • Watson.speechToText(speechFile, accessToken, callback): pipes a speech file (given as an absolute file path) to the AT&T Speech API using the given access token. The API response is parsed and passed to the callback function as a JSON object.
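The two methods above are meant to be chained: get a token, then send the file with it. A minimal sketch, assuming both callbacks follow Node's error-first `(err, result)` convention (the slide does not state this explicitly); `transcribeFile` is a hypothetical helper, not part of Watson.js.

```javascript
// Hypothetical helper: chain Watson.getAccessToken and Watson.speechToText.
// Assumes both methods call back Node-style as (err, result).
function transcribeFile(watson, speechFile, callback) {
  watson.getAccessToken((err, accessToken) => {
    if (err) return callback(err); // e.g. bad client credentials
    watson.speechToText(speechFile, accessToken, (sttErr, reply) => {
      if (sttErr) return callback(sttErr);
      callback(null, reply); // parsed JSON response from the Speech API
    });
  });
}
```

In a long-running server it is probably better to obtain the token once and reuse it across requests rather than calling `getAccessToken` for every file; the token lifetime is not stated on these slides.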
  19. AT&T SPEECH API EXAMPLE APP CODE WALKTHROUGH
      Using the AT&T Speech API to convert generic audio to text in a web browser (example-basic in the examples repo).
  20. Frameworks & Requirements
      Server-side:
      • Node.js: JavaScript platform for building fast, scalable network apps
      • FS: Node.js file system module
      • Express: minimal web application framework for Node.js
      • Optimist: lightweight option-parsing module for Node.js
      • HBS: Express view engine wrapper for Handlebars
      • Watson.js: simple API wrapper for the AT&T Speech API
      Client-side:
      • jQuery: the gold standard of client-side JavaScript libraries
      • swfobject: JavaScript library that makes embedding Flash objects easier
      • Bootstrap: Twitter's CSS framework for quickly developing web apps
  21. Capture Audio Input
      • recorder.swf: Adobe Flex app that accesses the user's microphone and emits events to JavaScript
      • recorder.js: JavaScript interface that receives those events, updates the UI, and POSTs the audio file to Node.js
      Node.js upload script:

            var fs = require('fs');

            // Copy source to destination, passing any read or write error to callback
            function cp(source, destination, callback) {
              fs.readFile(source, function(err, buf) {
                if (err) { return callback(err); }
                fs.writeFile(destination, buf, callback);
              });
            }

            // `app` is the Express application. UPLOAD_ROUTE and UPLOAD_DESTINATION are
            // placeholders for the route path and destination path elided in the original.
            app.post(UPLOAD_ROUTE, function(req, res) {
              cp(req.files.upload_file.filename.path, __dirname + UPLOAD_DESTINATION, function(err) {
                if (err) { res.send({ error: err.message }); return; }
                res.send({ saved: true });
              });
            });
  22. POST Audio to AT&T
      AJAX request via POST from the client side to Node.js:

            // Receive an AJAX POST from client-side JavaScript.
            // SPEECH_ROUTE is a placeholder for the elided route path; accessToken holds
            // a token previously obtained with Watson.getAccessToken.
            app.post(SPEECH_ROUTE, function(req, res) {
              // Pass the audio file and access token to the AT&T Speech API
              Watson.speechToText(__dirname + '/public/audio/audio.wav', accessToken, function(err, reply) {
                // Pass any errors associated with the API call to client-side JS
                if (err) {
                  res.send({ error: err.message || err });
                  return;
                }
                // Return the parsed JSON to client-side JavaScript
                res.send(reply);
              });
            });
  23. Use Speech API Response
      Example API response (Content-Type: application/json), as returned for a request sent with Accept: application/json:

            {
              "Recognition": {
                "ResponseId": "74a964bf2fe",
                "NBest": [
                  {
                    "WordScores": [1, 0.75, 1, 0.75],
                    "Confidence": 0.75,
                    "Grade": "accept",
                    "ResultText": "This is a test.",
                    "Words": ["This", "is", "a", "test."],
                    "LanguageId": "en-us",
                    "Hypothesis": "This is a test."
                  }
                ]
              }
            }

      Response parameters and what they mean:
      • Recognition: body object for the AT&T Speech API response
      • ResponseId: unique identifier for a specific API call
      • NBest: array of hypothesis objects (possible transcriptions of the audio data)
      • ResultText: plain-text, cleaned-up representation of the Hypothesis; use this when displaying the text to users
      • Confidence: confidence score for the overall Hypothesis, on a scale from 0 (not confident) to 1.0 (very confident)
      • Grade: recommended action to take with the current Hypothesis: accept, reject, or confirm
      • Words: array of the individual words; the confidence score for each word is in the WordScores array
      • WordScores: array of confidence scores, one for each word in ResultText, corresponding to the Words array
      • LanguageId: language of the response; English and Spanish are supported in the Generic context, English only in other contexts
      • Hypothesis: the raw transcription of the interpreted audio
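The Grade and WordScores fields above map naturally onto UI decisions. A minimal sketch, assuming `NBest[0]` is the top-ranked hypothesis; `interpretRecognition`, its `wordThreshold` parameter and the action names `use`, `confirm` and `retry` are hypothetical, not part of the API.

```javascript
// Hypothetical helper: turn a Speech API reply into an action for the UI.
// Assumes NBest[0] is the top-ranked hypothesis.
function interpretRecognition(reply, wordThreshold = 0.5) {
  const nbest = reply && reply.Recognition && reply.Recognition.NBest;
  const best = Array.isArray(nbest) ? nbest[0] : undefined;
  if (!best) return { action: "retry", text: "", confidence: 0, uncertainWords: [] };

  // Words whose individual score is below the threshold, e.g. to highlight in the UI.
  const words = best.Words || [];
  const scores = best.WordScores || [];
  const uncertainWords = words.filter(
    (word, i) => typeof scores[i] === "number" && scores[i] < wordThreshold);

  // Map the API's Grade onto what the app should do next.
  const actions = { accept: "use", confirm: "confirm", reject: "retry" };
  return {
    action: actions[best.Grade] || "retry",
    text: best.ResultText,
    confidence: best.Confidence,
    uncertainWords,
  };
}
```

With the example response above, `interpretRecognition(reply)` yields the action `"use"` and the text `"This is a test."`; passing a threshold of 0.8 flags "is" and "test." as uncertain, since both scored 0.75.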
  24. Up Next: Michael Fitzpatrick
  25. Up Next: Jason Goecke, Adam Kalsey
  26. ADVANCED EXAMPLES
      What can you do with speech-to-text? You could…
      • Make your mobile or web application accessible with voice commands
      • Post tweets using voice commands in a simple Twitter app
      • Add on-the-fly transcripts while recording in a podcasting app
      • Automatically add captions to videos hosted on your website
      • Create real-time closed captions of a conference speaker's presentation
      • Search for nearby places to check in at on Foursquare
  27. Speech Labs
      We're now going to break out into three clusters, each focusing on a different technology stack. Work independently or with a partner!
      • Web (Flex + Node.js): In the Web Speech Lab, Michael will be on hand to help get your Node.js app working with the AT&T Speech API. Code up your own Speech API app from scratch, or start from a boilerplate app that uses Foursquare to search for locations and lets you check in from your web browser!
      • iOS (Objective-C): In the iOS Speech Lab, Brant will help you try out the AT&T Speech API on iOS and go into more depth about the AT&T Speech SDK for iOS. The mobile SDK allows you to quickly capture and stream audio from your iPhone or iPad app to the AT&T Speech API.
      • Android (Java): In the Android Speech Lab, Jay will help you try out the AT&T Speech API on Android and go into more depth about the AT&T Speech SDK for Android. The mobile SDK allows you to quickly capture and stream audio from your Android phone or tablet app to the AT&T Speech API.
  28. September 25, 2012
      THANKS! ANY QUESTIONS?
      Michael Owens (@mko on Twitter, mowens on GitHub)
      Jay Lieske (jayatyp on GitHub)
      AT&T Developer Program