
# Looking into the Future: Using Google's Prediction API


We would all like to predict the future at some point in our lives. Thanks to Google, we can now be one step closer! This talk gives an overview of what the Google Prediction API is, how you can use it to analyze data sets, and its strengths and weaknesses. It then runs open data sets through the system, covering both regression and categorization models.

Published in: Data & Analytics

### Looking into the Future: Using Google's Prediction API

1. 1. Looking into the Future: Using Google’s Prediction API • Justin Grammens • Recursive Awesome & IoT Weekly
2. 2. What is Prediction? • Deﬁned by Wikipedia as: “A statement about an uncertain event.” • Continues on to read… “It is often, but not always, based upon experience or knowledge.” • In statistics, prediction is a part of Statistical Inference.
3. 3. Statistical Inference • Statistical inference is the process of deducing properties of an underlying distribution by analysis of data. • Two major paradigms used for statistical inference • Frequentist Inference • Bayesian Inference
4. 4. Frequentist Inference • Data is a repeatable random sample with a specific probability • Parameters and probabilities remain constant during the test • Results are independent of prior tests • Q: Will the sun rise tomorrow? What’s the probability of a sun dying, based on all the suns in the universe?
5. 5. Bayesian Inference • Take into account prior results and subjective beliefs • Update probabilities of occurrence based on new data • Tests are NOT run in isolation and affect one another • Q: Will the sun rise tomorrow? Depends on how many times we have seen it rise in the past
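The sunrise question on the two slides above is the classic setting for Laplace's rule of succession, which is one common way (not necessarily the speaker's) to turn "how many times have we seen it rise" into a number. The sketch below assumes a uniform prior and compares that estimate with a plain relative frequency; the function names are illustrative.

```python
from fractions import Fraction


def frequentist_estimate(successes: int, trials: int) -> Fraction:
    """Relative frequency: the fraction of observed trials that succeeded."""
    return Fraction(successes, trials)


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate for the next trial, assuming a uniform prior."""
    return Fraction(successes + 1, trials + 2)


# With no observations the Bayesian estimate is just the prior: 1/2.
print(rule_of_succession(0, 0))        # 1/2
# After 10 sunrises in 10 days the estimate moves toward 1 without reaching it.
print(rule_of_succession(10, 10))      # 11/12
# The relative frequency over the same 10 days is exactly 1.
print(frequentist_estimate(10, 10))    # 1
```

Each new observation updates the Bayesian estimate, which is the behavior the "update probabilities of occurrence based on new data" bullet describes.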
6. 6. Predictions by Machines • Could therefore define prediction as an “informed guess or opinion.” • Software systems have to be trained before they can be effective. • source: reading.pppst.com
7. 7. What is Prediction API? • Announced at Google I/O in 2011 • Provides pattern-matching and machine learning capabilities. • Handles both numeric and text input • Handles both classification and regression output • Access from App Engine, client libs and command line • Able to retrain the model on the fly - Bayesian?
8. 8. What Are Some Usages?
9. 9. What Do You Need? • Google Account • Google Platform Console project • Google Prediction API activated • Google Cloud Storage API activated
10. 10. Steps Involved • Define what you are trying to accomplish • Find the training data and format it to support your goal (hardest part) • Upload training data to Google Cloud Storage • Train the system against the data you provide • Send queries to your model • Update the model with additional data as new information is gained.
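The train, query, and update steps above can be sketched as plain HTTP request descriptions. Everything API-specific here is an assumption recalled from the v1.6 REST reference rather than something shown in the slides: the base URL, the `trainedmodels` paths, and the `storageDataLocation`, `csvInstance`, and `output` field names should all be checked against the official documentation. The functions only build requests; nothing is sent, and OAuth 2.0 authorization is left out.

```python
import json

# Assumed base URL for the v1.6 REST API; verify before use.
BASE = "https://www.googleapis.com/prediction/v1.6/projects"


def insert_request(project, model_id, data_location):
    """Train a new model from a CSV already uploaded to Cloud Storage."""
    url = f"{BASE}/{project}/trainedmodels"
    body = {"id": model_id, "storageDataLocation": data_location}
    return "POST", url, json.dumps(body)


def get_request(project, model_id):
    """Check training status and how the data was interpreted."""
    return "GET", f"{BASE}/{project}/trainedmodels/{model_id}", None


def predict_request(project, model_id, features):
    """Query the model with one row of feature values (no answer column)."""
    url = f"{BASE}/{project}/trainedmodels/{model_id}/predict"
    body = {"input": {"csvInstance": list(features)}}
    return "POST", url, json.dumps(body)


def update_request(project, model_id, output, features):
    """Add one new labeled example to an existing model."""
    url = f"{BASE}/{project}/trainedmodels/{model_id}"
    body = {"output": output, "csvInstance": list(features)}
    return "PUT", url, json.dumps(body)


# Hypothetical project, model, and bucket names, for illustration only.
method, url, body = predict_request("my-project", "weather", ["Seattle", 288, "sunny"])
print(method, url)
print(body)
```

Keeping request construction separate from sending makes it easy to inspect exactly what would go over the wire before adding an authorized HTTP client.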
11. 11. Hosted Model • The Prediction API hosts a gallery of user-submitted models • Owners can charge for the use of the model • Hosted models are versioned so they can be updated easily • Models are submitted in PMML format • XML-based language to define statistical & data models • There currently appears to be a waitlist
12. 12. How To Train • 3 ways to create and train the correct type of model • CSV File - Lives on Google Cloud Storage • Training data embedded in request • Limited to the size of an HTTP Request < 2MB • Empty model created and trained with update calls
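For the second option above, training data embedded in the request, a quick size check can catch an oversized payload before it is sent. This is a minimal sketch: the `trainingInstances`, `output`, and `csvInstance` field names are assumptions recalled from the v1.6 REST reference, and treating "2MB" as 2 × 1024 × 1024 bytes is also an assumption.

```python
import json

# The slide quotes "< 2MB"; the exact byte count is an assumption.
MAX_REQUEST_BYTES = 2 * 1024 * 1024


def embedded_training_body(model_id, examples):
    """Build a training request body from (answer, [features...]) pairs."""
    return {
        "id": model_id,
        "trainingInstances": [
            {"output": answer, "csvInstance": list(features)}
            for answer, features in examples
        ],
    }


def fits_in_request(body) -> bool:
    """True if the JSON-encoded body is under the request size limit."""
    return len(json.dumps(body).encode("utf-8")) < MAX_REQUEST_BYTES


# Hypothetical examples reusing the spam query from the deck.
body = embedded_training_body(
    "spam-filter",
    [("spam", ["Lose weight now!"]), ("ham", ["Lunch at noon?"])],
)
print(fits_in_request(body))  # True
```

If the check fails, the CSV-on-Cloud-Storage route or incremental update calls are the other two options the slide lists.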
13. 13. CSV File Rules • Maximum file size 2.5 GB • No header row; the system has no use for one • One example per line • The first column indicates to the system the type of model. • Ideally remove punctuation (other than apostrophes) from your data.
14. 14. CSV File Rules • Text Strings • Double quotes around all text strings • Text matching is case-sensitive • Numeric Values • Integers and decimals are supported • Numbers: "1", "23", "999" • Strings: "6 12", "colt 45"
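The CSV rules on the two slides above map fairly directly onto Python's standard `csv` module. The sketch below writes rows with no header, the answer in the first column, text in double quotes, and numbers left bare. The helper names are illustrative, and the punctuation filter assumes straight apostrophes.

```python
import csv
import io
import re


def strip_punctuation(text: str) -> str:
    """Drop punctuation other than apostrophes, keeping letters, digits, spaces."""
    return re.sub(r"[^\w\s']", "", text)


def to_training_csv(rows) -> str:
    """Serialize rows as header-less CSV: strings quoted, numbers unquoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# First column is the answer; the remaining columns are features.
rows = [
    [62, "Seattle", 288, "sunny"],
    ["spam", strip_punctuation("Lose weight now!")],
]
print(to_training_csv(rows), end="")
# 62,"Seattle",288,"sunny"
# "spam","Lose weight now"
```

Passing values as Python `int` or `float` keeps them unquoted, while anything passed as `str` is quoted.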
15. 15. Structuring Data • Example Value • “The Answer” • Features • No limit on the number of features • The more features & examples, the better • Training on 16 MB takes ~1 hour
16. 16. What’s The Answer?
17. 17. Regression Model Example Data • Define your data to support numbers and strings • A query of “Seattle, 288, sunny” might get back a value of 62 • Queries don’t need to match any values in the dataset • Fill the model with all columns, then query with the first column missing
18. 18. Classification Model Example Data • A query of “Lose weight now!” would return “spam” • Returns the category from the dataset
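Since the first column decides which kind of model is trained, it can help to check a training line before uploading it. The heuristic below is an assumption drawn from the deck (a bare number in the first column leads to regression, a quoted value leads to classification) and is not the service's actual parsing logic.

```python
def guess_model_type(csv_line: str) -> str:
    """Guess the model type from the first column of one raw CSV line."""
    line = csv_line.lstrip()
    if line.startswith('"'):
        # A quoted first column is treated as a category label.
        return "classification"
    first_value = line.split(",", 1)[0].strip()
    try:
        float(first_value)
    except ValueError:
        # Unquoted text that is not a number is still a label.
        return "classification"
    return "regression"


print(guess_model_type('62,"Seattle",288,"sunny"'))   # regression
print(guess_model_type('"spam","Lose weight now"'))   # classification
print(guess_model_type('"30005","30012"'))            # classification
```

The last call shows why quoting matters for numeric-looking identifiers such as station IDs: without the quotes the same line would be guessed as regression.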
19. 19. Authorization • You must use OAuth 2.0 to authorize requests • Can share your model with others • View: User can call Analyze, Get, List and Predict on the project and/or any model owned by the project. • Edit: User has all the permissions of Can view, but can also Delete, Insert, and Update any models owned by the project. • Is Owner: User has all the permissions of Can edit, but can also grant permissions to other users to access the project.
20. 20. Tips & Tricks • The more examples & features, the better the results • However, adding more features doesn’t always give better predictions: four flag columns (is_comedy = Y, is_drama = N, is_action = N, is_horror = N) vs. a single column (genre = Comedy)
21. 21. Tips & Tricks • Need to add a numeric aspect to the genre? • Add additional genre columns and weight them based on count, e.g. five genre columns: Drama, Drama, Drama, Comedy, Comedy
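The two feature-shaping ideas above can be sketched in a few lines: collapsing Y/N flag columns into one `genre` value, and repeating a genre across several columns to express weight. The column names follow the slide; the helper names and the choice to reject rows with zero or multiple flags set are assumptions.

```python
GENRE_FLAGS = ["is_comedy", "is_drama", "is_action", "is_horror"]


def flags_to_genre(row: dict) -> str:
    """Collapse Y/N flag columns into a single genre value."""
    genres = [
        name[len("is_"):].capitalize()
        for name in GENRE_FLAGS
        if row.get(name) == "Y"
    ]
    if len(genres) != 1:
        raise ValueError(f"expected exactly one genre flag, got {genres}")
    return genres[0]


def repeat_by_weight(weights: dict) -> list:
    """Expand {genre: count} into one genre column per unit of weight."""
    columns = []
    for genre, count in weights.items():
        columns.extend([genre] * count)
    return columns


row = {"is_comedy": "Y", "is_drama": "N", "is_action": "N", "is_horror": "N"}
print(flags_to_genre(row))                            # Comedy
print(repeat_by_weight({"Drama": 3, "Comedy": 2}))
# ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy']
```

A film that is mostly drama with some comedy then contributes "Drama" three times and "Comedy" twice, matching the five-column example on the slide.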
22. 22. Tips & Tricks • Always put something into each feature • Include all the features that you know about • For Regression: • Make sure you take the time to ensure the values are correct • Conversely, if you have exact numbers, use them • Try to have at least a few hundred examples for each category
23. 23. Tips & Tricks • Can only compare against known relationships • Can’t feed an untrained title and user to get a rating • Solution is to break the title into genre, director, actors • Example rows (rating, user_name, movie_title): (9.5, Justin, Star Wars); (2.2, Justin, Disaster Movie); (5.0, Justin, Billy Madison)
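Breaking a title into shared attributes, as suggested above, might look like the following. The `MOVIE_INFO` lookup is a hypothetical stand-in for whatever metadata source is available, and its genre labels are illustrative; only the ratings and titles come from the slide.

```python
# Hypothetical metadata lookup; a real pipeline would load this from a
# movie database. Genre labels here are illustrative.
MOVIE_INFO = {
    "Star Wars": {"genre": "Sci-Fi", "director": "George Lucas"},
    "Disaster Movie": {"genre": "Comedy", "director": "Jason Friedberg"},
    "Billy Madison": {"genre": "Comedy", "director": "Tamra Davis"},
}


def expand_row(rating, user_name, movie_title, info=MOVIE_INFO):
    """Replace the opaque title with attributes other titles can share."""
    details = info[movie_title]
    return [rating, user_name, details["genre"], details["director"]]


ratings = [
    (9.5, "Justin", "Star Wars"),
    (2.2, "Justin", "Disaster Movie"),
    (5.0, "Justin", "Billy Madison"),
]
for rating, user, title in ratings:
    print(expand_row(rating, user, title))
```

A title the model has never seen can then still be queried, as long as its genre or director appears somewhere in the training data.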
24. 24. Let’s Talk Data! • Nice Ride • Based on the starting station, predict the ending station • New York Cab Rides • Given a starting GPS coordinate, predict where the cab ride will end • Sentiment Analysis • Based on the State of the Union speech, determine the sentiment
25. 25. Based on the starting station, can we predict the ending station?
26. 26. Nice Ride Location Rides • https://www.niceridemn.org/data/ • Offers a live XML stream to update along the way
27. 27. Nice Ride Location Rides Started with this: Next: Ended with this:
28. 28. Nice Ride Insert Data ID & Location
29. 29. Nice Ride Running Prediction Status
30. 30. Lessons Learned • I forgot to put the values in quotes, so the system treated them as numerical regression. • Verify how it’s interpreting your data with a “get” call and check the reported model type.
31. 31. Nice Ride Location Rides Show Scripts, API & Results
32. 32. Can we predict the movement of NYC cabs?
33. 33. NYC Cab Ride Data • Data Dictionary • Data Website
34. 34. Sample Data Contains pickup & drop off latitude and longitude
35. 35. There’s A Problem • Asking for 2 inputs and 2 outputs! • Not possible with Prediction API as it only supports one dependent variable. :( • Change of plan…
36. 36. Let’s predict the cost of a NYC cab ride instead!
37. 37. Prediction Demo • Features are distances (B) • Examples are prices (A) • Is this accurate? • Different fares based on areas of the city
38. 38. Ok, not really… Let's use location based data instead
39. 39. Prediction Demo • Latitude / Longitude are the features (B, C, D, E) • Price is the example value (A) • Examples
40. 40. NYC Cab Ride Location Show Scripts, API & Results
41. 41. Sentiment Analysis of a Speech
42. 42. Speech Sentiment • Always Check Your Data! • Website incorrectly claimed positive(4), negative(0) and neutral(2) sentiment. • Data had groups of sentiment values. • Source
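A quick tally of the label column is one way to act on "Always Check Your Data!" before training. The sketch below counts each label and flags anything outside, or missing from, the documented set. The sample rows are hypothetical placeholders, not taken from the dataset used in the talk.

```python
from collections import Counter


def label_counts(rows, label_index=0):
    """Count how often each label value appears in the given column."""
    return Counter(row[label_index] for row in rows)


def unexpected_labels(rows, expected, label_index=0):
    """Labels present in the data but absent from the documented set."""
    return sorted(set(label_counts(rows, label_index)) - set(expected))


def missing_labels(rows, expected, label_index=0):
    """Documented labels that never occur in the data."""
    return sorted(set(expected) - set(label_counts(rows, label_index)))


# Hypothetical rows: label first, text second.
sample = [["0", "this was awful"], ["4", "loved it"], ["4", "great speech"]]
documented = {"0", "2", "4"}

print(label_counts(sample))                   # Counter({'4': 2, '0': 1})
print(unexpected_labels(sample, documented))  # []
print(missing_labels(sample, documented))     # ['2']
```

Here the documented neutral label "2" never shows up, which is the kind of mismatch between a website's description and the actual file that the slide warns about.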
43. 43. Speech Sentiment • Feature • Example Value • Training Examples
44. 44. Sentiment Training
45. 45. Sentiment Example Show Scripts, API & Results • Obama State of the Union Speech - 1/16 • Donald Trump Speech, Des Moines, IA - 1/24
46. 46. Smart Spreadsheets Install Smart Autofill Add-on
47. 47. Smart Spreadsheets Prediction API used to fill in missing values
48. 48. Smart Spreadsheets Select columns to use for data training
49. 49. Smart Spreadsheets “Example Values” are populated
50. 50. Final Thoughts - Overfitting • Overfitting the model generally takes the form of making an overly complex model to explain idiosyncrasies in the data under study. • Therefore, a model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. • A query that exactly matches a training example should not simply return that example’s value
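The overfitting point can be illustrated without any external service. The toy sketch below compares a model that memorizes its training examples (nearest neighbor) with a simple least-squares line, on made-up data that follows y = 2x with alternating ±1 noise. The data and both models are illustrative assumptions and say nothing about how the Prediction API worked internally.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def memorize(xs, ys):
    """Overfit on purpose: return the answer of the closest training example."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and true values."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Training data: y = 2x plus alternating +1 / -1 noise.
train_x = [0, 1, 2, 3, 4]
train_y = [1, 1, 5, 5, 9]
# Held-out data: the noise-free relationship at unseen points.
test_x = [0.4, 1.4, 2.4, 3.4]
test_y = [0.8, 2.8, 4.8, 6.8]

line = fit_line(train_x, train_y)
memo = memorize(train_x, train_y)

print("train error:", mean_abs_error(line, train_x, train_y),
      mean_abs_error(memo, train_x, train_y))
print("test error: ", mean_abs_error(line, test_x, test_y),
      mean_abs_error(memo, test_x, test_y))
```

The memorizing model reproduces every training example exactly, so its training error is zero, yet it does worse than the line on the held-out points because it has learned the noise along with the trend.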
51. 51. Thank You • Justin Grammens • justin@recursiveawesome.com • http://recursiveawesome.com • Check out my IoT Weekly Newsletter http://iotweeklynews.com