Cognitive computing with big data, high tech and low tech approaches


I explain some very approachable methods for analyzing big data via a detour through clipper ships and the 19th century open source scene.

Note that I mixed up the route of the Flying Cloud record in this talk. The Flying Cloud's record was actually from New York to San Francisco and was even more impressive than what I said. The usual time had been about 180 days. With Maury's charts, the time was reduced to about 135 days. The Flying Cloud's time was 89 days.

Thanks to Chen Kung for noticing my error.

  • Yes. The Flying Cloud made prominent use of Maury's charts and attributed a major part of their halving the time from NY to SF to his information. The Captain and navigator's skills were, of course, non-trivial, but Maury's information provided a decisive difference.
  • Hi Ted,

    I like the tea race story very much. But I have two questions:
    1) According to Wikipedia, the 1866 tea race route was from China to London, but the 1851 Flying Cloud route was from New York to San Francisco. So I wonder whether the comparison is fair.

    2) You attribute the difference partially to big data and refer to the work done by Maury. Are you saying that the Flying Cloud used the detailed sea charts from Maury's work?
  • Mention that the Pony book said “RowSimilarityJob”…
  • Problem starts here…

Transcript

  • 1. © 2014 MapR Technologies
  • 3. Who I am
      Ted Dunning, Chief Applications Architect, MapR Technologies
      Email: tdunning@mapr.com, tdunning@apache.org
      Twitter: @Ted_Dunning
      Apache Mahout: https://mahout.apache.org/
      Twitter: @ApacheMahout
  • 4. The outline
      • The first open source, big data project
      • Another big data project
      • Conclusions
  • 5. First: An apology for going off-script
  • 6. Now, the story
  • 8. In 1866, the top finishers in the tea race reached London in 99 days, within 2 hours of each other
  • 10. But in 1851, the record had been set at 89 days by the Flying Cloud
  • 11. The difference was due (in part) to big data
  • 15. These charts were free … if you donated your data
  • 16. But how does this apply today?
  • 17. Key Points of Maury’s Work
      • Give to get
          – Give the Abstract Log to captains, get data
      • Data consortium wins
          – Merging data gives pictures nobody else can see
      • Give back
          – Them that gives, also gets
      • But this is just what every data driven web site does!
          – Just 150 years before everybody else
  • 18. The Real News in Behavioral Analysis
      • Everybody knows that:
          – You need ensembles of many models to do recommendations
          – You need to use factorization models
          – You predict what you observe
          – (You should predict ratings)
  • 19. But … none of this is really true
  • 20. In fact,
      • Fancy models are rarely a useful expenditure of time
      • Factorization can be good, but not much better (if at all)
      • Ratings are disastrously bad data
      • Cross-recommendation and multi-modal recommendations are much more interesting
          – Multiple kinds of input are far better than multiple models
      • The UI has a far larger impact than the models
      • The best algorithms combine simplicity with accuracy
          – So simple you can embed them in a search engine
  • 21. Here’s how
  • 22. Cooccurrence Analysis
  • 23. How often do items co-occur?
  • 24. Which co-occurrences are interesting? Each row of indicators becomes a field in a search engine document
  • 25. Recommendations
      Alice: got an apple and a puppy
  • 26. Recommendations
      Alice: got an apple and a puppy
      Charles: got a bicycle
  • 27. Recommendations
      Alice: got an apple and a puppy
      Bob: got an apple
      Charles: got a bicycle
  • 28. Recommendations
      Alice: got an apple and a puppy
      Bob: What else would Bob like?
      Charles: got a bicycle
  • 29. Recommendations
      Alice: got an apple and a puppy
      Bob: A puppy!
      Charles: got a bicycle
  • 30. You get the idea of how recommenders can work…
  • 31. By the way, like me, Bob also wants a pony…
  • 32. Recommendations (Alice, Bob, Amelia, Charles): What if everybody gets a pony? What else would you recommend for new user Amelia?
  • 33. Recommendations (Alice, Bob, Amelia, Charles): If everybody gets a pony, it’s not a very good indicator of what else to predict...
  • 34. Problems with Raw Co-occurrence
      • Very popular items co-occur with everything, which is why it’s not very helpful to know that everybody wants a pony…
          – Examples: Welcome document; Elevator music
      • Very widespread occurrence is not interesting for generating indicators for recommendation
          – Unless you want to offer an item that is constantly desired, such as razor blades (or ponies)
      • What we want is anomalous co-occurrence
          – This is the source of interesting indicators of preference on which to base recommendation
  • 35. Overview: Get Useful Indicators from Behaviors
      1. Use log files to build a history matrix of users x items
          – Remember: this history of interactions will be sparse compared to all potential combinations
      2. Transform to a co-occurrence matrix of items x items
      3. Look for useful indicators by identifying anomalous co-occurrences to make an indicator matrix
          – The log-likelihood ratio (LLR) can help judge which co-occurrences can confidently be used as indicators of preference
          – ItemSimilarityJob in Apache Mahout uses LLR
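Steps 1 and 2 of the overview can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `log` data is made up (it reuses the Alice, Bob, Charles and pony examples from the earlier slides), and the names `history`, `cooccurrence` and `contingency` are hypothetical helpers, not anything from Mahout.

```python
from collections import Counter
from itertools import combinations

# Hypothetical interaction log of (user, item) pairs, as might be parsed from log files.
log = [
    ("alice", "apple"), ("alice", "puppy"), ("alice", "pony"),
    ("bob", "apple"), ("bob", "pony"),
    ("charles", "bicycle"), ("charles", "pony"),
]

# Step 1: the sparse users x items history, stored as user -> set of items.
history = {}
for user, item in log:
    history.setdefault(user, set()).add(item)

# Step 2: items x items co-occurrence counts (number of users who have both items).
item_counts = Counter()
cooccurrence = Counter()
for items in history.values():
    item_counts.update(items)
    for a, b in combinations(sorted(items), 2):
        cooccurrence[(a, b)] += 1


def contingency(a, b):
    """Return the 2x2 counts (both, a without b, b without a, neither) over users."""
    n_users = len(history)
    k11 = cooccurrence[tuple(sorted((a, b)))]
    k12 = item_counts[a] - k11
    k21 = item_counts[b] - k11
    k22 = n_users - k11 - k12 - k21
    return k11, k12, k21, k22


for pair, count in sorted(cooccurrence.items()):
    print(pair, count, contingency(*pair))
```

In this toy data the pony co-occurs with every other item, which is the raw co-occurrence problem described on slide 34. The 2x2 counts returned by `contingency` are the input that an anomaly test such as LLR (step 3) would score.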
  • 36. Which one is the anomalous co-occurrence?

                      A      not A
          B           1          0
          not B       0          2

                      A      not A
          B          13       1000
          not B    1000    100,000

                      A      not A
          B           1          0
          not B       0     10,000

                      A      not A
          B          10          0
          not B       0    100,000

  • 37. Which one is the anomalous co-occurrence? (with scores)

          Score 1.95
                      A      not A
          B           1          0
          not B       0          2

          Score 0.90
                      A      not A
          B          13       1000
          not B    1000    100,000

          Score 4.52
                      A      not A
          B           1          0
          not B       0     10,000

          Score 14.3
                      A      not A
          B          10          0
          not B       0    100,000
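The scores for these four tables can be approximately reproduced with a short log-likelihood ratio calculation. The sketch below uses the entropy form of the G² statistic and then takes a signed square root. The function names are my own, and the reading that the slide reports the square-root form is an inference from the numbers, not something the slides state.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    """Signed square root of the LLR; negative when the pair co-occurs less than expected."""
    score = math.sqrt(llr(k11, k12, k21, k22))
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        score = -score
    return score


tables = [
    (1, 0, 0, 2),
    (13, 1000, 1000, 100_000),
    (1, 0, 0, 10_000),
    (10, 0, 0, 100_000),
]
for table in tables:
    print(table, round(root_llr(*table), 2))
```

The table with 13 co-occurrences scores lowest even though it has the largest raw count, because A and B are both common there. The table with 10 co-occurrences and no other appearances of A or B scores highest, which is the sense in which its co-occurrence is anomalous.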
  • 38. Collection of Documents: Insert Meta-Data
      Search technology; item meta-data; ingest easily via NFS
      Document for “puppy”:
          id: t4
          title: puppy
          desc: The sweetest little puppy ever.
          keywords: puppy, dog, pet
  • 39. From Indicator Matrix to New Indicator Field
      Solr document for “puppy”:
          id: t4
          title: puppy
          desc: The sweetest little puppy ever.
          keywords: puppy, dog, pet
          indicators: (t1)
      Note: data for the indicator field is added directly to the meta-data for a document in an Apache Solr or Elasticsearch index. You don’t need to create a separate index for the indicators.
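To see how an indicator field turns recommendation into a search query, here is a small in-memory sketch. It is a stand-in for a search engine, not Solr or Elasticsearch code. Only the "puppy" document (id `t4`, indicator `t1`) comes from the slide. The other documents, the guess that `t1` is the apple, and the `recommend` function are assumptions for illustration. A real engine would rank with its own relevance scoring rather than the simple overlap count used here.

```python
# Item documents with an indicator field. Only t4/puppy and its indicator t1 are
# from the slide; the remaining ids, titles and indicators are hypothetical.
docs = [
    {"id": "t1", "title": "apple", "indicators": ["t4"]},
    {"id": "t4", "title": "puppy", "indicators": ["t1"]},
    {"id": "t7", "title": "bicycle", "indicators": []},
]


def recommend(history, docs, limit=10):
    """Rank unseen items by how many of their indicators appear in the user's history."""
    seen = set(history)
    scored = []
    for doc in docs:
        if doc["id"] in seen:
            continue  # do not recommend what the user already has
        score = len(seen & set(doc["indicators"]))
        if score > 0:
            scored.append((score, doc["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


# Bob's history contains only the apple (t1), so the query is "indicators: t1".
print(recommend(["t1"], docs))
```

The user's recent history plays the role of the query, and the indicator field plays the role of the searchable text, so no separate recommendation service is needed at query time.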
  • 40. Going Further: Multi-Modal Recommendation
  • 41. Going Further: Multi-Modal Recommendation
  • 42. For example
      • Users enter queries (A)
          – (actor = user, item = query)
      • Users view videos (B)
          – (actor = user, item = video)
      • AᵀA gives query recommendation
          – “did you mean to ask for”
      • BᵀB gives video recommendation
          – “you might like these videos”
  • 43. The punch-line
      • BᵀA recommends videos in response to a query
          – (isn’t that a search engine?)
          – (not quite, it doesn’t look at content or meta-data)
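The three matrix products can be made concrete with tiny hand-written matrices. This is a sketch on made-up data: the rows are three hypothetical users, the columns of `A` are two queries and the columns of `B` are three videos. Plain lists are used instead of a linear algebra library to keep the example self-contained.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


# A: users x queries (hypothetical). Users 0 and 1 entered query 0, user 2 entered query 1.
A = [
    [1, 0],
    [1, 0],
    [0, 1],
]
# B: users x videos (hypothetical). User 0 viewed video 0, user 1 viewed videos 0 and 1,
# and user 2 viewed video 2.
B = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]

AtA = matmul(transpose(A), A)  # queries x queries: query recommendation
BtB = matmul(transpose(B), B)  # videos x videos: video recommendation
BtA = matmul(transpose(B), A)  # videos x queries: videos in response to a query

print("AtA", AtA)
print("BtB", BtB)
print("BtA", BtA)
```

Each entry of `BtA` counts the users who both viewed a given video and entered a given query, so a column of it lists candidate videos for that query without looking at any content or meta-data. In practice these raw counts would be filtered for anomalous co-occurrence (for example with LLR) before being used as indicators.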
  • 44. Real-life example
      • Query: “Paco de Lucia”
      • Conventional meta-data search results:
          – “hombres de paco” times 400
          – not much else
      • Recommendation based search:
          – Flamenco guitar and dancers
          – Spanish and classical guitar
          – Van Halen doing a classical/flamenco riff
  • 45. Real-life example
  • 46. Hypothetical Example
      • Want a navigational ontology?
      • Just put labels on a web page with traffic
          – This gives A = users x label clicks
      • Remember viewing history
          – This gives B = users x items
      • Cross recommend
          – B’A = label to item mapping
      • After several users click, results are whatever users think they should be
  • 47. More Details Available: available for free at http://www.mapr.com/practical-machine-learning
  • 48. Who I am
      Ted Dunning, Chief Applications Architect, MapR Technologies
      Email: tdunning@mapr.com, tdunning@apache.org
      Twitter: @Ted_Dunning
      Apache Mahout: https://mahout.apache.org/
      Twitter: @ApacheMahout
  • 49. Q & A. Engage with us! @mapr, maprtech, jbates@mapr.com, MapR, maprtech, mapr-technologies