Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015

NoSQL matters
Jun. 8, 2015

  1. Entity-Centric Indexing Mark Harwood @elasticmark 4/6/2015
  2. Entity-centric indexes (or “when aggregations don’t cut it”) - www.elastic.co
  3. A typical “event-centric” deployment. Diagram labels: Event stream; Time-based event indexes. Copyright Elastic 2015. Copying, publishing and/or distributing without written permission is strictly prohibited.
  4. Problem: some aggregations are expensive
     • Using web server log data, answer the question: “how long on average do customers spend on my site?”
     • We need to join all event-level data together at query-time.
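Why this question forces a query-time join: no single log event carries a session's duration, so it has to be derived from the first and last event that share a session key. A minimal sketch in plain Python, assuming hypothetical event records with `session` and `ts` fields (these names are illustrative and not from the talk):

```python
# Hypothetical web-log events; the field names are assumptions for illustration.
events = [
    {"session": "a", "ts": 100},
    {"session": "b", "ts": 105},
    {"session": "a", "ts": 160},
    {"session": "b", "ts": 405},
    {"session": "c", "ts": 500},
]


def average_session_duration(events):
    """Group events by session key, then average (last ts - first ts)."""
    first_last = {}  # one bucket per distinct session key
    for e in events:
        lo, hi = first_last.get(e["session"], (e["ts"], e["ts"]))
        first_last[e["session"]] = (min(lo, e["ts"]), max(hi, e["ts"]))
    durations = [hi - lo for lo, hi in first_last.values()]
    return sum(durations) / len(durations)


avg = average_session_duration(events)
```

The `first_last` dictionary holds one entry per session, which is the in-memory analogue of one bucket per value of the joining key.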
  5. How to cripple elasticsearch with a bucket explosion:
     1. Ask a question about values that need to be derived from multiple documents (e.g. deriving a web session’s duration)
     2. Make the joining key a high cardinality field, e.g. something like “IP address”
     3. Extra points if you use no routing of your documents, so that related content is spray-gunned across multiple shards
  6. Solution: a “pay-as-you-go” model for the costs of fusing data
  7. Solution: an “entity-centric” model. Diagram labels: Usual stream of events; Time-based event indexes; Periodic extracts sorted by entity ID and time; Entity-based summary indexes
  8. Entity-centric queries
     • WebSessions
       • “How long on average do my customers spend on my site?”
       • “Which users behave like bots?”
       • “What is the most common exit page?”
     • Bank Accounts
       • “Does this new payment match the typical spending behaviour of bank account X?”
  9. Entity-centric queries
     • Buyers
       • “What do the users who bought product X also buy?”
       • “Which buyers behave like ‘shills’ and who are they promoting?”
     • Cars
       • “Which cars drove long distances after failing a road worthiness test?”
  10. Use case: Web log analytics
  11. Use case: GFORCES
      • Analyses website traffic for retailers and manufacturers in the automotive industry
      • Summarises many behaviours over time, e.g.
        • unique numbers of visitors per month
        • engagement: average session durations
      • Faced scaling issues producing some results from raw events
  12. Results of moving to entity-centric indexing
      • Data store contains 150m events generated by 26m user sessions
      • Event-centric aggregations were taking ~25 seconds
      • Equivalent entity-centric aggregations take <50ms
      • Simplified queries for common entry pages, common exit pages etc.
  13. Worked example: Amazon marketplace reviews - building profiles for reviewers. Play along! Code + data here: bit.ly/entcent
  14. An “entity-centric” model
      • reviews.csv → loadEvents.sh → AmazonReviews (an event-centric index)
        • Review event fields: rating, seller, reviewer, date
      • buildEntities.sh → AmazonReviewers (an entity-centric index)
        • Drops and creates reviewers index.
        • Uses Python client to query and scroll list of reviews sorted by reviewerId and time
        • Python pushes _update requests to ~400k “Reviewer” documents, each containing bundles of their recent reviews, using bulk indexing API
        • Shard-side Groovy script collapses the multiple reviews into a single reviewer JSON document summarising behaviour
        • Reviewer entity fields: positivity, num sellers reviewed, last 50 reviews, profile (“newbie”, “fanboy” etc.)
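The bundling step can be sketched in plain Python: once events are sorted by entity ID and time, consecutive runs belong to one entity and can be grouped into a single update request each. The record fields and the action dictionary layout below are assumptions for illustration only, not the exact payload the talk's scripts or the bulk API use.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical review events; in the talk these arrive from a scrolled query.
reviews = [
    {"reviewer": "r2", "date": "2015-01-02", "seller": "s9", "rating": 2},
    {"reviewer": "r1", "date": "2015-01-03", "seller": "s1", "rating": 5},
    {"reviewer": "r1", "date": "2015-01-01", "seller": "s1", "rating": 5},
]


def build_update_actions(sorted_events, index="reviewers"):
    """Yield one update action per entity, bundling that entity's events.

    Relies on the input already being sorted by reviewer, because groupby
    only merges adjacent items with the same key.
    """
    for reviewer_id, group in groupby(sorted_events, key=itemgetter("reviewer")):
        yield {
            "op": "update",            # illustrative layout, not a real API shape
            "index": index,
            "id": reviewer_id,
            "params": {"events": list(group)},
        }


ordered = sorted(reviews, key=itemgetter("reviewer", "date"))
actions = list(build_update_actions(ordered))
```

Sorting by entity first means each entity document is touched once per run rather than once per event.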
  15. Anatomy of an entity indexing Groovy script. Code annotations: Initialize if new document; Loop to consolidate latest events; Re-run risk profile logic; Load stored state. Store the script in ES_HOME/config/scripts/foo.groovy
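The annotated steps can be mimicked in Python to show the shape of such a script. This is a stand-in, not the Groovy script from the talk: the field names, the thresholds, and the "regular" profile label are all invented for illustration.

```python
def update_reviewer(doc, new_events, max_reviews=50):
    """Python stand-in for a shard-side update script.

    `doc` is the stored entity (or None when it does not exist yet);
    `new_events` is the bundle of reviews sent with the update request.
    """
    # Initialize if new document
    if doc is None:
        doc = {"reviews": [], "sellers": []}

    # Load stored state (copied so the caller's document is not mutated)
    reviews = list(doc["reviews"])
    sellers = set(doc["sellers"])

    # Loop to consolidate latest events
    for event in new_events:
        reviews.append(event)
        sellers.add(event["seller"])
    reviews = reviews[-max_reviews:]  # keep only the most recent reviews

    # Re-run profile logic (thresholds are assumptions, not from the talk)
    positive = sum(1 for r in reviews if r["rating"] >= 4)
    positivity = positive / len(reviews) if reviews else 0.0
    if len(reviews) < 3:
        profile = "newbie"
    elif positivity > 0.9 and len(sellers) == 1:
        profile = "fanboy"
    else:
        profile = "regular"  # hypothetical catch-all label

    return {
        "reviews": reviews,
        "sellers": sorted(sellers),
        "num_sellers": len(sellers),
        "positivity": positivity,
        "profile": profile,
    }


first = update_reviewer(None, [{"seller": "s1", "rating": 5}] * 3)
second = update_reviewer(first, [{"seller": "s2", "rating": 1}])
```

Because the function takes the stored state plus a bundle of new events, it can be re-run on every batch without reprocessing the full event history.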
  16. Insight: which sellers have a lot of fanboys? Seller #187 has more than his fair share of “fanboy” reviewers…
  17. Drilling down into seller #187’s fanboys: suspiciously synchronised behaviour
  18. Worked example: UK 2013 car road worthiness tests
  19. Example background
      • In the UK all vehicles must pass an annual roadworthiness test, called an MOT (named after the Ministry of Transport)
      • It is illegal to drive a car that has failed an MOT (unless driving home from a test or to a repair centre)
      • Taxis and other forms of public transport have to be tested more frequently - every 6 months.
      • All data is freely available from data.gov.uk but with anonymised vehicle ID and inexact test locations.
  20. Example background
      • mots.csv → loadMOTs.sh → MOTs
        • Drops and creates mots index.
        • Uses Python client to bulk load all 37m road worthiness test results for 2013 (data source http://data.gov.uk/)
        • MOT event fields: result (pass/fail), vehicle ID, make + model + age, mileage, test date, test location
      • buildEntities.sh → Cars
        • Drops and creates cars index.
        • Registers CarProfileUpdater.groovy as a stored script
        • Uses Python client to query and scroll list of MOT test results sorted by vehicle ID and time
        • Python pushes _update requests to ~27m “Car” documents, each containing bundles of related MOT test results, using bulk indexing API
        • Shard-side Groovy script collapses the multiple tests into a single summary JSON document for a car, deriving summaries
        • Car entity fields: make + model + age; last test result, date, location; miles driven while failed; days between fail and fix; complete test history; suspected bad mileometer readings
  21. Data fusion logic: car attributes derived from 3 test result documents (tests 1, 2, 3). Figure labels: Test date; Mile-o-meter reading; daysForFix; badReading?; milesDrivenAfterFailure; mile-o-meterRewind
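One way to picture the fusion logic is to walk consecutive pairs of a car's tests, in date order, and derive the attributes from the differences. The sketch below is a guess at the spirit of the logic rather than the contents of CarProfileUpdater.groovy; the record layout and the rule that a falling mileage marks a bad reading are assumptions.

```python
from datetime import date

# Hypothetical test history for one vehicle, already sorted by test date.
tests = [
    {"date": date(2013, 1, 10), "mileage": 50000, "result": "fail"},
    {"date": date(2013, 1, 24), "mileage": 50300, "result": "pass"},
    {"date": date(2013, 7, 1), "mileage": 49000, "result": "pass"},
]


def fuse(tests):
    """Derive car-level attributes from consecutive pairs of test results."""
    summary = {"daysForFix": None, "milesDrivenAfterFailure": 0, "badReading": False}
    for prev, cur in zip(tests, tests[1:]):
        miles = cur["mileage"] - prev["mileage"]
        if miles < 0:
            # Mileage went backwards: flag it and ignore this pair's distance.
            summary["badReading"] = True
            continue
        if prev["result"] == "fail":
            summary["milesDrivenAfterFailure"] += miles
            if cur["result"] == "pass":
                # Records the most recent fail-to-pass gap.
                summary["daysForFix"] = (cur["date"] - prev["date"]).days
    return summary


car = fuse(tests)
```

None of these attributes exists on a single test document, which is why they are computed once at index time and stored on the car entity.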
  22. Insight: who is driving failed vehicles?
      • Q: Why is there an unexpected peak in milesDrivenWithFailure around 6 months?
      • A: Taxis
  23. Insight: Taxis keep on trucking after failures...
  24. Recycling user behaviours: a user-centric index as a recommendation engine
  25. Example background: Movielens data
      • A public dataset* of 10m movie ratings made by 71k users
      • One elasticsearch document per user with a list of their movie ratings
      * http://files.grouplens.org/datasets/movielens/ml-10m-README.html
  26. “Uncommonly common” user behaviours
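The idea of “uncommonly common” can be illustrated with a simple ratio: how often a movie appears among users who liked a seed movie, divided by how often it appears among all users. This is a plain lift ratio chosen for illustration and is not claimed to be the scoring Elasticsearch uses; the user documents below are invented.

```python
from collections import Counter

# Hypothetical user-centric documents: one entry per user, listing liked movies.
users = {
    "u1": ["Alien", "Aliens", "Titanic"],
    "u2": ["Alien", "Aliens", "Blade Runner"],
    "u3": ["Titanic", "Notting Hill"],
    "u4": ["Titanic", "Alien", "Aliens"],
    "u5": ["Titanic", "Notting Hill"],
    "u6": ["Titanic"],
}


def uncommonly_common(users, seed):
    """Score each movie by (share among fans of `seed`) / (share among all users)."""
    background = Counter(m for movies in users.values() for m in set(movies))
    fans = [movies for movies in users.values() if seed in movies]
    foreground = Counter(m for movies in fans for m in set(movies) if m != seed)
    n_all, n_fans = len(users), len(fans)
    return {
        movie: (count / n_fans) / (background[movie] / n_all)
        for movie, count in foreground.items()
    }


scores = uncommonly_common(users, "Alien")
```

A score above 1 means the movie is more common among the seed's fans than in general; a merely popular title such as "Titanic" in this toy data scores below 1. Very rare titles can score high on thin evidence, so a real system would need some guard against that.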
  27. Conclusions
  28. Entity-centric indexing: advantages
      • Efficient and simple queries
      • Advanced analytics/insights
      • Can provide a cheaper data retention policy (daily->weekly->monthly roll-ups)
      • Can reuse existing elasticsearch APIs or build entity documents using external technologies
  29. Entity-centric indexing: tips
      • Avoid “fat entities”
        • Use forgetful collections: priority queues, circular buffers, HyperLogLog
      • Avoid pointless updates
        • Use ctx.op="none" to avoid writes of insignificant changes
      • Consider options for reducing event volumes:
        • Use of aggregations in gathering events
        • Reduce related events in event-gathering script that issues updates
      • Parallelise the pull of event information
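Two of these tips can be sketched in plain Python. A bounded deque behaves as a circular buffer that forgets the oldest items, and a small decision function mirrors the idea of skipping writes for insignificant changes. The helper names, the tolerance value, and the "index" return label (standing for "write the document") are assumptions for illustration.

```python
from collections import deque


def remember(recent, new_items, keep=50):
    """Forgetful collection: keep only the last `keep` items."""
    buffer = deque(recent, maxlen=keep)  # oldest items fall off the left
    buffer.extend(new_items)
    return list(buffer)


def choose_op(old_value, new_value, tolerance=0.01):
    """Return "none" to skip the write when the change is insignificant."""
    if abs(new_value - old_value) < tolerance:
        return "none"
    return "index"  # placeholder label for "store the updated document"


last_three = remember([1, 2, 3], [4, 5], keep=3)
small_change = choose_op(0.500, 0.504)
big_change = choose_op(0.5, 0.6)
```

Capping the collection keeps each entity document at a bounded size however long the entity lives, and skipping no-op writes avoids reindexing documents whose summary has not meaningfully moved.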
  30. Entity-centric indexing
      • Incremental entity updates can be achieved by querying all events since the timestamp of the last run
      • Data integrity - implement policies for:
        • handling any failures in performing entity updates
        • retiring old entities (use of TTL?)
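The incremental approach can be sketched as a checkpointed pull: select only events newer than the last run's timestamp, sort them by entity ID and time, and advance the checkpoint. The field names `entity` and `ts` and the helper name are hypothetical.

```python
def incremental_batch(events, last_run_ts):
    """Return (events newer than last_run_ts sorted by entity and time, new checkpoint)."""
    fresh = [e for e in events if e["ts"] > last_run_ts]
    fresh.sort(key=lambda e: (e["entity"], e["ts"]))
    # If nothing is new, the checkpoint stays where it was.
    checkpoint = max((e["ts"] for e in fresh), default=last_run_ts)
    return fresh, checkpoint


# Hypothetical event log.
log = [
    {"entity": "car-2", "ts": 7},
    {"entity": "car-1", "ts": 1},
    {"entity": "car-1", "ts": 5},
]

batch, checkpoint = incremental_batch(log, last_run_ts=4)
empty_batch, same_checkpoint = incremental_batch(log, last_run_ts=checkpoint)
```

In this sketch the checkpoint should only be persisted after the entity updates for the batch have succeeded; otherwise a failed run would silently skip events, which is the kind of failure the data integrity bullet asks a policy for.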
  31. Questions? @elasticmark