Sometimes we need to step back and take a look at the bigger picture: not just counting huge piles of individual log records, but reasoning about the behaviours of the people who are ultimately generating this firehose of data. While your DevOps folks care deeply about log records from a machine utilization perspective, marketing wants to know what these records tell us about the customers' needs.
Elasticsearch aggregations are a great feature but are not a panacea. We can happily use them to summarise complex things like the number of web requests per day broken down by geography and browser type on a busy website, but we would quickly run out of memory if we tried to calculate something as simple as a single number for the average duration of visitor web sessions using the very same dataset.
Why does this occur? A web session duration is an example of a behavioural attribute not held on any one log record; it has to be derived by finding the first and last records for each session in our weblogs, requiring some complex query expressions and a lot of memory to connect all the data points. We can maintain a more useful joined-up picture if we run an ongoing background process to fuse related events from one index into “entity-centric” summaries in another index, e.g.:
• Web log events summarised into “web session” entities
• Roadworthiness test results summarised into “car” entities
• Reviews in a marketplace summarised into a “reviewer” entity
Using real data, this session will demonstrate how to incrementally build entity-centric indexes alongside event-centric indexes by using simple scripts to uncover interesting behaviours that accumulate over time. We'll explore:
• Which cars are driven long distances after failing roadworthiness tests?
• Which website visitors look to be behaving like “bots”?
• Which seller in my marketplace has employed an army of “shills” to boost his feedback rating?
Attendees will leave this session with all the tools required to begin building entity-centric indexes and using that data to derive richer business insights across every department in their organization.
3. www.elastic.coCopyright Elastic 2015 Copying, publishing and/or distributing
without written permission is strictly prohibited
3
A typical “event-centric” deployment
Event stream
Time-based event indexes
Problem: some aggregations are expensive
We need to join all event-level data together at query-time.
Using web server log data, answer the question: "How long on average do customers spend on my site?"
How to cripple Elasticsearch with a bucket explosion:
1. Ask a question about a value that needs to be derived from multiple documents (e.g. deriving a web session’s duration)
2. Make the joining key a high-cardinality field, e.g. something like “IP address”
3. Extra points if you use no routing of your documents, so that related content is spray-gunned across multiple shards
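To make the recipe above concrete, here is a minimal Python sketch of the kind of event-centric request it describes. The field names `client_ip` and `@timestamp` are assumptions (the slides do not name the fields), and the response at the end is a hand-made stand-in rather than real output. The point is only that one bucket per joining-key value has to exist before a single average can be computed.

```python
# Hypothetical field names; the slides do not say what the web log fields are called.
SESSION_KEY_FIELD = "client_ip"
TIME_FIELD = "@timestamp"


def session_duration_request(max_buckets):
    """Build an event-centric aggregation: one bucket per session key,
    each holding the first and last event time seen for that key."""
    return {
        "size": 0,  # no hits wanted, only aggregation results
        "aggs": {
            "sessions": {
                "terms": {"field": SESSION_KEY_FIELD, "size": max_buckets},
                "aggs": {
                    "first_seen": {"min": {"field": TIME_FIELD}},
                    "last_seen": {"max": {"field": TIME_FIELD}},
                },
            }
        },
    }


def average_duration_ms(response):
    """Average (last - first) over every returned bucket, on the client side."""
    buckets = response["aggregations"]["sessions"]["buckets"]
    if not buckets:
        return 0.0
    total = sum(b["last_seen"]["value"] - b["first_seen"]["value"] for b in buckets)
    return total / len(buckets)


# Hand-made response with two buckets, shaped like a terms aggregation result.
fake_response = {
    "aggregations": {
        "sessions": {
            "buckets": [
                {"key": "10.0.0.1", "first_seen": {"value": 1000}, "last_seen": {"value": 61000}},
                {"key": "10.0.0.2", "first_seen": {"value": 5000}, "last_seen": {"value": 25000}},
            ]
        }
    }
}

print(average_duration_ms(fake_response))
```

With millions of distinct keys, `max_buckets` has to be at least that large for the average to be exact, which is where the memory goes.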
Solution: an “entity-centric” model
Usual stream of events
Time-based event indexes
Entity-based summary indexes
Periodic extracts sorted by entity ID and time
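Sorting the extract by entity ID and then time means all events for one entity arrive together, so they can be bundled in a single streaming pass. A minimal Python sketch of that bundling step follows; the key name `entity_id` and the sample events are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def bundle_by_entity(sorted_events, entity_key="entity_id"):
    """Yield (entity_id, [events]) pairs.

    Assumes the input is already sorted by entity ID and then time,
    so each entity's events are contiguous and in chronological order.
    """
    for entity_id, group in groupby(sorted_events, key=itemgetter(entity_key)):
        yield entity_id, list(group)


# Made-up events, already in (entity_id, time) order.
events = [
    {"entity_id": "a", "time": 1, "page": "/home"},
    {"entity_id": "a", "time": 5, "page": "/buy"},
    {"entity_id": "b", "time": 2, "page": "/home"},
]

bundles = list(bundle_by_entity(events))
for entity_id, bundle in bundles:
    print(entity_id, len(bundle))
```

Because `groupby` only merges adjacent items, an unsorted extract would split one entity into several bundles, which is why the sort order matters.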
Entity-centric queries
• Web sessions
  • “How long on average do my customers spend on my site?”
  • “Which users behave like bots?”
  • “What is the most common exit page?”
• Bank accounts
  • “Does this new payment match the typical spending behaviour of bank account X?”
Entity-centric queries
• Buyers
  • “What do the users who bought product X also buy?”
  • “Which buyers behave like ‘shills’ and who are they promoting?”
• Cars
  • “Which cars drove long distances after failing a roadworthiness test?”
Use case: GFORCES
• Analyses website traffic for retailers and manufacturers in the automotive industry
• Summarising many behaviours over time, e.g.
  • unique numbers of visitors per month
  • engagement: average session durations
• Faced scaling issues producing some results from raw events
Results of moving to entity-centric indexing
• Data store contains 150m events generated by 26m user sessions
• Event-centric aggregations were taking ~25 seconds
• Equivalent entity-centric aggregations take <50ms
• Simplified queries for common entry pages, common exit pages, etc.
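Once each session is its own document, the questions above reduce to single-level aggregations over precomputed fields. The Python sketch below builds two such request bodies. The field names `duration_ms` and `exit_page` are assumptions, since the slides do not show the session entity mapping.

```python
import json

# Hypothetical field names on the session entity documents.
DURATION_FIELD = "duration_ms"
EXIT_PAGE_FIELD = "exit_page"


def average_session_duration_request():
    """One metric over a precomputed per-session field; no per-key buckets needed."""
    return {"size": 0, "aggs": {"avg_duration": {"avg": {"field": DURATION_FIELD}}}}


def common_exit_pages_request(top_n=10):
    """Top exit pages: a terms aggregation over a low-cardinality field."""
    return {
        "size": 0,
        "aggs": {"exit_pages": {"terms": {"field": EXIT_PAGE_FIELD, "size": top_n}}},
    }


print(json.dumps(average_session_duration_request(), indent=2))
```

The join work has moved from query time to index time, so the query itself no longer needs a bucket per visitor.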
An “entity-centric” model
reviews.csv -> loadEvents.sh -> AmazonReviews (an event-centric index)
Review event fields
• rating
• seller
• reviewer
• date
AmazonReviews -> buildEntities.sh -> AmazonReviewers (an entity-centric index)
buildEntities.sh
• Drops and creates reviewers index.
• Uses Python client to query and scroll list of reviews sorted by reviewerId and time
• Python pushes _update requests to ~400k “Reviewer” documents, each containing bundles of their recent reviews, using bulk indexing API
• Shard-side Groovy script collapses the multiple reviews into a single reviewer JSON document summarising behaviour
Reviewer entity fields
• positivity
• num sellers reviewed
• last 50 reviews
• profile (“newbie”, “fanboy” etc)
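The "push _update requests in bulk" step can be sketched without a client library by building the newline-delimited bulk payload by hand. This is a sketch under stated assumptions, not the deck's buildEntities.sh: the index, type and script names are hypothetical, and the update body follows the `script` object with `file` and `params` used by Elasticsearch 2.x, which differs in other versions.

```python
import json


def update_action(index, doc_type, entity_id, events, script_name):
    """Return the (header, body) pair for one scripted upsert in a bulk request.

    Assumption: 2.x-style body where the script is referenced by file name.
    """
    header = {"update": {"_index": index, "_type": doc_type, "_id": entity_id}}
    body = {
        "script": {"file": script_name, "params": {"events": events}},
        "scripted_upsert": True,  # run the script even when the document is new
        "upsert": {},             # empty starting document for new entities
    }
    return header, body


def to_bulk_payload(actions):
    """Serialise (header, body) pairs as newline-delimited JSON with a trailing newline."""
    lines = []
    for header, body in actions:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"


# Hypothetical names throughout: index "reviewers", type "reviewer", script "ReviewerUpdater".
bundle = [{"seller": "s187", "rating": 5, "date": "2013-05-01"}]
payload = to_bulk_payload(
    [update_action("reviewers", "reviewer", "r1", bundle, "ReviewerUpdater")]
)
print(payload)
```

Sending one bundle per entity, rather than one request per review, keeps the number of update operations close to the number of entities touched.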
Anatomy of an entity indexing Groovy script
• Initialize if new document
• Loop to consolidate latest events
• Re-run risk profile logic
• Load stored state
Store the script in ES_HOME/config/scripts/foo.groovy
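The stages named on this slide can be illustrated in Python, applied to the reviewer example. This is a rendering of the stages only, not the Groovy script from the talk: the profile thresholds and the "regular" label are assumptions invented for the sketch, and the derived fields here are computed from the retained recent reviews only.

```python
MAX_RECENT_REVIEWS = 50  # the reviewer entity keeps the "last 50 reviews"


def update_reviewer(doc, new_reviews):
    """Fold a bundle of new review events into a reviewer entity (a plain dict)."""
    # Initialize if new document
    if "reviews" not in doc:
        doc["reviews"] = []

    # Load stored state
    reviews = doc["reviews"]

    # Loop to consolidate latest events, keeping only the most recent ones
    for review in new_reviews:
        reviews.append(review)
    reviews.sort(key=lambda r: r["date"])
    del reviews[:-MAX_RECENT_REVIEWS]

    # Re-run profile logic (thresholds and labels are assumptions)
    sellers = {r["seller"] for r in reviews}
    positive = sum(1 for r in reviews if r["rating"] >= 4)
    doc["num_sellers_reviewed"] = len(sellers)
    doc["positivity"] = positive / len(reviews) if reviews else 0.0
    if len(reviews) < 5:
        doc["profile"] = "newbie"
    elif len(sellers) == 1 and doc["positivity"] == 1.0:
        doc["profile"] = "fanboy"
    else:
        doc["profile"] = "regular"
    return doc


# Six glowing reviews, all for the same seller.
bundle = [
    {"seller": "s187", "rating": 5, "date": "2013-05-%02d" % day}
    for day in range(1, 7)
]
reviewer = update_reviewer({}, bundle)
print(reviewer["profile"], reviewer["num_sellers_reviewed"])
```

Running the same function again with a later bundle updates the same document in place, which is what makes the approach incremental.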
Insight: which sellers have a lot of fanboys?
Seller #187 has more than his fair share of “fanboy” reviewers …
Drilling down into seller #187’s fanboys
Suspiciously synchronised behaviour
Example background
• In the UK all vehicles must pass an annual roadworthiness test, called an MOT (named after the Ministry of Transport)
• It is illegal to drive a car that has failed an MOT (unless driving home from a test or to a repair centre)
• Taxis and other forms of public transport have to be tested more frequently: every 6 months.
• All data is freely available from data.gov.uk but with anonymised vehicle ID and inexact test locations.
Example background
mots.csv -> loadMOTs.sh -> MOTs index -> buildEntities.sh -> Cars index
loadMOTs.sh
• Drops and creates mots index.
• Uses Python client to bulk load all 37m roadworthiness test results for 2013 (data source: http://data.gov.uk/)
buildEntities.sh
• Drops and creates cars index.
• Registers CarProfileUpdater.groovy as a stored script
• Uses Python client to query and scroll list of MOT test results sorted by vehicle ID and time
• Python pushes _update requests to ~27m “Car” documents, each containing bundles of related MOT test results, using bulk indexing API
• Shard-side Groovy script collapses the multiple tests into a single summary JSON document for a car, deriving summaries, e.g. the car entity fields listed below
MOT event fields
• result (pass/fail)
• vehicle ID
• make + model + age
• mileage
• test date
• test location
Car entity fields
• make + model + age
• last test result, date, location
• miles driven while failed
• days between fail and fix
• complete test history
• suspected bad mileometer readings
Data fusion logic
Car attributes derived from 3 test result documents
[Figure: test results 1, 2 and 3, labelled “Test date” and “Mile-o-meter reading”, annotated with the derived attributes daysForFix, badReading?, milesDrivenAfterFailure and mile-o-meterRewind]
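A minimal Python sketch of this kind of fusion over one car's test history follows. The output names daysForFix, milesDrivenAfterFailure and badReading come from the slide; the input field names and the exact rules are assumptions, since the slide does not show them. Here a reading lower than the previous one is flagged as bad and excluded from the mileage total.

```python
from datetime import date


def fuse_tests(tests):
    """Derive car-level attributes from test results sorted by test date.

    Each test is a dict with 'date' (datetime.date), 'mileage' (int)
    and 'passed' (bool); these input names are assumptions.
    """
    summary = {"daysForFix": None, "milesDrivenAfterFailure": 0, "badReading": False}
    open_failure = None  # first failure not yet followed by a pass
    previous = None
    for test in tests:
        if previous is not None:
            delta = test["mileage"] - previous["mileage"]
            if delta < 0:
                # Mileometer went backwards: flag it, do not count the miles
                summary["badReading"] = True
            elif not previous["passed"]:
                summary["milesDrivenAfterFailure"] += delta
        if not test["passed"]:
            if open_failure is None:
                open_failure = test
        elif open_failure is not None:
            summary["daysForFix"] = (test["date"] - open_failure["date"]).days
            open_failure = None
        previous = test
    return summary


# Three made-up test results for one car: pass, fail, then pass after repair.
history = [
    {"date": date(2013, 1, 10), "mileage": 40000, "passed": True},
    {"date": date(2013, 7, 1), "mileage": 46000, "passed": False},
    {"date": date(2013, 7, 15), "mileage": 46900, "passed": True},
]
print(fuse_tests(history))
```

None of these attributes exists on any single test document; each one needs at least two consecutive results for the same vehicle.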
Insight: who is driving failed vehicles?
Q: Why is there an unexpected peak in milesDrivenWithFailure around 6 months?
A: Taxis
Insight: taxis keep on trucking after failures…
Example background
Movielens data
• A public dataset* of 10m movie ratings made by 71k users
• One Elasticsearch document per user with a list of their movie ratings
* http://files.grouplens.org/datasets/movielens/ml-10m-README.html
“Uncommonly common” user behaviours
Entity-centric indexing: advantages
• Efficient and simple queries
• Advanced analytics/insights
• Can provide a cheaper data retention policy (daily->weekly->monthly roll-ups)
• Can reuse existing Elasticsearch APIs or build entity documents using external technologies
Entity-centric indexing: tips
• Avoid “fat entities”
  • Use forgetful collections: priority queues, circular buffers, HyperLogLog
• Avoid pointless updates
  • Use ctx.op = "none" to avoid writes of insignificant changes
• Consider options for reducing event volumes:
  • Use of aggregations in gathering events
  • Reduce related events in the event-gathering script that issues updates
• Parallelise the pull of event information
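Two of these tips, forgetful collections and skipping pointless updates, can be sketched together in Python. The circular buffer is a `collections.deque` with `maxlen`, and the returned `"none"` mirrors the role of `ctx.op = "none"` in an update script. The entity layout, the event IDs, and the use of `"index"` as the marker for "write the document" are assumptions made for this sketch.

```python
from collections import deque


def fold_events(entity, new_event_ids, max_recent=50):
    """Merge new event IDs into a bounded 'recent' list.

    Returns (entity, op): op is "none" when nothing changed, so the caller
    can skip the write, and "index" when the entity should be stored.
    """
    # Circular buffer: once full, appending drops the oldest entry
    recent = deque(entity.get("recent", []), maxlen=max_recent)
    seen = set(recent)
    fresh = [event_id for event_id in new_event_ids if event_id not in seen]
    if not fresh:
        return entity, "none"  # insignificant change: avoid a pointless write
    recent.extend(fresh)
    return dict(entity, recent=list(recent)), "index"


stored = {"recent": ["e1", "e2"]}
unchanged, op_a = fold_events(stored, ["e2"], max_recent=3)
updated, op_b = fold_events(stored, ["e3", "e4"], max_recent=3)
print(op_a, op_b, updated["recent"])
```

The bound keeps the document size flat however long an entity lives, at the cost of forgetting its oldest events.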
Entity-centric indexing
• Incremental entity updates can be achieved by querying all events since the timestamp of the last run
• Data integrity: implement policies for:
  • handling any failures in performing entity updates
  • retiring old entities (use of TTL?)
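The incremental step can be sketched in Python as a range query on the event timestamp plus a checkpoint that only moves forward. The field names `entity_id` and `@timestamp` are assumptions, and timestamps are taken to be ISO-8601 strings in one timezone so that string comparison matches time order.

```python
def events_since_request(last_run_iso, entity_field="entity_id", time_field="@timestamp"):
    """Search body selecting events newer than the last run,
    sorted by entity ID and then time so they can be bundled per entity."""
    return {
        "query": {"range": {time_field: {"gt": last_run_iso}}},
        "sort": [{entity_field: "asc"}, {time_field: "asc"}],
    }


def next_checkpoint(events, last_run_iso, time_field="@timestamp"):
    """New checkpoint: the latest event time actually processed.

    Falls back to the old checkpoint when no events were returned, so an
    empty run never moves the checkpoint.
    """
    latest = last_run_iso
    for event in events:
        if event[time_field] > latest:
            latest = event[time_field]
    return latest


# Made-up batch returned for a run whose previous checkpoint was 09:00.
batch = [
    {"entity_id": "a", "@timestamp": "2015-03-01T09:05:00"},
    {"entity_id": "b", "@timestamp": "2015-03-01T09:30:00"},
]
checkpoint = next_checkpoint(batch, "2015-03-01T09:00:00")
print(events_since_request(checkpoint))
```

Storing the checkpoint only after the entity updates succeed is one simple failure-handling policy: a failed run is repeated from the old timestamp.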