Scaling Analytics with elasticsearch
 

  • Hi Dan! Really interesting presentation. Elasticsearch's facets seem to work pretty well for summary statistics, but I'm interested in doing some slightly more involved operations like logistic regression, and I'm not really sure which path to take. Do you have any suggestions or experiences to share along those lines?
  • How about performance?

    Scaling Analytics with elasticsearch: Presentation Transcript

    • Scaling Analytics with elasticsearch Dan Noble @dwnoble
    • Background
      • Technologist at The HumanGeo
      • We use elasticsearch to build social media analysis tools
      • 100MM documents indexed
      • 600GB+ index size
      • Author of the Python elasticsearch driver "rawes": https://github.com/humangeo/rawes
    • Overview
      • What is elasticsearch?
      • Scaling with elasticsearch
      • How can I use elasticsearch to help with analytics?
      • Use Case: Social Media Analytics
    • What is elasticsearch?
    • Search Engine
      • Open source
      • Distributed
      • Automatic failover
      • Crazy fast
    • Search Engine
      • Actively maintained
      • REST API
      • JSON messages
      • Lucene based
    • Search
      [Diagram: an Elasticsearch "cluster" of one host holding the index "Articles"]
      • Simple case: one host
      • One index containing a set of articles
    • Distributed Search
      [Diagram: an Elasticsearch "cluster" of two hosts, one holding Articles (a) and the other Articles (b)]
      • Too much data?
      • Add another host
      • Indices can be broken up into "shards" that live on different machines
    • Redundancy
      [Diagram: an Elasticsearch cluster of two hosts; shards Articles (a) and Articles (b) each appear on both hosts]
      • Shards can be replicated to improve availability
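The shard and replica layout on the slides above is chosen per index. What follows is a minimal sketch of an index-creation body, assuming two shards and one replica (numbers picked to match the diagrams, not stated in the talk). `build_index_settings` and `total_shard_copies` are hypothetical helpers, and the commented-out call assumes the rawes client shown later in the deck.

```python
import json


def build_index_settings(shards, replicas):
    """Return an index-creation body with the given shard and replica counts."""
    if shards < 1 or replicas < 0:
        raise ValueError("need at least one shard and a non-negative replica count")
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(shards, replicas):
    """Each primary shard has `replicas` extra copies in the cluster."""
    return shards * (1 + replicas)


body = build_index_settings(shards=2, replicas=1)
print(json.dumps(body, indent=2))
print("shard copies in the cluster:", total_shard_copies(2, 1))

# Assumed usage against a running cluster (not executed here):
# es.put('articles', data=body)
```

With two shards and one replica the cluster holds four shard copies, which matches the four boxes in the Redundancy diagram.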
    • Node Auto Discovery
      [Diagram: an Elasticsearch cluster of three hosts holding copies of shards Articles (a) and Articles (b)]
      • Say we add a third host
      • elasticsearch will automatically start moving shards to this new host to distribute load
    • Failover
      [Diagram: the same three-host Elasticsearch cluster]
      • Say a host goes down
      • Shards on that host are no longer available for search
      • Elasticsearch automatically rebuilds these two shards on other hosts
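The rebalancing and failover behaviour described above can be illustrated with a toy model. This is not Elasticsearch's actual allocation algorithm: `allocate` is a hypothetical function that recomputes a placement from scratch, putting each shard copy on the least-loaded host that does not already hold a copy of the same shard. Host names are made up.

```python
def allocate(shard_copies, hosts):
    """Place each shard copy on the least-loaded host lacking that shard."""
    placement = {host: [] for host in hosts}
    for shard in shard_copies:
        candidates = [h for h in hosts if shard not in placement[h]]
        if not candidates:
            raise ValueError("not enough hosts to separate copies of %r" % shard)
        target = min(candidates, key=lambda h: len(placement[h]))
        placement[target].append(shard)
    return placement


# Two shards ("a" and "b"), each with one replica: four copies in total.
copies = ["a", "b", "a", "b"]

two_hosts = allocate(copies, ["host1", "host2"])
three_hosts = allocate(copies, ["host1", "host2", "host3"])  # a host joins
after_failure = allocate(copies, ["host1", "host3"])         # host2 goes down

for label, placement in [("two hosts", two_hosts),
                         ("three hosts", three_hosts),
                         ("after failure", after_failure)]:
    print(label, placement)
```

In every scenario both copies of a shard end up on different hosts, so losing a single host never removes the only copy of a shard.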
    • Querying
      [Diagram: a client (web application) sends the query "Barack Obama" to one host of an Elasticsearch cluster]
      • Can query against any host
      • Search for articles
      • Send request to other shards if needed
    • REST API
      • JSON query syntax
      • Developer friendly
      • Easy to get started
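    • Python Example

          import rawes

          # Connect to one host of the cluster; any host can serve the query.
          es = rawes.Elastic('elastic-00:9200')

          # Full-text search for "Barack Obama" in the "articles" index.
          es.get('articles/_search', data={
              "query": {
                  "filtered": {
                      "query": {
                          "query_string": {
                              "query": "Barack Obama"
                          }
                      }
                  }
              }
          })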
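Because the API is plain REST with JSON bodies, the same search can be expressed without a driver. The sketch below only builds the HTTP request with the standard library and does not send it. `build_search_request` is a hypothetical helper, and the body is simplified to a bare `query_string` query rather than the `filtered` wrapper used on the slide.

```python
import json
import urllib.request


def build_search_request(host, index, text):
    """Build (but do not send) a _search request for a query_string query."""
    body = {"query": {"query_string": {"query": text}}}
    return urllib.request.Request(
        url="http://%s/%s/_search" % (host, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("elastic-00:9200", "articles", "Barack Obama")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending it requires a reachable cluster (not executed here):
# response = urllib.request.urlopen(request)
```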
    • Community
    • Elasticsearch Summary
      • Scales horizontally
      • Redundancy
      • Configures itself automatically
      • Developer friendly
    • Analytics and elasticsearch
      • Date Histograms
      • Statistical facets
      • Geospatial queries
      • All with arbitrary search parameters
      • Again: Fast
    • Use Case: Social Media Analysis
      • Use social media APIs to search for data on a topic of interest
      • 100MM documents indexed
      • Sentiment analysis
      • Location extraction ("Geotagging")
    • Sample Document

          # Index one Facebook post into the "articles" index, type "facebook".
          es.post('articles/facebook', data={
              "date": "2012-09-01 08:37:55",
              "tags": {
                  "sentiment": {
                      "positive": 0.36,
                      "negative": 0.10
                  },
                  "geotags": [{
                      "term": "Cairo",
                      "location": "30.0566,31.2262",
                      "type": "geo_point"
                  }],
                  "search_terms": ["Mohamed Morsi"]
              },
              "item": {
                  "publisher": "Facebook",
                  "source_domain": "www.facebook.com",
                  "author": "James Smith",
                  "source_url": "http://www.facebook.com/5551231234/posts/414141414141",
                  "content_text": "Mohamed Morsi visits Iran for first time since 1979 ....",
                  "title": "James Smith posted a note to Facebook",
                  "author_url": "http://www.facebook.com/profile.php?id=5551231234"
              }
          })
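For the geospatial queries later in the deck to work, the `location` field has to be indexed as a geo point, which in Elasticsearch is normally declared in the type mapping. The mapping below is a sketch, not taken from the talk: it assumes the field names of the sample document, a date format matching its `date` value, and the `string` field type used by Elasticsearch releases of that era.

```python
import json

# Hypothetical mapping for the "facebook" type of the "articles" index.
facebook_mapping = {
    "facebook": {
        "properties": {
            "date": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
            "tags": {
                "properties": {
                    "geotags": {
                        "properties": {
                            "term": {"type": "string"},
                            # "lat,lon" strings are accepted for geo_point fields.
                            "location": {"type": "geo_point"},
                        }
                    }
                }
            },
        }
    }
}

print(json.dumps(facebook_mapping, indent=2))

# Assumed usage against a running cluster (not executed here):
# es.put('articles/facebook/_mapping', data=facebook_mapping)
```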
    • Analytical Queries
    • Date Histogram for Sentiment

          # Bucket matching documents by day and aggregate positive sentiment.
          es.get('articles/_search', data={
              "query": {
                  "query_string": {
                      "query": "Mohamed Morsi"
                  }
              },
              "facets": {
                  "sentiment_histogram": {
                      "date_histogram": {
                          "key_field": "date_of_information.$date",
                          "value_field": "tags.sentiment.positive",
                          "interval": "day"
                      }
                  }
              }
          })
    • Date Histogram for Sentiment
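To make clear what a date histogram with a value field computes, here is a small local equivalent: documents are grouped by calendar day and the value is aggregated per bucket. The three documents are invented for illustration, `daily_histogram` is a hypothetical helper, and the real facet returns further per-bucket fields beyond the count, total and mean shown here.

```python
from collections import defaultdict
from datetime import datetime


def daily_histogram(docs):
    """Group documents by day and aggregate the 'positive' score per day."""
    buckets = defaultdict(lambda: {"count": 0, "total": 0.0})
    for doc in docs:
        day = datetime.strptime(doc["date"], "%Y-%m-%d %H:%M:%S").date()
        buckets[day]["count"] += 1
        buckets[day]["total"] += doc["positive"]
    return [
        {
            "day": day.isoformat(),
            "count": b["count"],
            "total": b["total"],
            "mean": b["total"] / b["count"],
        }
        for day, b in sorted(buckets.items())
    ]


# Invented sample data; only the field layout follows the sample document.
docs = [
    {"date": "2012-09-01 08:37:55", "positive": 0.36},
    {"date": "2012-09-01 17:02:10", "positive": 0.12},
    {"date": "2012-09-02 09:15:00", "positive": 0.20},
]

histogram = daily_histogram(docs)
for bucket in histogram:
    print(bucket)
```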
    • Statistical Facet for Sentiment: Query

          # Summary statistics of positive sentiment over all matching documents.
          es.get('articles/_search', data={
              "query": {
                  "query_string": {
                      "query": "Mohamed Morsi"
                  }
              },
              "facets": {
                  "sentiment_stats": {
                      "statistical": {
                          "field": "tags.sentiment.positive"
                      }
                  }
              }
          })

    • Statistical Facet for Sentiment: Result

          {
              "facets": {
                  "sentiment_stats": {
                      "_type": "statistical",
                      "count": 8825,
                      "max": 0.375,
                      "mean": 0.008503991588291782,
                      "min": 0.0,
                      "std_deviation": 0.021251077265305472,
                      "sum_of_squares": 4.623648343200283,
                      "total": 75.04772576667497,
                      "variance": 0.00045160828493598306
                  }
              },
              "hits": {
                  "hits": [],
                  "max_score": 1.1120162,
                  "total": 8825
              },
              "took": 60
          }
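The fields of the statistical result are related to each other: the mean is the total divided by the count, and the variance is the population variance, which can be derived from the sum of squares. The sketch below recomputes them from the `count`, `total` and `sum_of_squares` values in the result above; `derive_stats` is a hypothetical helper.

```python
import math

# Values copied from the facet result on the slide.
reported = {
    "count": 8825,
    "total": 75.04772576667497,
    "sum_of_squares": 4.623648343200283,
    "mean": 0.008503991588291782,
    "variance": 0.00045160828493598306,
    "std_deviation": 0.021251077265305472,
}


def derive_stats(count, total, sum_of_squares):
    """Recompute mean, population variance and standard deviation."""
    mean = total / count
    variance = sum_of_squares / count - mean ** 2
    return {
        "mean": mean,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }


derived = derive_stats(
    reported["count"], reported["total"], reported["sum_of_squares"]
)
for name, value in derived.items():
    print(name, value, "reported:", reported[name])
```

Because only sums are needed, each shard can compute its partial count, total and sum of squares independently, and the partial sums can simply be added together.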
    • Top Keywords

          # The three most frequent search terms across all documents.
          es.get('articles/_search', data={
              "query": {
                  "match_all": {}
              },
              "facets": {
                  "search_terms": {
                      "terms": {
                          "field": "tags.search_terms",
                          "size": 3
                      }
                  }
              }
          })
    • Top Search Terms
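A terms facet amounts to counting how often each value of a field occurs and returning the most frequent ones. A local sketch with invented documents follows; `top_terms` is a hypothetical helper. If the field is analyzed, Elasticsearch counts individual tokens rather than whole phrases, which this sketch does not model.

```python
from collections import Counter


def top_terms(docs, field, size):
    """Return the `size` most frequent values of a multi-valued field."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.get(field, []))
    return counts.most_common(size)


# Invented sample data for illustration only.
docs = [
    {"search_terms": ["Mohamed Morsi"]},
    {"search_terms": ["Mohamed Morsi", "Barack Obama"]},
    {"search_terms": ["Barack Obama"]},
    {"search_terms": ["Mohamed Morsi", "Iran"]},
]

top = top_terms(docs, "search_terms", size=2)
print(top)
```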
    • Geospatial search

          # Documents geotagged within 20 km of latitude 30, longitude 31.
          es.get('articles/_search', data={
              "query": {
                  "filtered": {
                      "filter": {
                          "geo_distance": {
                              "distance": "20km",
                              "tags.geotags.location": {
                                  "lat": 30,
                                  "lon": 31
                              }
                          }
                      }
                  }
              }
          })
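A distance filter keeps documents whose location lies within the given radius of a center point. The sketch below approximates that test locally with the haversine formula on a spherical Earth, which is an assumption and may differ slightly from the distance calculation Elasticsearch uses. `haversine_km` and `within` are hypothetical helpers.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(radius_km, center, point):
    """True if `point` lies within `radius_km` of `center`."""
    return haversine_km(center[0], center[1], point[0], point[1]) <= radius_km


center = (30.0, 31.0)        # center used in the query on the slide
cairo = (30.0566, 31.2262)   # geotag from the sample document

distance = haversine_km(center[0], center[1], cairo[0], cairo[1])
print("distance in km:", round(distance, 1))
print("within 20 km:", within(20, center, cairo))
print("within 25 km:", within(25, center, cairo))
```

Under this approximation the Cairo geotag of the sample document lies roughly 22 to 23 km from the query center, so it would fall just outside the 20 km filter shown on the slide.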
    • Questions