ElasticSearch in Production: lessons learned

With ProQuest Udini, we have created the world's largest online article store, and aim to be the center for researchers all over the world. We connect to a 700M Solr cluster for search, but have recently also implemented a search component with ElasticSearch. We will discuss how we did this, and how we want to use the 30M index for scientific citation recognition. We will highlight lessons learned in integrating ElasticSearch in our virtualized EC2 environments, and the challenges of aligning it with our continuous deployment processes.

  • Reader comment: 1M records do not take 40 seconds. It depends on which version you were running. Also, as already mentioned, you can break your 1M records into batches, send the requests in parallel, and target different ElasticSearch nodes. This will improve your insertion rate quite a bit.

    Presentation Transcript

    • ElasticSearch in Production: lessons learned
        • Anne Veling, ApacheCon EU, November 6, 2012
    • agenda
        • Introduction
        • ElasticSearch
        • Udini
        • Upcoming Tool
        • Lessons Learned
    • introduction
        • Anne Veling, @anneveling
        • Self-employed contractor
            • Software Architect
            • Agile process management
            • Performance optimization
            • Lucene/SOLR/ElasticSearch implementations & training
    • ElasticSearch
        • Apache Lucene
        • Started in 2010 by Shay Banon
        • Open Source – Apache License
        • A company was formed in 2012: ElasticSearch
            • Training, support and development
        • Careful feature development vs. build because you can
    • ElasticSearch
        • Scalable
            • Distributed, Node Discovery
            • Automatic sharding
            • Query distribution
        • RESTful, HTTP API
            • With API wrappers for Ruby, Java, Scala, …
            • JSON in, JSON out
        • Document Model
            • Maps book.author.lastName
            • “schemaless” -> field type recognition
            • Keeps source, keeps “version” number
    • ElasticSearch
        • Integrated faceting
            • With statistical aggregates (sum/avg/…) for free
        • Field types and analyzers
            • String, numerics, geo, attachment, …
            • Arrays, subdocuments, nested documents
        • Integrated sharding
            • Routing and alias
            • Cross-index searching / multi-document type
    • udini.proquest.com
        • ProQuest
        • The World’s Article Store
        • Stack
            • Amazon EC2
            • Scala with Unfiltered
            • MongoDB, ElasticSearch
    • architecture (diagram; labels: Fulfillment, Udini, Summon, Providers, PDF pipeline, mongo DB, Elastic Search, SOLR)
    • SOLR at Udini
        • Connecting to Summon API
            • 700M SOLR Cluster
        • In Udini, we serve a subset of 160M full text articles
            • Including fulfillment mechanisms
            • PDF and HTML5 viewing and annotation
    • ElasticSearch at Udini
        • Local index to search your articles
        • Many small user libraries, searching only locally
            • User-id as sharding key
            • Include key in all queries
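As a rough illustration of "user-id as sharding key", the sketch below builds index and search URLs that carry the `routing` request parameter, so that one user's documents are written to, and searched on, a single shard. The host, the index name `library`, the type `article`, and the field name `userId` are hypothetical and not taken from the talk. Several users can end up on the same shard, so the search body still filters on the user id.

```shell
#!/bin/sh
# Hypothetical host; override ES_HOST to point at your own cluster.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# URL for indexing one document, routed by user id.
# $1 = index, $2 = type, $3 = document id, $4 = user id
# (Releases from the time of the talk put the type in the path;
#  newer releases use _doc in that position.)
routed_index_url() {
  printf '%s/%s/%s/%s?routing=%s\n' "$ES_HOST" "$1" "$2" "$3" "$4"
}

# URL for searching only the shard that holds this user's documents.
# $1 = index, $2 = user id
routed_search_url() {
  printf '%s/%s/_search?routing=%s\n' "$ES_HOST" "$1" "$2"
}

# Search body restricted to one user; "userId" is an assumed field name.
# $1 = user id (assumed to contain no characters that need JSON escaping)
user_filter_body() {
  printf '{"query":{"term":{"userId":"%s"}}}\n' "$1"
}
```

A search for one user could then be sent with `curl --silent --request POST -H 'Content-Type: application/json' --data "$(user_filter_body user42)" "$(routed_search_url library user42)"`. User ids containing characters that are special in URLs would need URL-encoding first.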
    • Exciting new product
        • Developing for ProQuest
        • Exciting new research tool for scientific researchers
        • Creating a large ElasticSearch index for journal article canonicalization
        • Currently in private beta, launching in the coming months
    • Lessons Learned
        • Very fast indexing
            • Bulk indexing ftw
            • Set up without replicas (replicas = 0, not 1)
            • Play with bulk size
            • Simply writing batches to disk and CURLing them in is very fast
            • 1M records in 40s

          # Post every batch file in BATCH_DIR to the bulk endpoint, one request per file.
          for f in "${BATCH_DIR}"/batch-*.json
          do
            echo "about to index $f"
            curl --silent --show-error --request POST --data-binary "@$f" localhost:9200/_bulk > /dev/null
            echo
          done
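The slide does not show how the `batch-*.json` files are produced. A minimal sketch, assuming the records already sit in a file with one JSON document per line and no blank lines, is given below: it interleaves each document with a bulk action line and starts a new file every N documents, which makes it easy to "play with bulk size". The function names, the index name `articles`, and the type `article` are hypothetical. The action line is passed in by the caller because its exact fields differ between ElasticSearch versions.

```shell
#!/bin/sh
# Split a file of JSON documents into bulk request files.
# $1 = input file (one JSON document per line, no blank lines)
# $2 = bulk action line to put before every document
# $3 = documents per batch
# $4 = output directory (files are named batch-00000.json, batch-00001.json, ...)
make_bulk_batches() {
  mkdir -p "$4" || return 1
  BULK_ACTION="$2" BULK_SIZE="$3" BULK_DIR="$4" awk '
    {
      # Pick the batch file for this record from its line number.
      out = sprintf("%s/batch-%05d.json", ENVIRON["BULK_DIR"],
                    int((NR - 1) / ENVIRON["BULK_SIZE"]))
      if (out != prev) {
        if (prev != "") close(prev)
        prev = out
      }
      # The bulk format is: action line, then the document, each ending in a newline.
      print ENVIRON["BULK_ACTION"] > out
      print $0 > out
    }' "$1"
}

# Request body for the index settings API that sets the replica count,
# e.g. 0 while bulk loading and 1 afterwards.
# $1 = number of replicas
replicas_body() {
  printf '{"index":{"number_of_replicas":%s}}\n' "$1"
}
```

For example, `make_bulk_batches docs.json '{"index":{"_index":"articles","_type":"article"}}' 5000 "$BATCH_DIR"` would write files that the loop above can post as they are; newer releases no longer accept `_type` in the action line. Replicas could be switched off before the load by sending `replicas_body 0` with `curl --request PUT -H 'Content-Type: application/json'` to `localhost:9200/articles/_settings`.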
    • Lessons learned
        • Schema(less)?
        • Automatic field type recognition
            • Can miss types
            • Strict about types #duh
        • Mapping of subfields (doc.title vs doc.publication.title)
            • Version dependent
        • In reality
            • Schema still needed
            • Mapping changes still non-trivial
    • Lessons learned
        • Learn to trust ElasticSearch
            • Analyzers: do not pretokenize queries yourself…
        • Difference between “term” and “text” type queries
            • tokenized or not
        • ElasticSearch probably already does what you want it to do
            • Search for it
            • Try it
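To make the "term" versus "text" difference concrete, the sketch below prints the two request bodies side by side. It assumes a hypothetical `title` field indexed with the standard analyzer, so that "Quick Brown" is stored as the terms `quick` and `brown`. A `term` query is not analyzed and looks for the single exact term `Quick Brown`, which does not exist in that index. A `text` query runs the input through the field's analyzer first and matches. Later releases renamed the `text` query to `match`.

```shell
#!/bin/sh
# Build a single-field query body.
# $1 = query type (term, text, or match on later releases)
# $2 = field name, $3 = value (assumed to need no JSON escaping)
query_body() {
  printf '{"query":{"%s":{"%s":"%s"}}}\n' "$1" "$2" "$3"
}

# Not analyzed: searches for the exact term "Quick Brown".
query_body term title "Quick Brown"
# prints {"query":{"term":{"title":"Quick Brown"}}}

# Analyzed: the input is tokenized and lowercased like the indexed text.
query_body text title "Quick Brown"
# prints {"query":{"text":{"title":"Quick Brown"}}}
```

This is also why pretokenizing queries yourself tends to backfire: the analyzer configured on the field already applies the same processing to the query that it applied at index time.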
    • Lessons learned
        • Issues with automated testing and node discovery/startup
        • Start/stop hundreds of times during Jenkins test jobs or on development boxes
            • Takes time
            • Locally sometimes picks up previous versions
        • Memory issues: ElasticSearch manages a large part of its memory outside of the heap
            • Do not simply increase -Xmx
    • Lessons learned
        • New tools every month
        • waitForYellowStatus
        • Aliases, routing allow for clever control
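`waitForYellowStatus` is a call in the Java API. For test jobs that start a node and then talk to it over HTTP, a comparable wait can be sketched against the cluster health endpoint with its `wait_for_status` and `timeout` parameters, as below. The function names and the default timeout are assumptions of this sketch, not something shown in the talk.

```shell
#!/bin/sh
# Hypothetical host; override ES_HOST to point at your own node.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Succeeds when a cluster health response says the wait did not time out.
# $1 = JSON response body from the cluster health API
health_wait_succeeded() {
  printf '%s' "$1" | grep -Eq '"timed_out"[[:space:]]*:[[:space:]]*false'
}

# Block until the cluster is at least yellow, or until the timeout passes.
# Returns non-zero if the node is unreachable or the wait timed out.
# $1 = timeout, e.g. 30s (defaults to 30s)
wait_for_yellow() {
  response=$(curl --silent --show-error \
    "$ES_HOST/_cluster/health?wait_for_status=yellow&timeout=${1:-30s}") || return 1
  health_wait_succeeded "$response"
}
```

A test script could call `wait_for_yellow 60s || exit 1` right after starting the node, before indexing any fixtures.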
    • API
        • ElasticSearch is new, connection libraries still in infancy, documentation growing
        • Issues using the Java API in Scala
        • Happy with Scalastic now
            • synchronous
            • asynchronous
            • bulk
            • prepare
        • https://github.com/bsadeh/scalastic
    • #nodb
        • ElasticSearch used as a full nosql datastore?
        • Using “version” and optimistic locking scheme
        • Could replace MongoDb in our setup
        • ElasticSearch is actually a store optimized for getting stuff out, not for getting stuff in
            • With free faceting
        • Who needs multi-table transactions anyway?
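The "version" based optimistic locking mentioned above can be sketched as follows: read a document together with its version, write it back with that version as a request parameter, and treat an HTTP 409 response as "someone else changed it first". This is a sketch under assumptions rather than the setup used at Udini. The host, index, type, and function names are hypothetical, and the `version` parameter reflects releases from around the time of the talk; newer releases use `if_seq_no` and `if_primary_term` for the same purpose.

```shell
#!/bin/sh
# Hypothetical host; override ES_HOST to point at your own cluster.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# URL for a write that should only succeed if the stored version still matches.
# $1 = index, $2 = type, $3 = document id, $4 = version read earlier
versioned_put_url() {
  printf '%s/%s/%s/%s?version=%s\n' "$ES_HOST" "$1" "$2" "$3" "$4"
}

# Map the HTTP status code of such a write to an outcome.
# $1 = HTTP status code
write_outcome() {
  case "$1" in
    200|201) echo ok ;;        # document written
    409)     echo conflict ;;  # version mismatch: re-read, re-apply, retry
    *)       echo error ;;
  esac
}
```

The status code can be captured with `curl --silent --output /dev/null --write-out '%{http_code}' --request PUT -H 'Content-Type: application/json' --data "$doc" "$(versioned_put_url library article 7 3)"` and passed to `write_outcome`. On `conflict` the caller would fetch the document again and repeat the update.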
    • SOLR vs ElasticSearch
        • SOLR
            • Well-known, many tools, extensions
            • Feels clunky to configure
            • Manual document to lucene mapping
            • Replication and indexing in a cluster non-trivial
            • Old school ;-)
        • ElasticSearch
            • New kid on the block
            • Very easy to configure
            • Handles document to lucene mapping
            • Horizontally scalable
            • Easy replication
            • But: shard key
            • New school
    • search evolution
        • Custom indexers
        • Inverted index
        • Segment merges
        • Custom analyzers
        • Faceting
        • Configuration of analyzers
        • Faceting, Geospatial
        • Document mapping
        • Sub-document queries
        • Replication
        • JSON document input
        • Faceting, complex queries just work
    • conclusions
        • ElasticSearch benefits
            • Easy to set up
            • Very clever architecture
        • Drawbacks
            • Very new software, tool support limited
            • But lots of movement
            • Change sharding in a full index non-trivial
        • ElasticSearch
            • Clever architecture, fast, stable
            • Does exactly what you need
    • thank you
        • Are you still using Solr?
        • Come on, it’s 2012 already ;-)
        • anne@beyondtrees.com
        • @anneveling