
Building a CRM on top of ElasticSearch

How EverTrue is building a donor CRM on top of ElasticSearch. We cover some of the issues around scaling ElasticSearch and which aspects of ElasticSearch we are using to deliver value to our customers.

Published in: Data & Analytics


  1. How we’re building a CRM on top of ElasticSearch
  2. About me (quickly): Mark Greene / @markjgreene, Director of Engineering @ EverTrue. Love distributed data stores, love them! Using ElasticSearch for ~1 year.
  3. What does EverTrue do? We help nonprofits raise more money by allowing them to identify and build relationships with potential donors.
  4. How do we do that? Resolving identities across third-party data sources. (Obligatory database tube diagram.)
  5. Cluster Setup
     • 3 masters, 2 data nodes, AZ-aware
     • ~40M documents, ~25 GB
     • 1 index, 7 types
     • 5 shards, 1 replica
     • Peak workloads equate to 4-5k ops/s
     • Using mostly default settings
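The index layout above (5 shards, 1 replica) could be expressed with index settings along these lines. This is a minimal sketch of the create-index body, not EverTrue's actual configuration; note that 5 primaries plus 1 replica each means 10 shard copies spread across only 2 data nodes (the 3 masters hold no data):

```python
# Sketch of the index-level settings implied by the slide.
index_settings = {
    "settings": {
        "number_of_shards": 5,     # primaries, fixed at index creation
        "number_of_replicas": 1,   # one copy of each primary
    }
}

def total_shards(settings):
    """Total shard copies the cluster must allocate for this index."""
    s = settings["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])

print(total_shards(index_settings))  # 5 primaries + 5 replicas = 10
```

With 10 shard copies on 2 data nodes, each node hosts 5 shards, which is also the ceiling on per-index search parallelism per node.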
  6. Data Model
     • Mapping contains ~50 default fields
     • Most fields are stored as both analyzed and not analyzed
     • Leverage dynamic templates for custom fields created by our customers
     • Each custom field is also stored as both analyzed and not analyzed
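The "analyzed plus not-analyzed" pattern for customer-created fields can be expressed with a dynamic template in ES 1.x mapping syntax, shown here as a plain dict. This is a sketch under assumptions: the type name `contact`, template name `custom_strings`, and the `custom_*` naming convention are illustrative, not EverTrue's actual schema:

```python
# Dynamic-template sketch (ES 1.x mapping syntax): any new string field
# matching "custom_*" is indexed analyzed at its own path, with a "raw"
# multi-field kept not_analyzed for exact filtering and aggregations.
mapping = {
    "contact": {
        "dynamic_templates": [
            {
                "custom_strings": {
                    "match": "custom_*",
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "string",
                        "index": "analyzed",
                        "fields": {
                            "raw": {"type": "string", "index": "not_analyzed"}
                        },
                    },
                }
            }
        ]
    }
}
```

The template fires once per newly seen field, so customers can add fields without anyone hand-editing the mapping.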
  7. Write Path (diagram): SQS → background jobs
  8. Read Path (diagram): 1. Submit to the EverTrue Contacts API. 2. Translate to an ES DSL query; the Search API returns contact IDs. 3. Load full contact objects w/ meta. (The diagram also shows offline streaming jobs.)
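Step 2 of the read path (translate the request into an ES query that returns only contact IDs) might look roughly like this. The function name and filter fields are hypothetical; `"fields": []` is the ES 1.x idiom for suppressing `_source` so hits come back as IDs only:

```python
def build_id_query(filters, size=50):
    """Translate simple {field: value} criteria into an ES 1.x filtered
    query whose hits carry only _id (empty fields list, no _source)."""
    term_filters = [{"term": {field: value}} for field, value in filters.items()]
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"and": term_filters},
            }
        },
        "fields": [],   # hits come back with _id only
        "size": size,
    }

# Hypothetical example: find contacts tagged with a custom field value.
query = build_id_query({"custom_region": "northeast"})
```

Returning only IDs keeps the search response small; the full contact objects are then loaded in step 3 from the Contacts API.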
  9. Arbitrary field filtering • Aggregations • ES Hadoop plugin
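An aggregation over one of those fields might look like the following sketch (the field name is hypothetical). This is also where the not-analyzed copy from the data model earns its keep: aggregating on the analyzed version would bucket individual tokens rather than whole values:

```python
# Terms-aggregation sketch (ES 1.x). Aggregate on the not_analyzed
# "raw" multi-field so each bucket is a whole value, not a token.
agg_request = {
    "size": 0,  # buckets only, no hits
    "aggs": {
        "by_custom_region": {
            "terms": {"field": "custom_region.raw", "size": 20}
        }
    },
}
```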
  10. Filter Cache: Our first scaling issue. Turns out the fielddata cache is unbounded by default...
  11. First Solution
     • We set indices.fielddata.cache.size to 50%
     • No more OOME crashes
     • Then something else happened... really slow queries (problem sign #1)
  12. Slow Query?... More Hardware, Right?!
      Type:                  m1.xlarge          r3.2xlarge        r3.2xlarge
      Hardware:              4 CPU, 15 GB RAM   8 CPU, 60 GB RAM  8 CPU, 60 GB RAM
      Disk:                  round disk thingy  SSDs              SSDs
      ES version:            v1.1.2             v1.1.2            v1.3.2
      has_child query time:  12-15s             6-8s              ~100ms
  13. Lessons Learned
     • Watch the release notes & GitHub issues like a hawk
     • Don’t fall too far behind w/r/t versions (we waited too long: 6 months)
     • Keep ES fed with plenty of memory
     • Need monitoring to have any hope of understanding operational issues
  14. Settings We Tweaked
     • indices.store.throttle.max_bytes_per_sec: default 20mb -> 60mb (SSDs can handle it)
     • indices.fielddata.cache.size: set to 70% of heap
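In ES 1.x the store throttle is a dynamic cluster setting, so it can be changed on a live cluster, while the fielddata cache size is a node-level setting read from elasticsearch.yml at startup. A sketch of both, using the values from the slide:

```python
# Body for a dynamic update via PUT _cluster/settings (no restart needed).
cluster_settings = {
    "persistent": {
        "indices.store.throttle.max_bytes_per_sec": "60mb"
    }
}

# Node-level setting: goes in elasticsearch.yml and takes effect on restart.
#   indices.fielddata.cache.size: 70%
```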
  15. ES Hadoop Integration
     • We use it for a lot of our offline jobs
     • One map task per shard: small shard deployments may underutilize your Hadoop cluster
     • Mapper inputs do not contain meta fields like _version, forcing another read for write-back scenarios
  16. 16. tail -f ~/questions
