ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)

A case study describing how a major real estate portal in Greece (Spitogatos.gr / HomeGreekHome.com) deployed ElasticSearch on Amazon's cloud.

Transcript

  • 1. ElasticSearch on AWS: Real Estate portal Case Study (Spitogatos.gr)
    AWSUG GR meetup #7, 27 September 2012
    Andreas Chatzakis, co-founder / IT Director, Spitogatos.gr (@achatzakis on Twitter)
    Event sponsored by: [sponsor logo]
  • 2. http://geekandpoke.typepad.com/geekandpoke/2010/09/instant-search.html
  • 3. #about_us
  • 4. Helping you find a property
    Finding a property in Greece is complex and lacks transparency. We make life easier for house hunters via:
    • Powerful search functionality
      • Web & Mobile
      • Location & Criteria
    • Quality content
      • Listings (we love photos)
      • Articles
    • mySpitogatos
      • Email alerts
      • Save your search
      • Favorite listings & notes
    • Contact the realtors
  • 5. Realtors love us too!
    Professionals need help in these turbulent times. We add value in multiple ways:
    • Cost-effective promotion & high-quality leads
      • Targeted channel (very)
      • Leads already filtered (we've seen the photos!)
    • Technology services for realtors
      • Turnkey web site solution
      • Listing synchronization web service
    • B2B via Spitogatos Network (SpiN): business network / collaboration tool for realtors
    • Channel for foreign buyers via the English version
  • 6. #background
  • 7. To Search is to Find
    Search is central to what we do. Users searching for property come with structured criteria of huge variety:
    • Athens Center, residential - flat or studio, for sale, 100-150k €, 85-120 sq meter, with a garage
    • Athens Center & N.Kosmos, residential - flat, for sale, 75-100k €, 70-100 sq meter, 2+ bedrooms, only show listings with photos
    • Piraeus centre or Mikrolimano, commercial - store, for rent, 500-750 € per month, only listings with recently reduced price
    • Monetize: # of listings grouped by paying member + above criteria
    • iPhone app → listings within geo-rectangle + above criteria
    As a result, caching is rarely our friend!
    We used to think Lucene/Solr, ElasticSearch, CloudSearch etc. were only useful for text search and added no value for structured search. We kept insisting on optimizing MySQL (multi-column indices etc.) while throwing replicas at the problem.
  • 8. Why ElasticSearch
    We selected ElasticSearch after (very) brief research* on alternatives:
    • AWS's own CloudSearch
      • Zero-management service: nice!
      • Not available on eu-west-1
      • Currently lacks ES functionality (e.g. geospatial, non-English analyzers)
    • Sphinx
      • Easy MySQL integration
      • How do you scale it?*
    • Solr
      • Industry standard
      • Seems to be perceived as somewhat harder to scale/operate*
    • ElasticSearch
      • Piece of cake to set up on AWS (stay tuned!)
      • Super distributed, scales & is easy on IT ops (more on that later!)
    * Disclaimer: We did not go through a detailed product selection process!
  • 9. #elasticsearch
  • 10. ElasticSearch basics
    A distributed, RESTful search engine built on top of Lucene.
    • Free Schema
      • JSON documents
      • Analyzers
      • Boost levels
    • Easy & flexible Search
      • Lucene query string or JSON-based search query DSL
      • Facets & Highlighting
      • Spatial search
      • Custom scripts
    • Multi Tenancy
      • Store & search across multiple indices
      • Each with its own settings
      • Use case: logs, with recent ones in memory and old ones on disk
  • 11. Scaling ElasticSearch
    Designed from the ground up to be scalable & highly available.
    • Distributed
      • Indices automatically broken into shards
      • Replicas for read performance & availability
      • Multiple cluster nodes, each hosting 1+ shards/replicas
      • Peer-to-peer: each node can delegate operations to other nodes
      • Add or remove nodes at will
      • Rebalancing & routing happen automagically behind the scenes
    • Discovery
      • Multicast or unicast (declarative)
    • Gateway
      • Allows recovery in case all nodes go down
      • Local or shared storage
      • Async replication in case of shared storage
  • 12. A scale-up example
    Assume a cluster with 4 shards and 1 replica configuration.
    • 1 node example: status Yellow
    • 2 nodes example: status Green
    • 3 nodes example
    [Diagram legend: primary shard, replica shard, master node, regular node]
    The master node maintains cluster state and reacts when nodes join or leave the cluster by reassigning shards.
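The Yellow and Green statuses in the scale-up example can be reproduced with a deliberately simplified model. The sketch below assumes only one rule: a copy of a shard (primary or replica) is never placed on a node that already holds another copy of the same shard. The function name `cluster_health` and its return shape are illustrative, not part of ElasticSearch.

```python
from typing import Tuple


def cluster_health(num_nodes: int, num_primaries: int, num_replicas: int) -> Tuple[str, int]:
    """Return (status, unassigned_shard_copies) under a simplified allocation model.

    Assumption: every copy of a shard needs its own node, so a shard with
    1 primary + num_replicas replicas needs (1 + num_replicas) distinct nodes.
    """
    copies_per_shard = 1 + num_replicas
    if num_nodes < 1:
        # Nothing can be allocated, not even the primaries.
        return "red", num_primaries * copies_per_shard
    assigned_per_shard = min(copies_per_shard, num_nodes)
    unassigned = num_primaries * (copies_per_shard - assigned_per_shard)
    # Primaries are all assigned here; status depends only on missing replicas.
    return ("green" if unassigned == 0 else "yellow"), unassigned


for nodes in (1, 2, 3):
    print(nodes, cluster_health(nodes, num_primaries=4, num_replicas=1))
```

With a single node the four replicas have nowhere to go, hence Yellow; a second node is enough to place them all, and a third only spreads the same eight shard copies more thinly.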
  • 13. ElasticSearch on AWS
    2 modules make deployment on AWS a breeze.
    • EC2 discovery
      • Filter by security group, AZ, tags
      • Requires an IAM user with certain EC2 privileges: DescribeAvailabilityZones, DescribeInstances, DescribeRegions, DescribeSecurityGroups, DescribeTags
      • Very useful in autoscaling setups with ephemeral servers
    • S3 gateway
      • Long-term, reliable, async persistence of cluster state and indices
      • Allows deployment without EBS volumes
      • Still, a local gateway with EBS volumes performs better (less network used, faster recovery)
      • Won't protect from accidental deletion of an index (the deletion will propagate to shared storage)
  • 14. #implementation
  • 15. Indexation
    Indexation of Spitogatos.gr ads.
    • The DB is still the "source of truth"
      • We propagate DELETEs synchronously, INSERTs & UPDATEs asynchronously
      • KISS: a cron job (re)indexes never-indexed or least-recently indexed listings
      • The ORM marks new/modified listings as never-indexed (so they go first)
    • Location: multivalue field instead of the nested set model in the DB
      • e.g. this property is in Greece, Attica, Piraeus, Port of Piraeus
      • The property will be included in results when I search for any of the above
    • Flat schema
      • Searchable listing owner fields are included in the document (vs a JOIN in our DB)
      • Changes to other tables might lead to a large # of listings requiring reindexation (e.g. a real estate agent becomes a paying member)
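The cron job's ordering rule (never-indexed listings first, then the least recently indexed) can be sketched as a sort key. This is an illustration only: the `Listing` type, the `last_indexed` field and the `next_batch` function are assumed names, and the real job presumably selects from the DB rather than sorting in memory.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class Listing(NamedTuple):
    listing_id: int
    # None means "never indexed", which is how new/modified listings are marked.
    last_indexed: Optional[datetime]


def next_batch(listings: List[Listing], batch_size: int) -> List[Listing]:
    """Pick the listings the next cron run should (re)index."""
    ordered = sorted(
        listings,
        # False sorts before True, so never-indexed listings come first;
        # the rest are ordered oldest-indexed first.
        key=lambda item: (item.last_indexed is not None,
                          item.last_indexed or datetime.min),
    )
    return ordered[:batch_size]


listings = [
    Listing(1, datetime(2012, 9, 20)),
    Listing(2, None),
    Listing(3, datetime(2012, 9, 1)),
    Listing(4, None),
]
batch_ids = [item.listing_id for item in next_batch(listings, 3)]
print(batch_ids)  # [2, 4, 3]
```

Resetting a listing to "never indexed" on change means no separate queue is needed: the same query serves both fresh changes and the slow background refresh.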
  • 16. Index Integrity
    Making sure our index is consistent with the DB.
    • Scrutineer ( https://github.com/Aconex/scrutineer )
      • Compares the DB and the ElasticSearch index for mismatches
        • Document exists in ES but not in the DB (or vice versa)
        • ES version not up to date
      • Relies on the "_version" field, which is incremented via our ORM onChange
      • When indexing we explicitly set versioning to "external"
      • Had to "hack" it as it doesn't work with the EC2 discovery module
      • http://labs.spitogatos.gr/?p=45
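The kind of check described above can be illustrated with two id-to-version maps. This is not Scrutineer's implementation (Scrutineer is a separate Java tool); it is a minimal sketch of the three mismatch classes, with `find_mismatches` and the sample data as assumed names and values.

```python
from typing import Dict, List, Tuple


def find_mismatches(db: Dict[int, int], es: Dict[int, int]) -> List[Tuple[int, str]]:
    """Compare id -> version maps from the DB and from the search index."""
    problems = []
    for doc_id in sorted(db.keys() | es.keys()):
        if doc_id not in es:
            problems.append((doc_id, "missing in ES"))
        elif doc_id not in db:
            problems.append((doc_id, "missing in DB"))
        elif es[doc_id] != db[doc_id]:
            # With external versioning the DB version is authoritative.
            problems.append((doc_id, "version mismatch"))
    return problems


db_versions = {1: 3, 2: 5, 3: 1}
es_versions = {1: 3, 2: 4, 4: 2}
report = find_mismatches(db_versions, es_versions)
print(report)
# [(2, 'version mismatch'), (3, 'missing in ES'), (4, 'missing in DB')]
```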
  • 17. Search: Shards & Routing
    How does ElasticSearch decide in which shard to store a doc?
    • By default this is based on a hash of the document id
    • Can be overridden while indexing and while searching (routing parameter)
    • We route based on a hash of the area id
      - Most users search for listings within a specific area
      - We hit only a single shard for a large percentage of the searches
    [Diagrams: search with no routing vs. routing by a specific areaId]
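The effect of routing by area id can be simulated without a cluster. In the sketch below `zlib.crc32` is only a stand-in hash (ElasticSearch uses its own hash function), and `shard_for`, `shards_to_query` and the sample listings are assumed names and data.

```python
import zlib
from typing import List, Optional

NUM_SHARDS = 4


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing value to a shard number (stand-in hash, not ES's own)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def shards_to_query(area_id: Optional[int] = None,
                    num_shards: int = NUM_SHARDS) -> List[int]:
    """Without routing every shard is searched; with routing only one is."""
    if area_id is None:
        return list(range(num_shards))
    return [shard_for(str(area_id), num_shards)]


listings = [
    {"id": 101, "area_id": 7},
    {"id": 102, "area_id": 7},
    {"id": 103, "area_id": 9},
]
# Index time: route each listing by its area id instead of its document id.
placement = {doc["id"]: shard_for(str(doc["area_id"])) for doc in listings}

print(placement[101] == placement[102])  # same area, same shard
print(shards_to_query())                 # no routing: all shards
print(shards_to_query(area_id=7))        # routed: a single shard
```

The trade-off is that the same routing value has to be supplied at search time, and searches spanning several areas still fan out to more than one shard.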
  • 18. Search: Flat Schema, Facets & Scoring
    We rely a lot on ElasticSearch's flat schema, facets & scoring.
    • No joins due to the flat schema => fast!
    • Multivalue fields => fast filtering for listings in areas of various hierarchy levels
    • Facets functionality returns the list of paying agents with the # of listings matching the criteria
    • Old, slow ranking algorithm replaced by ElasticSearch scoring functionality
      • We used to go through our DB and refresh the score
      • Ad age is part of the equation
      • Now ES computes this dynamically on every search
      • We use custom scoring
      • We can modify the scoring algorithm and see changes instantly
      • No need to recalculate scores for all listings
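What the agent facet computes can be shown in plain Python over flat documents. This sketch does not call ElasticSearch; the field names (`areas`, `agent`, `paying`, `price`) and the sample data are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Flat documents: owner fields are copied into each listing, and "areas"
# is a multivalue field holding every level of the location hierarchy.
listings: List[Dict] = [
    {"id": 1, "agent": "agent_a", "paying": True,
     "areas": ["Greece", "Attica", "Piraeus"], "price": 90000},
    {"id": 2, "agent": "agent_a", "paying": True,
     "areas": ["Greece", "Attica", "Piraeus"], "price": 98000},
    {"id": 3, "agent": "agent_b", "paying": True,
     "areas": ["Greece", "Attica", "Piraeus"], "price": 95000},
    {"id": 4, "agent": "agent_c", "paying": False,
     "areas": ["Greece", "Attica", "Piraeus"], "price": 80000},
    {"id": 5, "agent": "agent_b", "paying": True,
     "areas": ["Greece", "Attica", "Athens Center"], "price": 99000},
]


def agent_facet(docs: List[Dict], area: str, max_price: int) -> List[Tuple[str, int]]:
    """Count matching listings per paying agent, most listings first."""
    matching = (
        d for d in docs
        if d["paying"] and area in d["areas"] and d["price"] <= max_price
    )
    return Counter(d["agent"] for d in matching).most_common()


facet = agent_facet(listings, area="Piraeus", max_price=100000)
print(facet)  # [('agent_a', 2), ('agent_b', 1)]
```

Searching for "Attica" instead of "Piraeus" would match all four paying listings with the same membership test, which is the point of storing every hierarchy level in one multivalue field.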
  • 19. Monitoring
    Sematext SPM offers a (currently free) ES monitoring solution:
    • Cluster Health
    • Index Stats
    • Shard Stats
    • Search rate & latency
    • Cache
    • CPU & RAM
    • Disk
    • Network
    • JVM & GC
  • 20. Tooling
    ElasticSearch-Head is a GUI for browsing and interacting with a cluster.
  • 21. Backups
    We take periodic copies from the Gateway, because the Gateway is no cure for accidental deletions or bugs. The backup script:
    • Uses s3cmd to sync the S3 gateway contents to a local folder
      • Expect some errors here as files get deleted/modified
    • Disables snapshots to the gateway
    • Syncs again (no errors this time, and much faster)
    • Re-enables snapshots to the gateway
    • Zips the local folder contents, splits them into smaller files & uploads to a secondary S3 bucket
    Get the script here: http://labs.spitogatos.gr/?p=17
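The order of the backup steps matters more than the tools, so the sketch below models only the sequence, with every step injected as a callable. All three callables are hypothetical placeholders (the real script uses s3cmd and is linked above), and the `try`/`finally` around the second sync is an added safety choice, not something the slides state.

```python
from typing import Callable, List


def backup_gateway(sync: Callable[[], None],
                   set_snapshots: Callable[[bool], None],
                   archive_and_upload: Callable[[], None]) -> None:
    """Two-pass copy: bulk sync while live, then a short consistent sync."""
    sync()                    # pass 1: slow, may hit files that change underneath
    set_snapshots(False)      # freeze the gateway contents
    try:
        sync()                # pass 2: only the small delta, no concurrent writes
    finally:
        set_snapshots(True)   # re-enable snapshots even if the sync failed
    archive_and_upload()      # zip, split, upload to the secondary bucket


# Dry run with stand-in steps that only record what would happen.
log: List[str] = []
backup_gateway(
    sync=lambda: log.append("sync"),
    set_snapshots=lambda on: log.append("snapshots on" if on else "snapshots off"),
    archive_and_upload=lambda: log.append("archive+upload"),
)
print(log)
```

Doing the bulk copy first keeps the window with snapshots disabled as short as possible.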
  • 22. Learnings
    Issues & lessons learned:
    • Faceted search can return wrong (smaller) counts (on multiple shards)
      • Due to the way sorting/merging is done
      • Increase the facet size field depending on the cardinality of the faceted field
    • We use Elastica, a PHP client for ElasticSearch: https://github.com/ruflin/Elastica
      • Lacking Document Routing and Version Type support
      • Our own Jerry Manolarakis is on a pull request to add setRouting and setVersionType
    • Filters vs queries (Query DSL)
      • Filters perform an order of magnitude better than plain queries, since no scoring is performed and they are automatically cached
    • Do it! Your DB will thank you
    [Charts: CPU utilization; response time pattern]
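The facet undercount on multiple shards can be reproduced with a toy model: each shard returns only its local top `size` terms and the coordinating node sums whatever it receives. This is a simplified illustration of the merging problem, not ElasticSearch's actual code; `merged_facet` and the sample data are assumed.

```python
from collections import Counter
from typing import List, Tuple


def merged_facet(shards: List[List[str]], size: int) -> List[Tuple[str, int]]:
    """Sum per-shard top-`size` term counts, as a coordinating node would."""
    merged: Counter = Counter()
    for shard_terms in shards:
        # Each shard only reports its own `size` most frequent terms.
        for term, count in Counter(shard_terms).most_common(size):
            merged[term] += count
    return merged.most_common()


shards = [
    ["a"] * 6 + ["b"] * 4 + ["c"] * 3,   # shard 1
    ["c"] * 7 + ["b"] * 5 + ["a"] * 2,   # shard 2
]
# True totals across both shards: c=10, b=9, a=8.

small = merged_facet(shards, size=2)
large = merged_facet(shards, size=3)
print(small)  # [('b', 9), ('c', 7), ('a', 6)]  -> c and a are undercounted
print(large)  # [('c', 10), ('b', 9), ('a', 8)] -> exact
```

With `size=2` shard 1 never reports its three `c` hits, so `c` drops from first place to second. Once the requested size covers the cardinality of the field on each shard, the merged counts are exact.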
  • 23. Read more
    Useful resources:
    • https://speakerdeck.com/u/jmikola/p/symfony-live-london-elasticsearch
    • http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/
    • http://www.slideshare.net/elasticsearch/elasticsearch-at-berlinbuzzwords-2010
    • http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext
    Need help integrating ElasticSearch into your app? http://bacterials.net/
    Follow us on Twitter: @spitogatosLabs
    Check out our blog: http://labs.spitogatos.gr
  • 24. #questions