ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)

A case study describing how a major real estate portal in Greece (Spitogatos.gr / HomeGreekHome.com) deployed ElasticSearch on Amazon's cloud.

  • 1. ElasticSearch on AWS: Real Estate Portal Case Study (Spitogatos.gr)
    AWSUG GR meetup #7, 27 September 2012
    Andreas Chatzakis, co-founder / IT Director, Spitogatos.gr (@achatzakis on Twitter)
  • 2. http://geekandpoke.typepad.com/geekandpoke/2010/09/instant-search.html
  • 3. #about_us
  • 4. Helping you find a property
    Finding a property in Greece is complex and lacks transparency. We make life easier for house hunters via:
    • Powerful search functionality
    • Web & Mobile
    • Location & Criteria
    • Quality content
    • Listings (we love photos)
    • Articles
    • mySpitogatos
    • Email alerts
    • Save your search
    • Favorite listings & notes
    • Contact the realtors
  • 5. Realtors love us too!
    Professionals need help in these turbulent times. We add value in multiple ways:
    • Cost-effective promotion & high-quality leads
    • Targeted channel (very)
    • Leads already filtered (we've seen the photos!)
    • Technology services for realtors
    • Turnkey web site solution
    • Listing synchronization web service
    • B2B via Spitogatos Network (SpiN), a business network / collaboration tool for realtors
    • Channel for foreign buyers via the English version
  • 6. #background
  • 7. To Search is to Find
    Search is central to what we do. Users searching for property come with structured criteria of huge variety:
    • Athens Center, residential - flat or studio, for sale, 100-150k €, 85-120 sq. meters, with a garage
    • Athens Center & N.Kosmos, residential - flat, for sale, 75-100k €, 70-100 sq. meters, 2+ bedrooms, only show listings with photos
    • Piraeus centre or Mikrolimano, commercial - store, for rent, 500-750 € per month, only listings with a recently reduced price
    • Monetization: number of listings grouped by paying member + the above criteria
    • iPhone app → listings within a geo-rectangle + the above criteria
    As a result, caching is rarely our friend!
    We used to think Lucene/Solr, ElasticSearch, CloudSearch etc. were only useful for text search and added no value for structured search. We kept insisting on optimizing MySQL (multi-column indices etc.) while throwing replicas at the problem.
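A minimal sketch of how one of the structured searches on this slide might be expressed as an ElasticSearch request body, built here as a plain Python dict. The field names (`area`, `category`, `price`, `sqm`, `location`) are hypothetical and not taken from the Spitogatos index. The `filtered` query wrapper is the form used by ElasticSearch releases of that era; later versions express the same idea with a `bool` query and its `filter` clause.

```python
import json


def build_search(areas, category, price_range, sqm_range, bbox=None):
    """Build a structured-search request body (all field names are hypothetical)."""
    filters = [
        {"terms": {"area": areas}},        # match any of the selected areas
        {"term": {"category": category}},  # exact match on property type
        {"range": {"price": {"gte": price_range[0], "lte": price_range[1]}}},
        {"range": {"sqm": {"gte": sqm_range[0], "lte": sqm_range[1]}}},
    ]
    if bbox is not None:
        # Map-style search: keep only listings inside a geo-rectangle.
        top, left, bottom, right = bbox
        filters.append({"geo_bounding_box": {"location": {
            "top_left": {"lat": top, "lon": left},
            "bottom_right": {"lat": bottom, "lon": right},
        }}})
    # No free-text part, so match everything and let the filters do the work.
    return {"query": {"filtered": {"query": {"match_all": {}},
                                   "filter": {"and": filters}}}}


body = build_search(["Athens Center", "N.Kosmos"], "flat",
                    (75000, 100000), (70, 100))
print(json.dumps(body, indent=2))
```

Every criterion becomes one independent filter, so adding or dropping a criterion does not change the shape of the request.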
  • 8. Why ElasticSearch
    Selected ElasticSearch after (very) brief research* on alternatives:
    • AWS's own CloudSearch:
      • Zero-management service: nice!
      • Not available on eu-west-1
      • Currently lacks ES functionality (e.g. geospatial, non-English analyzers)
    • Sphinx:
      • Easy MySQL integration
      • How do you scale it?*
    • Solr:
      • Industry standard
      • Seems to be perceived as somewhat harder to scale/operate?*
    • ElasticSearch:
      • Piece of cake to set up on AWS (stay tuned!)
      • Super distributed, scales & is easy on IT ops (more on that later!)
    * Disclaimer: We did not go through a detailed product selection process!
  • 9. #elasticsearch
  • 10. ElasticSearch basics
    A distributed, RESTful search engine built on top of Lucene.
    • Free schema
      • JSON documents
      • Analyzers
      • Boost levels
    • Easy & flexible search
      • Lucene query string or JSON-based search query DSL
      • Facets & highlighting
      • Spatial search
      • Custom scripts
    • Multi-tenancy
      • Store & search across multiple indices
      • Each with its own settings
      • Use case: logs, with recent ones in memory and old ones on disk
  • 11. Scaling ElasticSearch
    Designed from the ground up to be scalable & highly available.
    • Distributed
      • Indices automatically broken into shards
      • Replicas for read performance & availability
      • Multiple cluster nodes, each hosting 1+ shards/replicas
      • Peer-to-peer: each node can delegate operations to other nodes
      • Add or remove nodes at will
      • Rebalancing & routing happen automagically behind the scenes
    • Discovery
      • Multicast or unicast (declarative)
    • Gateway
      • Allows recovery in case all nodes go down
      • Local or shared storage
      • Async replication in case of shared storage
  • 12. A scale-up example
    Assume a cluster configured with 4 shards and 1 replica.
    • 1 node example: status Yellow
    • 2 nodes example: status Green
    • 3 nodes example
    Diagram legend: primary shard, replica shard, master node, regular node.
    The master node maintains the cluster state and reacts when nodes join or leave the cluster by reassigning shards.
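The Yellow/Green outcome in the scale-up example follows from one rule: a replica is never placed on the same node as its primary. Below is a toy model of that rule, assuming a simple round-robin placement. It is not ElasticSearch's actual allocation algorithm, and the function names are made up for illustration.

```python
def allocate(num_shards, num_replicas, num_nodes):
    """Toy round-robin placement; copies of one shard never share a node."""
    nodes = {n: [] for n in range(num_nodes)}
    unassigned = []
    for shard in range(num_shards):
        for copy in range(1 + num_replicas):  # copy 0 is the primary
            label = f"{shard}{'P' if copy == 0 else 'R'}"
            if copy < num_nodes:
                # Offsetting by the copy number keeps copies on distinct nodes.
                nodes[(shard + copy) % num_nodes].append(label)
            else:
                unassigned.append(label)  # no node left that lacks this shard
    return nodes, unassigned


def status(num_shards, num_replicas, num_nodes):
    """Red: a primary is unassigned. Yellow: only replicas are. Green: none."""
    _, unassigned = allocate(num_shards, num_replicas, num_nodes)
    if any(label.endswith("P") for label in unassigned):
        return "red"
    return "yellow" if unassigned else "green"


for n in (1, 2, 3):
    print(n, "node(s):", status(4, 1, n), allocate(4, 1, n)[0])
```

With a single node the four replicas have nowhere to go, which is why the cluster reports Yellow until a second node joins.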
  • 13. ElasticSearch on AWS
    Two modules make deployment on AWS a breeze.
    • EC2 discovery
      • Filter by security group, AZ, tags
      • Requires an IAM user with certain EC2 privileges: DescribeAvailabilityZones, DescribeInstances, DescribeRegions, DescribeSecurityGroups, DescribeTags
      • Very useful in autoscaling setups with ephemeral servers
    • S3 gateway
      • Long-term, reliable, async persistence of cluster state and indices
      • Allows deployment without EBS volumes
      • Still, a local gateway with EBS volumes performs better (less network used, faster recovery)
      • Won't protect from accidental deletion of an index (the deletion will propagate to shared storage)
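The EC2 privileges listed for the discovery module can be captured in a small IAM policy document. The sketch below only assembles the JSON. The policy language version string is the current IAM one (an assumption, since the talk does not show the policy that was actually used), and `Resource: "*"` is used because these are read-only Describe calls.

```python
import json

# Actions named on the slide as required by the EC2 discovery module.
EC2_DISCOVERY_ACTIONS = [
    "DescribeAvailabilityZones",
    "DescribeInstances",
    "DescribeRegions",
    "DescribeSecurityGroups",
    "DescribeTags",
]


def discovery_policy():
    """Return a least-privilege IAM policy document for EC2 discovery."""
    return {
        "Version": "2012-10-17",  # assumption: current IAM policy language version
        "Statement": [{
            "Effect": "Allow",
            "Action": [f"ec2:{action}" for action in EC2_DISCOVERY_ACTIONS],
            "Resource": "*",
        }],
    }


print(json.dumps(discovery_policy(), indent=2))
```

Granting only these read-only calls means the credentials stored on each search node cannot start, stop, or modify instances.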
  • 14. #implementation
  • 15. Indexing
    Indexing of Spitogatos.gr ads.
    • The DB is still the “source of truth”
      • We propagate DELETEs synchronously, INSERTs & UPDATEs asynchronously
      • KISS: a cron job (re)indexes never-indexed or least recently indexed listings
      • The ORM marks new/modified listings as never-indexed (so they go first)
    • Location: a multi-value field instead of the nested set model used in the DB
      • e.g. this property is in Greece, Attica, Piraeus, Port of Piraeus
      • The property is included in the results when searching for any of the above
    • Flat schema
      • Searchable listing-owner fields are included in the document (vs. a JOIN in our DB)
      • Changes to other tables might lead to a large number of listings requiring reindexing (e.g. a real estate agent becomes a paying member)
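The cron-job ordering and the flat document shape described on this slide can be sketched as follows. This is an illustration under assumptions, not the Spitogatos code: the `indexed_at`, `area_path`, and `is_paying` names are hypothetical, and a real job would read from the DB and send the documents to ElasticSearch.

```python
from datetime import datetime


def next_batch(listings, batch_size):
    """Never-indexed listings (indexed_at is None) first, then the stalest ones."""
    ordered = sorted(
        listings,
        key=lambda item: (item["indexed_at"] is not None,
                          item["indexed_at"] or datetime.min),
    )
    return ordered[:batch_size]


def to_document(listing, owner):
    """Flatten a listing and its owner into one searchable document."""
    return {
        "id": listing["id"],
        "price": listing["price"],
        # Multi-value field: every level of the area hierarchy is searchable.
        "areas": list(listing["area_path"]),
        # Owner field copied into the document instead of a JOIN at query time.
        "owner_is_paying": owner["is_paying"],
    }


path = ["Greece", "Attica", "Piraeus", "Port of Piraeus"]
listings = [
    {"id": 1, "price": 90000, "area_path": path, "indexed_at": datetime(2012, 9, 1)},
    {"id": 2, "price": 120000, "area_path": path, "indexed_at": None},
    {"id": 3, "price": 75000, "area_path": path, "indexed_at": datetime(2012, 8, 1)},
]
batch = next_batch(listings, 2)
print([item["id"] for item in batch])
print(to_document(batch[0], {"is_paying": True}))
```

Resetting `indexed_at` to `None` whenever a listing changes is enough to push it to the front of the queue, which is the role the slide assigns to the ORM.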
  • 16. Index Integrity
    Making sure our index is consistent with the DB.
    • Scrutineer ( https://github.com/Aconex/scrutineer )
      • Compares the DB and the ElasticSearch index for mismatches:
        • a document exists in ES but not in the DB (or vice versa)
        • the ES version is not up to date
      • Relies on the “_version” field, which our ORM increments on change
      • When indexing we explicitly set versioning to “external”
      • We had to “hack” it, as it doesn't work with the EC2 discovery module
      • http://labs.spitogatos.gr/?p=45
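To make the versioning idea concrete, here is an in-memory sketch of the two rules involved: with external versioning a write is accepted only if its version is strictly greater than the stored one, and an integrity check compares id/version pairs on both sides. This is a model for illustration only. It is neither Scrutineer's implementation nor a call into ElasticSearch, and the function names are invented.

```python
def index_external(index, doc_id, version, source):
    """Accept a write only if its externally supplied version is strictly newer."""
    current = index.get(doc_id)
    if current is not None and version <= current["version"]:
        return False  # stale write; ElasticSearch would report a version conflict
    index[doc_id] = {"version": version, "source": source}
    return True


def find_mismatches(db_versions, index):
    """Compare id -> version from the DB with the index contents."""
    return {
        "missing_in_index": sorted(set(db_versions) - set(index)),
        "missing_in_db": sorted(set(index) - set(db_versions)),
        "stale_in_index": sorted(
            doc_id for doc_id, version in db_versions.items()
            if doc_id in index and index[doc_id]["version"] != version
        ),
    }


index = {}
index_external(index, 1, 3, {"price": 90000})
index_external(index, 1, 2, {"price": 95000})  # rejected: older than version 3
index_external(index, 7, 1, {"price": 50000})  # listing later deleted from the DB
db_versions = {1: 4, 2: 1}                     # DB moved on; listing 2 never indexed
print(find_mismatches(db_versions, index))
```

Because the version comes from the DB, an asynchronous update that arrives late can never overwrite a newer copy of the same listing.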
  • 17. Search: Shards & Routing
    How does ElasticSearch decide in which shard to store a document?
    • By default this is based on a hash of the document id
    • Can be overridden while indexing and while searching (routing parameter)
    • We route on the hash of the area id:
      • Most users search for listings within a specific area
      • We hit only a single shard for a large percentage of the searches
    Diagram panels: “No routing” vs. “Routing by specific areaId”.
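The effect of routing can be illustrated with a few lines of Python. The sketch assumes the general scheme "shard = hash(routing value) mod number of shards" and uses CRC32 as a stand-in hash. ElasticSearch uses its own hash function, so the concrete shard numbers here are not what a real cluster would produce.

```python
import zlib

NUM_SHARDS = 4


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash of the routing value (stand-in: CRC32)."""
    return zlib.crc32(str(routing_value).encode("utf-8")) % num_shards


listings = [{"id": i, "area_id": area}
            for i, area in enumerate([42, 42, 7, 42, 7, 13])]

# Default behaviour: the document id is the routing value.
by_doc_id = {item["id"]: shard_for(item["id"]) for item in listings}
# Custom routing: every listing of an area lands on the same shard.
by_area = {item["id"]: shard_for(item["area_id"]) for item in listings}

shards_for_area_42 = {by_area[item["id"]] for item in listings
                      if item["area_id"] == 42}
print("routing by doc id :", by_doc_id)
print("routing by area id:", by_area)
print("shards holding area 42:", shards_for_area_42)
```

A search restricted to one area can pass the same routing value and be answered by a single shard instead of fanning out to all of them. The trade-off is that very popular areas can make some shards larger than others.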
  • 18. Search: Flat Schema, Facets & Scoring
    We rely a lot on ElasticSearch's flat schema, facets & scoring.
    • No joins due to the flat schema => fast!
    • Multi-value fields => fast filtering for listings in areas at various hierarchy levels
    • Facets return the list of paying agents with the number of listings matching the criteria
    • Old, slow ranking algorithm replaced by ElasticSearch scoring functionality
      • We used to go through our DB and refresh each score
      • Ad age is part of the equation
      • Now ES computes this dynamically on every search
      • We use custom scoring
      • We can modify the scoring algorithm and see the changes instantly
      • No need to recalculate scores for all listings
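The benefit of query-time scoring can be shown with a small sketch. The formula below (a base score that halves every `half_life_days`) is purely an assumption for illustration. The slides only say that ad age is part of the equation and do not give the actual algorithm.

```python
from datetime import date


def listing_score(base_score, published, today, half_life_days=30.0):
    """Hypothetical score: the base score decays by half every half_life_days."""
    age_days = max((today - published).days, 0)
    return base_score * 0.5 ** (age_days / half_life_days)


def rank(listings, today, half_life_days):
    """Order listings by a score computed on the fly, never stored."""
    return [item["id"] for item in sorted(
        listings,
        key=lambda item: listing_score(item["base"], item["published"],
                                       today, half_life_days),
        reverse=True,
    )]


listings = [
    {"id": "old-premium", "base": 2.0, "published": date(2012, 6, 1)},
    {"id": "fresh-basic", "base": 1.0, "published": date(2012, 9, 20)},
]
today = date(2012, 9, 27)
print(rank(listings, today, half_life_days=30))   # age weighs heavily
print(rank(listings, today, half_life_days=365))  # age barely matters
```

Changing `half_life_days` reorders the results immediately, because no score is stored anywhere. That is the property the slide highlights: the algorithm can be tuned without a batch job that rewrites a score for every listing.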
  • 19. Monitoring
    Sematext SPM offers a (currently free) ES monitoring solution:
    • Cluster health
    • Index stats
    • Shard stats
    • Search rate & latency
    • Cache
    • CPU & RAM
    • Disk
    • Network
    • JVM & GC
  • 20. Tooling
    ElasticSearch-Head is a GUI for browsing and interacting with a cluster.
  • 21. Backups
    We take periodic copies from the Gateway, because the Gateway is no cure for accidental deletions or bugs. The backup script:
    • Uses s3cmd to sync the S3 gateway contents to a local folder
      • Expect some errors here as files get deleted/modified
    • Disables snapshots to the gateway
    • Syncs again (no errors this time, and much faster)
    • Re-enables snapshots to the gateway
    • Zips the local folder contents, splits them into smaller files & uploads them to a secondary S3 bucket
    Get the script here: http://labs.spitogatos.gr/?p=17
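The ordering of the backup steps matters more than the individual commands, so the sketch below models only the control flow. The three callables are hypothetical placeholders for the s3cmd sync, the snapshot toggle, and the zip/split/upload step. This is not the published script, and the one addition made here is a `try/finally` so that snapshots are re-enabled even if the second sync fails.

```python
def run_backup(sync, set_snapshots, archive):
    """Run the backup steps in order; the callables are supplied by the caller."""
    sync(tolerate_errors=True)       # first pass: files may change underneath us
    set_snapshots(False)             # freeze the gateway contents
    try:
        sync(tolerate_errors=False)  # second pass: small, consistent delta
    finally:
        set_snapshots(True)          # never leave snapshots disabled
    archive()                        # zip, split, upload to the secondary bucket


# Recording stubs stand in for the real commands.
log = []
run_backup(
    sync=lambda tolerate_errors: log.append(f"sync(tolerate_errors={tolerate_errors})"),
    set_snapshots=lambda enabled: log.append(f"snapshots={'on' if enabled else 'off'}"),
    archive=lambda: log.append("zip+split+upload"),
)
print("\n".join(log))
```

The first sync does the bulk of the copying while the cluster keeps writing, so the window during which snapshots are disabled stays short.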
  • 22. Learnings
    Issues & lessons learned:
    • Faceted search can return wrong (smaller) counts when an index spans multiple shards
      • Due to the way sorting/merging is done
      • Increase the facet size depending on the cardinality of the faceted field
    • We use Elastica, a PHP client for ElasticSearch: https://github.com/ruflin/Elastica
      • It lacks document routing and version type support
      • Our own Jerry Manolarakis is working on a pull request to add setRouting and setVersionType
    • Filters vs. queries (Query DSL)
      • Filters perform an order of magnitude better than plain queries, since no scoring is performed and they are automatically cached
    Do it! Your DB will thank you.
    Charts: CPU utilization; response time pattern.
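The facet undercount is easy to reproduce with a toy model. The sketch assumes that each shard returns only its own top `size` terms and that the coordinating node sums whatever it receives. The agent names and counts are made up.

```python
from collections import Counter


def shard_top_terms(counts, size):
    """What one shard reports: only its local top `size` terms."""
    return Counter(dict(Counter(counts).most_common(size)))


def merged_facet(shards, size):
    """Sum the per-shard top lists and keep the overall top `size` terms."""
    total = Counter()
    for counts in shards:
        total.update(shard_top_terms(counts, size))
    return total.most_common(size)


shards = [
    {"agentX": 5, "agentY": 4, "agentZ": 3},
    {"agentZ": 5, "agentY": 4, "agentX": 3},
]
exact = Counter()
for counts in shards:
    exact.update(counts)

print("exact      :", dict(exact))
print("facet size2:", dict(merged_facet(shards, 2)))
print("facet size3:", dict(merged_facet(shards, 3)))
```

With a size of 2, each shard drops its third-ranked agent, so that agent's listings are missing from the merged total even though its true count ties with the others. Asking for more terms than there are distinct values on a shard makes the merged counts exact again, at the cost of larger per-shard responses.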
  • 23. Read more
    Useful resources:
    • https://speakerdeck.com/u/jmikola/p/symfony-live-london-elasticsearch
    • http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/
    • http://www.slideshare.net/elasticsearch/elasticsearch-at-berlinbuzzwords-2010
    • http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext
    Need help integrating ElasticSearch into your app? http://bacterials.net/
    Follow us on Twitter: @spitogatosLabs
    Check out our blog: http://labs.spitogatos.gr
  • 24. #questions