ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)
1. ElasticSearch on AWS
Real Estate portal Case Study (Spitogatos.gr)
AWSUG GR meetup #7
27 September 2012
Andreas Chatzakis
co-founder / IT Director, Spitogatos.gr
@achatzakis on Twitter
Event sponsored by: (sponsor logo not recoverable)
4. Helping you find a property
Finding a property in Greece is complex, lacks transparency.
We make life easier for househunters via:
- Powerful search functionality
  - Web & Mobile
  - Location & Criteria
- Quality content
  - Listings (we love photos)
  - Articles
- mySpitogatos
  - Email alerts
  - Save your search
  - Favorite listings & notes
- Contact the realtors
5. Realtors love us too!
Professionals need help in these turbulent times.
We add value in multiple ways:
- Cost-effective promotion & high-quality leads
  - Targeted channel (very)
  - Leads already filtered (we've seen the photos!)
- Technology services for realtors
  - Turnkey web site solution
  - Listing synchronization web service
- B2B via Spitogatos Network (SpiN): business network / collaboration tool for realtors
- Channel for foreign buyers via the English version
7. To Search is to Find
Search is central to what we do
- Users searching for property come with structured criteria of huge variety
  - Athens Center, residential - flat or studio, for sale, 100-150k €, 85-120 sq. meters, with a garage
  - Athens Center & N. Kosmos, residential - flat, for sale, 75-100k €, 70-100 sq. meters, 2+ bedrooms, only show listings with photos
  - Piraeus centre or Mikrolimano, commercial - store, for rent, 500-750 € per month, only listings with recently reduced price
- Monetize: # of listings grouped by paying member + above criteria
- iPhone app: listings within geo-rectangle + above criteria
- As a result, caching is rarely our friend!
- We used to think Lucene/Solr, ElasticSearch, CloudSearch etc. were only useful for text search and added no value for structured search
- We kept insisting on optimizing MySQL (multi-column indices etc.) while throwing replicas at the problem
(stamped across these two points on the slide: WRONG)
8. Why ElasticSearch
We selected ElasticSearch after (very) brief research* into the alternatives:
- AWS's own CloudSearch
  - Zero-management service: nice!
  - Not available on eu-west-1
  - Currently lacks some ES functionality (e.g. geospatial search, non-English analyzers)
- Sphinx
  - Easy MySQL integration
  - How do you scale it?*
- Solr
  - Industry standard
  - Seems to be perceived as somewhat harder to scale/operate*?
- ElasticSearch
  - Piece of cake to set up on AWS (stay tuned!)
  - Super distributed, scales & is easy on IT ops (more on that later!)
* Disclaimer: We did not go through a detailed product selection process!
10. ElasticSearch basics
A distributed, RESTful Search engine built on top of Lucene
- Schema free
  - JSON documents
  - Analyzers
  - Boost levels
- Easy & flexible search
  - Lucene query string or JSON-based search query DSL
  - Facets & highlighting
  - Spatial search
  - Custom scripts
- Multi-tenancy
  - Store & search across multiple indices
  - Each with its own settings
  - Use case: logs - recent in memory, old on disk
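To give a feel for the JSON-based query DSL, the first search example from slide 7 could be written roughly as below. This is only a sketch under assumptions: every field name (`area_ids`, `category`, `property_type`, `listing_type`, `price`, `sq_meters`, `has_garage`) and the area id 101 are hypothetical, not the actual Spitogatos.gr mapping, and the `filtered` query with an `and` filter follows the ElasticSearch DSL as it looked around 2012 (later versions restructured it).

```python
import json


def build_search(area_id, property_types, min_price, max_price, min_sqm, max_sqm):
    """Build a 2012-era 'filtered' query body as a plain dict.

    All field names are hypothetical and assumed to be exact-match
    (not analyzed) fields.
    """
    filters = [
        {"term": {"area_ids": area_id}},                # multivalue location field
        {"term": {"category": "residential"}},
        {"terms": {"property_type": property_types}},   # flat OR studio
        {"term": {"listing_type": "sale"}},
        {"range": {"price": {"gte": min_price, "lte": max_price}}},
        {"range": {"sq_meters": {"gte": min_sqm, "lte": max_sqm}}},
        {"term": {"has_garage": True}},
    ]
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},   # no free-text part in this search
                "filter": {"and": filters},   # every criterion must hold
            }
        },
        "size": 20,
    }


# "Athens Center, residential - flat or studio, for sale, 100-150k €, 85-120 sq. meters, with a garage"
body = build_search(101, ["flat", "studio"], 100000, 150000, 85, 120)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a search request; all criteria are expressed as filters because none of them needs relevance scoring.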
11. Scaling ElasticSearch
Designed from the ground up to be Scalable & Highly Available
- Distributed
  - Indices automatically broken into shards
  - Replicas for read performance & availability
  - Multiple cluster nodes, each hosting 1+ shards/replicas
  - Peer-to-peer: each node can delegate operations to other nodes
  - Add or remove nodes at will
  - Rebalancing & routing happen automagically behind the scenes
- Discovery
  - Multicast or unicast (declarative)
- Gateway
  - Allows recovery in case all nodes go down
  - Local or shared storage
  - Async replication in case of shared storage
12. A scale-up example
Assume an index configured with 4 shards and 1 replica
- 1 node example - Status Yellow (all primaries allocated, replicas unassigned)
- 2 nodes example - Status Green
- 3 nodes example
(Diagram legend: primary shard, replica shard, master node, regular node)
The master node maintains cluster state and reacts when nodes join or leave the cluster by reassigning shards.
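The scale-up example can be reproduced with a toy allocator. This is a simplified sketch, not ElasticSearch's real allocation algorithm: it only enforces the rule that a replica never lives on the same node as another copy of the same shard, and it places each copy on the least loaded eligible node. The function name `allocate` and the node names are made up for illustration.

```python
def allocate(num_shards, num_replicas, nodes):
    """Toy shard allocation: returns (assignment, cluster status).

    Copies of the same shard are never placed on the same node.
    """
    assignment = {node: [] for node in nodes}
    unassigned_primaries = 0
    unassigned_replicas = 0
    for shard in range(num_shards):
        holders = set()  # nodes already holding a copy of this shard
        for copy in range(1 + num_replicas):
            candidates = [n for n in nodes if n not in holders]
            if not candidates:
                if copy == 0:
                    unassigned_primaries += 1
                else:
                    unassigned_replicas += 1
                continue
            # pick the least loaded eligible node
            target = min(candidates, key=lambda n: len(assignment[n]))
            assignment[target].append(("P" if copy == 0 else "R") + str(shard))
            holders.add(target)
    if unassigned_primaries:
        status = "red"
    elif unassigned_replicas:
        status = "yellow"
    else:
        status = "green"
    return assignment, status


for node_count in (1, 2, 3):
    nodes = ["node%d" % i for i in range(1, node_count + 1)]
    assignment, status = allocate(4, 1, nodes)
    print(node_count, "node(s):", status, assignment)
```

With a single node the four replicas have nowhere to go, which is why the cluster reports yellow; from two nodes onward every copy can be placed and the status turns green.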
13. ElasticSearch on AWS
Two modules make deployment on AWS a breeze
- EC2 discovery
  - Filter by security group, AZ, tags
  - Requires an IAM user with certain EC2 privileges: DescribeAvailabilityZones, DescribeInstances, DescribeRegions, DescribeSecurityGroups, DescribeTags
  - Very useful in autoscaling setups with ephemeral servers
- S3 gateway
  - Long-term, reliable, async persistence of cluster state and indices
  - Allows deployment without EBS volumes
  - Still, a local gateway with EBS volumes performs better (less network used, faster recovery)
  - Won't protect from accidental deletion of an index (the deletion will propagate to shared storage)
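The EC2 privileges listed for the discovery module translate into a small IAM policy. The sketch below generates one as JSON; the five action names come from the slide, while the function name and the choice of a single statement with `"Resource": "*"` are assumptions made for illustration.

```python
import json

# EC2 API calls the discovery module needs, as listed on the slide.
EC2_DISCOVERY_ACTIONS = [
    "DescribeAvailabilityZones",
    "DescribeInstances",
    "DescribeRegions",
    "DescribeSecurityGroups",
    "DescribeTags",
]


def ec2_discovery_policy():
    """Return a read-only IAM policy document for EC2 discovery (a sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:" + action for action in EC2_DISCOVERY_ACTIONS],
                "Resource": "*",
            }
        ],
    }


print(json.dumps(ec2_discovery_policy(), indent=2))
```

Keeping the user restricted to these read-only calls means leaked credentials on a search node cannot be used to start, stop or modify instances.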
15. Indexation
Indexation of Spitogatos.gr ads
- DB is still the "source of truth"
  - We propagate DELETEs synchronously, INSERTs & UPDATEs asynchronously
  - KISS: a cron job (re)indexes never-indexed or least-recently-indexed listings
  - ORM marks new/modified listings as never-indexed (so they go first)
- Location: multivalue field instead of the nested set model used in the DB
  - e.g. this property is in Greece, Attica, Piraeus, Port of Piraeus
  - The property is included in results when I search for any of the above
- Flat schema
  - Searchable listing owner fields are included in the document (vs a JOIN in our DB)
  - Changes to other tables might lead to a large # of listings requiring reindexation (e.g. a real estate agent becomes a paying member)
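The indexing approach above can be sketched in a few lines. This is an illustration, not the Spitogatos.gr code: the dictionary keys (`indexed_at`, `areas`, `owner_is_paying`, ...) and both function names are hypothetical. `next_batch` picks what the cron job would index next, and `to_document` flattens a listing, its owner and its full location path into one document.

```python
from datetime import datetime


def next_batch(listings, batch_size):
    """Never-indexed listings (indexed_at is None) first, then least recently indexed."""
    ordered = sorted(
        listings,
        key=lambda l: (l["indexed_at"] is not None, l["indexed_at"] or datetime.min),
    )
    return ordered[:batch_size]


def to_document(listing, owner, area_path):
    """Flatten listing + owner + location hierarchy into a single flat document."""
    return {
        "id": listing["id"],
        "price": listing["price"],
        "areas": list(area_path),               # multivalue field: every level of the hierarchy
        "owner_id": owner["id"],
        "owner_is_paying": owner["is_paying"],  # copied in, so no JOIN at search time
    }


listings = [
    {"id": 1, "price": 120000, "indexed_at": datetime(2012, 9, 1)},
    {"id": 2, "price": 95000, "indexed_at": None},  # new or modified: goes first
    {"id": 3, "price": 700, "indexed_at": datetime(2012, 8, 1)},
]
batch = next_batch(listings, 2)
print([l["id"] for l in batch])

doc = to_document(
    listings[1],
    {"id": 77, "is_paying": True},
    ["Greece", "Attica", "Piraeus", "Port of Piraeus"],
)
print(doc)
```

Because owner fields are copied into every document, a change such as an agent becoming a paying member has to reset `indexed_at` on all of that agent's listings so the cron job picks them up again.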
16. Index Integrity
Making sure our index is consistent with the DB
- Scrutineer ( https://github.com/Aconex/scrutineer )
  - Compares DB and ElasticSearch index for mismatches
    - Document exists in ES but not in the DB (or vice versa)
    - ES version not up to date
  - Relies on the "_version" field, which is incremented via our ORM onChange
  - When indexing we explicitly set versioning to "external"
  - Had to "hack" it as it doesn't work with the EC2 discovery module
    - http://labs.spitogatos.gr/?p=45
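The kind of comparison described above boils down to checking ids and versions on both sides. The sketch below shows the idea on two in-memory `{id: version}` maps; it is not Scrutineer's implementation, and the function name and result keys are made up for illustration.

```python
def find_mismatches(db_versions, es_versions):
    """Compare {id: version} maps taken from the DB and from the ES index."""
    db_ids = set(db_versions)
    es_ids = set(es_versions)
    return {
        "missing_in_es": sorted(db_ids - es_ids),
        "missing_in_db": sorted(es_ids - db_ids),
        # present on both sides, but the indexed copy is older than the DB row
        "stale_in_es": sorted(
            doc_id
            for doc_id in db_ids & es_ids
            if es_versions[doc_id] < db_versions[doc_id]
        ),
    }


db = {1: 3, 2: 5, 3: 1}   # listing id -> version incremented by the ORM
es = {1: 3, 2: 4, 4: 2}   # listing id -> external version stored in the index
report = find_mismatches(db, es)
print(report)
```

Since the index is fed the database's own version numbers (external versioning), a lower version in the index is a reliable sign that an update never arrived.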
17. Search - Shards & Routing
How does ElasticSearch decide in which shard to store a doc?
- By default this is done based on a hash of the document id
- Can be overridden while indexing and while searching (routing parameter)
- We shard based on a hash of the area id
  - Most users search for listings within a specific area
  - We hit only a single shard for a large percentage of the searches
(Diagram panels: "No routing specified" vs "Routing by specific areaId")
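The effect of the routing parameter can be illustrated with a small model. ElasticSearch uses its own internal hash function; `zlib.crc32` below is only a stand-in, and the listing ids, area ids and function name are invented for the example.

```python
import zlib

NUM_SHARDS = 4


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Pick a shard as hash(routing value) modulo number of shards.

    crc32 is a stand-in for ElasticSearch's internal hash.
    """
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_shards


# (listing id, area id) pairs; all listings below belong to the same area
listings = [(1001, 42), (1002, 42), (1003, 42), (1004, 42), (1005, 42)]

by_doc_id = {shard_for(listing_id) for listing_id, _ in listings}
by_area_id = {shard_for(area_id) for _, area_id in listings}

print("shards touched with default routing (document id):", sorted(by_doc_id))
print("shards touched when routing by area id:", sorted(by_area_id))
```

With default routing the listings of one area may be spread over several shards, so a search has to ask all of them; when both indexing and searching pass the area id as routing value, the search for that area only needs the one shard the hash points to.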
18. Search - Flat Schema, Facets & Scoring
We rely a lot on ElasticSearch's Flat Schema, Facets & Scoring
- No joins due to flat schema => fast!
- Multivalue fields => fast filtering for listings in areas at various hierarchy levels
- Facets functionality returns the list of paying agents with # of listings matching the criteria
- Old, slow ranking algorithm replaced by ElasticSearch scoring functionality
  - It used to go through our DB and refresh each score
  - Ad age is part of the equation
  - Now ES computes this dynamically on every search
- We use custom scoring
  - We can modify the scoring algorithm and see changes instantly
  - No need to recalculate scores for all listings
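Why computing the score at search time helps can be seen in a toy version of an age-aware score. The slides only say that ad age is part of the equation; the exponential half-life decay, the photo boost, the parameter values and the field names below are all invented for illustration and are not the Spitogatos.gr formula.

```python
from datetime import date


def listing_score(listing, today, half_life_days=30.0, photo_boost=1.5):
    """Hypothetical score: halves every `half_life_days`, boosted if the ad has photos."""
    age_days = (today - listing["published"]).days
    decay = 0.5 ** (age_days / half_life_days)
    boost = photo_boost if listing["has_photos"] else 1.0
    return boost * decay


today = date(2012, 9, 27)
listings = [
    {"id": "old-with-photos", "published": date(2012, 6, 29), "has_photos": True},
    {"id": "fresh-no-photos", "published": date(2012, 9, 27), "has_photos": False},
    {"id": "month-old-with-photos", "published": date(2012, 8, 28), "has_photos": True},
]

ranked = sorted(listings, key=lambda l: listing_score(l, today), reverse=True)
print([l["id"] for l in ranked])
```

Because age is evaluated against the current date on every search, nothing stored goes stale, and changing `half_life_days` or `photo_boost` takes effect on the next query without touching any listing.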
19. Monitoring
Sematext SPM offers a (currently free) ES monitoring solution
- Cluster Health
- Index Stats
- Shard Stats
- Search rate & latency
- Cache
- CPU & RAM
- Disk
- Network
- JVM & GC
21. Backups
We take periodic copies from the Gateway
- Because the Gateway is no cure for accidental deletions or bugs
- s3cmd syncs the S3 gateway contents to a local folder
  - Expect some errors here as files get deleted/modified
- Disables snapshots to the gateway
- Syncs again (no errors this time, and much faster)
- Re-enables snapshots to the gateway
- Zips the local folder contents, splits them into smaller files & uploads them to a secondary S3 bucket
Get the script here: http://labs.spitogatos.gr/?p=17
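The order of the backup steps matters, so here is a minimal sketch of the orchestration only. It is not the script linked above: the three callables (`sync`, `set_snapshots`, `archive_and_upload`) are hypothetical hooks that would wrap s3cmd, the ElasticSearch settings call and the zip/split/upload step. The `try/finally` is an added design choice so that snapshots are re-enabled even if the second sync fails.

```python
def run_backup(sync, set_snapshots, archive_and_upload):
    """Run the backup steps in order; the callables are hypothetical hooks."""
    # First pass while the gateway is still being written to: errors are expected.
    sync(tolerate_errors=True)
    set_snapshots(False)
    try:
        # Second pass against a quiet gateway: only the delta, no errors expected.
        sync(tolerate_errors=False)
    finally:
        # Never leave the cluster without snapshots, even if the sync failed.
        set_snapshots(True)
    archive_and_upload()


# Dry run with stand-in hooks that only record what would happen.
log = []
run_backup(
    sync=lambda tolerate_errors: log.append(("sync", tolerate_errors)),
    set_snapshots=lambda enabled: log.append(("snapshots", enabled)),
    archive_and_upload=lambda: log.append(("archive_and_upload",)),
)
for step in log:
    print(step)
```

Doing the bulk of the copy before disabling snapshots keeps the window without snapshots as short as possible.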
22. Learnings
Issues & lessons learned:
- Faceted search can return wrong (smaller) counts when the index spans multiple shards
  - Due to the way per-shard results are sorted and merged
  - Increase the facet size field depending on the cardinality of the faceted field
- We use Elastica, a PHP client for ElasticSearch - https://github.com/ruflin/Elastica
  - Lacking Document Routing and Version Type support
  - Our own Jerry Manolarakis is working on a pull request to add setRouting, setVersionType
- Filters vs queries (Query DSL)
  - Filters can perform an order of magnitude better than plain queries, since no scoring is performed and their results can be cached
- Do it! Your DB will thank you
(Charts: "CPU Utilization" and "Response time pattern")
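The facet issue is easy to reproduce with a toy model of the merge. The sketch assumes each shard reports only its own top `size` terms and the coordinating node sums those partial lists; the agent names, counts and function name are invented, and real ElasticSearch internals differ in detail.

```python
from collections import Counter


def faceted_counts(shards, size):
    """Merge per-shard top-`size` term counts the way a coordinating node would."""
    merged = Counter()
    for shard in shards:
        # each shard only reports its own top `size` terms
        for term, count in Counter(shard).most_common(size):
            merged[term] += count
    return merged.most_common(size)


# listings per paying agent, spread over two shards
shard_1 = ["A"] * 6 + ["B"] * 4 + ["C"] * 3
shard_2 = ["C"] * 5 + ["B"] * 3 + ["A"] * 1
shards = [shard_1, shard_2]

true_counts = Counter(shard_1 + shard_2)
print("true counts:  ", dict(true_counts))
print("facet size 2: ", faceted_counts(shards, 2))
print("facet size 3: ", faceted_counts(shards, 3))
```

With size 2, agent C (8 listings in total, the true leader) drops out of the result and agent A is reported with 6 listings instead of 7, because each shard cut off a term that mattered globally. Once the size covers the cardinality of the field (3 here), the merged counts are exact.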
23. Read more
Useful resources:
- https://speakerdeck.com/u/jmikola/p/symfony-live-london-elasticsearch
- http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/
- http://www.slideshare.net/elasticsearch/elasticsearch-at-berlinbuzzwords-2010
- http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext
Need help integrating ElasticSearch into your app?
http://bacterials.net/
Follow us on twitter: @spitogatosLabs
Check out our blog: http://labs.spitogatos.gr