EyeEm

Lars Fronius
@LarsFronius
ElasticSearch in production at EyeEm
ABOUT ME
• Happy grumpy cat of EyeEm
• Started in ops in a scientific data center
• Now dev
• Developers hate me sometimes
EyeEm is the world’s premier community and
marketplace for the photographer inside all of us
API STACK
• PHP
• MySQL (~10k commands per second)
• Memcached (~50k commands per second)
• Redis (~3k commands per second)
• S3 (~1k commands per second, 40m photos stored)
• Elasticsearch (~250 commands per second - elasticsearch-php)
• All writes are async
• Metrics everywhere
CURRENT CLUSTER SPECS
• 3 x m3.xlarge (4 cores, 15GiB Mem, 2 x 40GB SSD)
• cloud-aws plugin to interconnect.
• OpenJDK 1.6
• 60% heap size (9 GiB)
• 4 Indexes, 5 Shards each. From 1GB to 15GB
CURRENT PRODUCTION USE-CASES
• Album Search
• People Search
• Discover: City Search and Live Nearby
CURRENT BETA USE-CASES
LONG STORY
• MyISAM full-text search
• Album Search on one ElasticSearch node
• People Search added
• Scale-out to 3 instances for Photo Search (+ Live Nearby)
ELASTICSEARCH - INTERNALS
• Index
  • What your application sees.
  • A view onto a logical namespace inside ElasticSearch.
  • Consists of a fixed number of shards.
  • "To index" means to put your data into ElasticSearch so that it is persisted and available for search.
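Why the shard count is fixed becomes clearer from how a document finds its shard: roughly, a hash of the routing key (by default the document ID) modulo the number of primary shards. The sketch below is only an illustration of that idea. The MD5-based hash is a stand-in, not the hash ElasticSearch uses, and `shard_for` is a hypothetical helper.

```python
import hashlib

def shard_for(doc_id: str, number_of_shards: int) -> int:
    """Pick a shard for a document ID (stand-in hash, for illustration only)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % number_of_shards

# Compare the placement of the same IDs with 5 shards and with 6 shards.
ids = [str(i) for i in range(1000)]
with_5 = {i: shard_for(i, 5) for i in ids}
with_6 = {i: shard_for(i, 6) for i in ids}
moved = sum(1 for i in ids if with_5[i] != with_6[i])
print(f"{moved} of {len(ids)} documents would land on a different shard")
```

Changing the modulus moves most documents to a different shard, which is why a different shard count means a new index and a reindex.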
• Inverted Index/Mapping
  • The mapping tells Lucene how to create the inverted index in order to make data searchable.
  • E.g. "EyeEm" as an nGram{2,3} gets "indexed" as ["Ey","ye","eE","Em","Eye","yeE","eEm"]; "yeah" would be ["ye","ea","ah","yea","eah"].
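The nGram{2,3} splitting can be sketched in a few lines of Python. This is a plain illustration, not the Lucene tokenizer: `ngrams` is a hypothetical helper, and it lists all 2-grams before the 3-grams, which may differ from the order a real analyzer emits.

```python
def ngrams(text, min_gram=2, max_gram=3):
    """Return every substring of length min_gram..max_gram, shortest first."""
    grams = []
    for n in range(min_gram, max_gram + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

print(ngrams("EyeEm"))
print(ngrams("yeah"))
```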
• Inverted Index/Mapping by example (document 1 = "EyeEm", document 2 = "yeah")
  Term   Documents
  Ey     1
  ye     1, 2
  eE     1
  Em     1
  Eye    1
  yeE    1
  eEm    1
  ea     2
  ah     2
  yea    2
  eah    2
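A minimal sketch of how such a term-to-documents table can be built, assuming document 1 is "EyeEm" and document 2 is "yeah". The helper names are hypothetical and the code only mimics the idea, not Lucene's on-disk format.

```python
from collections import defaultdict

def ngrams(text, min_gram=2, max_gram=3):
    """Return every substring of length min_gram..max_gram."""
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in ngrams(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index({1: "EyeEm", 2: "yeah"})
for term, doc_ids in index.items():
    print(term, doc_ids)
```

A search for a term is then a dictionary lookup: `index["ye"]` returns both documents, `index["Em"]` only the first.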
SCHEMA-LESS OR WHAT?
• Yes and no.
• Yes: you can put anything that can be formatted as JSON into your index, and you get a readable document back.
• No: you have to think first, because changing your mapping is expensive, since you have to reindex.
ELASTICSEARCH - INTERNALS
• Shard
  • An instance of Lucene
  • Consists of multiple Lucene segments
  • Manages segments (merging, fsync, deletion etc.)
• Segments API
http://example.es:9200/yourindex/_segments
indices: { eyephoto6: { shards: {
  0: [
    {
      routing: {
        state: "STARTED",
        primary: true,
        node: "PiVDZW-VRYmeaVOy7afoWQ"
      },
      num_committed_segments: 2,
      num_search_segments: 3,
      segments: {
        _l: {
          generation: 21,
          num_docs: 13,
          deleted_docs: 0,
          size_in_bytes: 30810,
          memory_in_bytes: 589,
          committed: true,
          search: true,
          version: "4.7",
          compound: true
        },
        _m: {
          generation: 22,
          num_docs: 371,
          deleted_docs: 16,
          size_in_bytes: 408548,
          memory_in_bytes: 7365,
          committed: false,
          search: true,
          version: "4.7",
          compound: false
        },
        _n: {
          generation: 23,
          num_docs: 16,
          deleted_docs: 0,
          size_in_bytes: 38514,
          memory_in_bytes: 615,
          committed: false,
          search: true,
          version: "4.7",
          compound: true
        }
      }
    }
  ],
  1: [
• Segments
  • Managed by ElasticSearch
  • The storage for the inverted index
• Basically, ElasticSearch is a Lucene cluster manager and API.
LESSONS LEARNED - SHARDS/SEGMENTS
• Deleting a document only marks it as deleted; it is not removed immediately.
• Updating a document creates a new document and marks the old one as deleted.
• The actual cleanup happens in the background and can result in nice performance surprises.
• Nested documents live in the same Lucene segment as their parent document.
  • Can bloat up memory usage a lot.
  • They are treated like every other document.
  • If you don't always have to search in them, go for parent-child.
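One way to see how much "marked as deleted" data a shard is still carrying is to sum the counters from a `_segments` response. The sketch below assumes a response shaped like the listing earlier in this deck, and assumes `num_docs` counts live documents only. `deleted_ratio` is a hypothetical helper and the sample numbers are copied from that listing.

```python
def deleted_ratio(shard_copy):
    """Fraction of documents in a shard copy that are only marked as deleted."""
    segments = list(shard_copy["segments"].values())
    live = sum(seg["num_docs"] for seg in segments)
    deleted = sum(seg["deleted_docs"] for seg in segments)
    total = live + deleted
    return deleted / total if total else 0.0

# Shard 0 of eyephoto6, reduced to the two counters used here.
shard_0 = {
    "segments": {
        "_l": {"num_docs": 13, "deleted_docs": 0},
        "_m": {"num_docs": 371, "deleted_docs": 16},
        "_n": {"num_docs": 16, "deleted_docs": 0},
    }
}
print(f"{deleted_ratio(shard_0):.1%} of documents are waiting for cleanup")
```

Tracking this ratio alongside your other metrics can help explain merge activity that would otherwise look like a surprise.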
LESSONS LEARNED - ELASTICSEARCH
• Start with more than one instance - it is just too simple.
• Major upgrades are a pain (0.90 -> 1.1).
• PHP client libraries mostly do not handle connection pools properly; use elasticsearch-php
  • 'connectionPoolClass' => '\Elasticsearch\ConnectionPool\StaticConnectionPool'
  • or let an intermediate web server handle it
• You will index more than once. Promise. Be prepared.
• Rebalancing is smooth, don't worry.
• Have your metrics ready.
• "You can have a good time with ElasticSearch, if you don't ignore the complexity and internals of this distributed database."
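To illustrate the connection-pool point above: a static pool keeps a fixed host list, hands hosts out round-robin and skips the ones marked dead, without discovering nodes on its own. The sketch below shows that idea in Python. It is not the elasticsearch-php implementation, and the class, its methods and the host names are hypothetical.

```python
import itertools

class StaticPool:
    """Round-robin over a fixed host list, skipping hosts marked dead."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = list(hosts)
        self._dead = set()
        self._cycle = itertools.cycle(self._hosts)

    def next_host(self):
        # Try each host at most once per call.
        for _ in range(len(self._hosts)):
            host = next(self._cycle)
            if host not in self._dead:
                return host
        # Every host is marked dead: hand one out anyway so the caller can retry.
        return next(self._cycle)

    def mark_dead(self, host):
        self._dead.add(host)

    def mark_alive(self, host):
        self._dead.discard(host)

pool = StaticPool(["es1:9200", "es2:9200", "es3:9200"])
first_round = [pool.next_host() for _ in range(4)]
pool.mark_dead("es2:9200")
after_failure = [pool.next_host() for _ in range(3)]
print(first_round, after_failure)
```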
LESSONS LEARNED - INDEX/MAPPING
• Different analysers should go into separate fields
  • Score individually - iterative optimisations possible
  • Keep a raw field
• Use dynamic_templates once you have found the holy grail of field analysis.
• Filter first! Querying and scoring are expensive.
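"Filter first" can be sketched as wrapping the scoring query in a `filtered` query, the 1.x-era syntax used throughout this deck, so that only documents passing the filters get scored. The helper `filtered_search` and the `city` field are assumptions for illustration, not part of the EyeEm mapping.

```python
import json

def filtered_search(text, field, filters):
    """Build a search body that scores only documents passing the term filters."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {field: text}},
                "filter": {
                    "bool": {
                        "must": [{"term": {name: value}}
                                 for name, value in filters.items()]
                    }
                },
            }
        }
    }

body = filtered_search("coff", "topics", {"city": "Berlin"})
print(json.dumps(body, indent=2))
```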
• Different analysers should go into separate fields
GET /eyephoto/_mapping
{
  "eyephoto6": {
    "mappings": {
      "photo": {
        "dynamic_templates": [
          {
            "string": {
              "mapping": {
                "type": "string",
                "index_analyzer": "photo_names",
                "search_analyzer": "photo_standard",
                "fields": {
                  "raw": {
                    "type": "string",
                    "index": "not_analyzed"
                  },
                  "split": {
                    "type": "string",
                    "analyzer": "standard"
                  }
                }
              },
              "match": "*",
              "match_mapping_type": "string"
            }
          }
        ]
• Different analysers should go into separate fields
POST /eyephoto/photo/_search
{
  "size": 1,
  "fields": [
    "id"
  ],
  "query": {
    "multi_match": {
      "query": "coff",
      "fields": [
        "topics"
      ]
    }
  },
  "facets": {
    "topic": {
      "terms": {
        "field": "topics.raw",
        "size": 1
      }
    }
  }
}
Response:
{
  "took": 18,
  "timed_out": false,
  "_shards": {
    ##########
  },
  "hits": {
    "total": 125,
    "max_score": 6.44889,
    "hits": [
      {
        #####
        "_id": "167480",
        #####
        }
      }
    ]
  },
  "facets": {
    "topic": {
      "_type": "terms",
      "missing": 0,
      "total": 138,
      "other": 57,
      "terms": [
        {
          "term": "Coffee",
          "count": 81
        }
      ]
    }
  }
}
• Different analysers should go into separate fields
POST /eyephoto/photo/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "lars",
            "operator": "and",
            "fields": [
              "name.raw^3",
              "name.split^2",
              "name"
            ]
          }
        },
        {
          "multi_match": {
            "query": "lars",
            "fields": [
              "name.raw^3",
              "name.split^2",
              "name"
            ]
          }
        }
      ]
    }
  }
}
• Read and write only to index aliases.
  Index Name   Index Aliases
  eyephoto5    "eyephotoread"
  eyephoto6    "eyephotowrite"
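Because the application only knows the aliases, moving reads to a freshly built index is a single atomic request to the `_aliases` endpoint. A minimal sketch of the request body, using the index and alias names from the table above; the helper `alias_swap_actions` is hypothetical.

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Point reads at eyephoto6 once it has been fully indexed.
body = alias_swap_actions("eyephotoread", "eyephoto5", "eyephoto6")
print(json.dumps(body, indent=2))
```

Both actions are applied together, so readers never see a moment where the alias points nowhere.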
• If you have a string or integer field, you can put an array into it as well. Each value simply adds its terms to the inverted index:
  Term   Documents
  Ey     1
  ye     1, 2
  eE     1
  Em     1
  Eye    1
  yeE    1
  eEm    1
  ea     2
  ah     2
  yea    2
  eah    2
• Use geohash wherever you query on lat/lng.
POST /eyephoto/photo/_search
{
  "query": {
    "function_score": {
      "query": {
        "filtered": {
          "query": {
            "match_all": {}
          },
          "filter": {
            "geohash_cell": {
              "location": {
                "lat": 52.5311,
                "lon": 13.404
              },
              "precision": 4,
              "neighbors": true
            }
          }
        }
      },
      "functions": [
        {
          "gauss": {
            "location": {
              "origin": "52.5311,13.404",
              "scale": "10km"
            }
          }
        },
        {
          "exp": {
            "uploaded": {
              "origin": "now",
              "scale": "2d"
            }
          }
        }
      ]
    }
  }
}
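What the `gauss` and `exp` functions in this query do to a score can be sketched with the decay formulas from the function_score documentation. The sketch assumes the default `decay` of 0.5, no `offset`, and that the two function results are multiplied together; the example photo (5 km away, uploaded one day ago) is made up.

```python
import math

def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Gaussian decay: equals `decay` when the distance equals `scale`."""
    d = max(0.0, abs(distance) - offset)
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-d ** 2 / (2.0 * sigma_sq))

def exp_decay(distance, scale, decay=0.5, offset=0.0):
    """Exponential decay: equals `decay` when the distance equals `scale`."""
    d = max(0.0, abs(distance) - offset)
    lam = math.log(decay) / scale
    return math.exp(lam * d)

# Scales taken from the query: 10 km for location, 2 days for upload time.
score_factor = gauss_decay(5.0, 10.0) * exp_decay(1.0, 2.0)
print(round(score_factor, 4))
```

A photo exactly 10 km away, or exactly two days old, gets half the weight of one at the origin; the geohash_cell filter keeps the number of documents that need this arithmetic small.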
LESSONS LEARNED - AGGREGATIONS
• Aggregations give you recursive facets, handle with care.
"aggregations": {
  "user_fullname": {
    "filter": {
      "query": {
        "match": {
          "topics": {
            "query": "lars beer",
            "operator": "or"
          }
        }
      }
    },
    "aggs": {
      "user_fullname": {
        "terms": {
          "field": "user_fullname.raw",
          "size": 3
        },
        "aggs": {
          "topics": {
            "filter": {
              "query": {
                "match": {
                  "topics": {
                    "query": "lars beer",
                    "operator": "or"
                  }
                }
              }
            },
            "aggs": {
              "topics": {
                "terms": {
                  "field": "topics.raw",
                  "size": 3
                }
              }
            }
          },
Response:
"user_fullname": {
  "doc_count": 678,
  "user_fullname": {
    "buckets": [
      {
        "key": "Lars 🍻 ",
        "doc_count": 678,
        "topics": {
          "doc_count": 5,
          "topics": {
            "buckets": [
              {
                "key": "Beer",
                "doc_count": 1
              },
              {
                "key": "BeerOps",
                "doc_count": 1
              },
              {
                "key": "Birthday beer in the snow",
                "doc_count": 1
              }
            ]
          }
        }
      }
    ]
  }
}
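Reading such a nested response means walking buckets inside buckets, and the number of leaf buckets can grow as the product of the `size` values at each level, which is one reason to handle aggregations with care. A minimal sketch, assuming a response shaped like the one above; both helpers are hypothetical and the sample data is copied from the slide.

```python
from math import prod

def flatten_buckets(aggregations):
    """Turn the nested user -> topic buckets into (user, topic, count) rows."""
    rows = []
    for user in aggregations["user_fullname"]["user_fullname"]["buckets"]:
        for topic in user["topics"]["topics"]["buckets"]:
            rows.append((user["key"], topic["key"], topic["doc_count"]))
    return rows

def max_leaf_buckets(sizes):
    """Upper bound on leaf buckets for nested terms aggregations."""
    return prod(sizes)

response = {
    "user_fullname": {
        "doc_count": 678,
        "user_fullname": {
            "buckets": [{
                "key": "Lars 🍻 ",
                "doc_count": 678,
                "topics": {
                    "doc_count": 5,
                    "topics": {"buckets": [
                        {"key": "Beer", "doc_count": 1},
                        {"key": "BeerOps", "doc_count": 1},
                        {"key": "Birthday beer in the snow", "doc_count": 1},
                    ]},
                },
            }]
        },
    }
}
rows = flatten_buckets(response)
print(rows)
print(max_leaf_buckets([3, 3]))
```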
OUTLOOK
• 1-liner search
• Public release
• Localisation (snowball / stopwords)
• Keep indexed documents (e.g. albums) updated
NEXT ITERATION (EVALUATING)
• Elasticsearch 1.1
• Oracle Java 1.8 (GC)
• More indexes and even more shards
• Restore API
WE ARE HIRING
