Elasticsearch at Automattic

Tuesday, February 25, 14
Greg Ichneumon Brown
Data Wrangler at Automattic
http://gibrown.wordpress.com
@gregibrown
greg@automattic.com

1 Billion Monthly
Uniques

Elasticsearch Deployments
Internal Search
- 216 Internal Blogs - 750k docs [3 GB]
Support Documents
- KNN Link Prediction - 1.7m docs [14 GB]
Polldaddy
- Word Clouds/Freq Response - 39m docs [9 GB]
WordPress.com VIP Search
- KFF.org - 18m docs [99 MB]
- NY Post - 600k docs [2.3 GB]
WordPress.com - ~800m docs [4 TB]
- Related Posts - 48 mil reqs/day
- search.wordpress.com - 3 mil reqs/day

Overview of Related Posts
Our “10X Improvements”
- Indexing
- Querying
Our Open Issues

Related Posts

Search within just the one blog

WordPress.com
Total Elasticsearch Operations

Operation          Ops/Day
Routed Queries     23 mil
Global Queries     2 mil
Docs Indexed       13 mil
Docs Updated       10 mil
Docs Deleted       2.5 mil
Delete By Query    250k

Global Cluster
DC1: 1 Master, 14 Data
DC2: 1 Master, 14 Data
DC3: 1 Master, 14 Data

Our Secret To Scaling
Routed Queries
All Posts for each Blog
are on the same Shard
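
A routed query could be sketched as follows. This is a minimal illustration, not the production code: the index name, the `blog_id` field, and the helper name are assumptions. The `routing` URL parameter is the standard Elasticsearch way to send a search only to the shard that holds a given routing value, and the `filtered` query is the Elasticsearch 1.x-era syntax from the time of this talk (later releases use `bool` with a `filter` clause instead).

```python
import json
from urllib.parse import urlencode


def routed_search_request(index, blog_id, query):
    """Build (path, body) for a search that only touches one blog's shard.

    `index` and the `blog_id` field name are hypothetical.
    """
    # routing=<blog_id> sends the search to the single shard that this
    # blog's posts were routed to at index time.
    path = f"/{index}/_search?" + urlencode({"routing": blog_id})
    # Other blogs live on the same shard, so the blog filter is still needed.
    body = {
        "query": {
            "filtered": {
                "query": query,
                "filter": {"term": {"blog_id": blog_id}},
            }
        }
    }
    return path, json.dumps(body)
```

The same routing value has to be supplied when indexing each post, otherwise the posts of a blog would not end up on one shard.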

Global Index

7 Indices
10 mil Blogs per Index
25 Shards per Index
175 Shards Total
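
The layout above can be sketched as a small lookup. The counts come from the slide. Two things are assumptions made only for illustration: that blogs are assigned to indices by contiguous blog ID range, and the `global-N` index naming.

```python
BLOGS_PER_INDEX = 10_000_000  # from the slide
SHARDS_PER_INDEX = 25         # from the slide
NUM_INDICES = 7               # from the slide

TOTAL_SHARDS = NUM_INDICES * SHARDS_PER_INDEX


def index_for_blog(blog_id, prefix="global"):
    """Return the index name for a blog.

    Assumption: blogs are split by contiguous ID range, so blogs
    0..9,999,999 go to <prefix>-0, the next 10 mil to <prefix>-1, etc.
    """
    n = blog_id // BLOGS_PER_INDEX
    if not 0 <= n < NUM_INDICES:
        raise ValueError(f"blog_id {blog_id} is outside the {NUM_INDICES} indices")
    return f"{prefix}-{n}"
```

With one index chosen by blog ID and one shard chosen by routing, a related posts query touches 1 of the 175 shards.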
Overview of Related Posts
Our “10X Improvements”
- Indexing
- Querying
Our Open Issues

20% Improvements
Don’t solve scaling problems

Indexing

Entangling Elasticsearch
with Existing Systems

Bulk Indexing 1.0
44 Days to Index all Posts
(estimated)

Bulk Indexing Problems
- Overhead: Spent too much time starting indexing jobs. WordPress.com has 500 mil MySQL tables.
- High DB Load: Corner cases. Blogs with 1+ mil followers.
- High DB Load: Indexing sequentially doesn’t spread the load.
- High DB Load: Heavy load on archive DBs.

Bulk Indexing Today
12.0?
4 Days to Index all Posts
(running right now)
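
To put 44 days versus 4 days in terms of throughput, here is a rough back-of-the-envelope calculation. It assumes the "~800m docs" figure from the deployments slide is the number of posts being indexed, which the slides do not state explicitly.

```python
SECONDS_PER_DAY = 86_400

# Assumption: total posts is roughly the ~800m docs in the WordPress.com index.
TOTAL_DOCS = 800_000_000


def docs_per_second(total_docs, days):
    """Average sustained indexing rate needed to finish in `days`."""
    return total_docs / (days * SECONDS_PER_DAY)


rate_v1 = docs_per_second(TOTAL_DOCS, 44)    # Bulk Indexing 1.0 (estimated)
rate_today = docs_per_second(TOTAL_DOCS, 4)  # Bulk Indexing today
speedup = 44 / 4
```

Under that assumption the estimate works out to roughly 210 docs/sec before and roughly 2,300 docs/sec today, an 11x change.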

Real Time Indexing
The Hardest Part!

Real Time Goals
1) Eventually Consistent
2) Minimize Bulk Re-indexing
3) Normally updated < 1 minute

Bulk reindexed 3 times in 5 months.
One intentional,
Two during system upgrades.

Stuff Fails
1) Humans
2) Hardware
3) Elasticsearch (steady improvements)
Combinations of the above.

Hardware Problems
1) Detect and Track Down Servers
2) Prioritize Queries over Indexing
3) Throttle Indexing Jobs
- any issues: block bulk changes to blogs
- >10 min: block doc updates
- >20 min: block all indexing
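
The throttling tiers above could be expressed as a small policy function. This is only a sketch: the job categories, their names, and the idea of measuring "how long the cluster has had issues" in minutes are assumptions layered on the three thresholds from the slide.

```python
from enum import Enum


class Job(Enum):
    # Hypothetical job categories, named after the slide's three tiers.
    BULK_BLOG_CHANGE = "bulk change to a blog"
    DOC_UPDATE = "update to an existing doc"
    OTHER_INDEXING = "any other indexing"


def blocked_jobs(issue_minutes):
    """Return the set of job types to block.

    issue_minutes: minutes the cluster has been having issues,
    or None when it is healthy.
    """
    if issue_minutes is None:
        return set()
    blocked = {Job.BULK_BLOG_CHANGE}   # any issues: block bulk changes
    if issue_minutes > 10:
        blocked.add(Job.DOC_UPDATE)    # >10 min: block doc updates
    if issue_minutes > 20:
        blocked = set(Job)             # >20 min: block all indexing
    return blocked
```

Blocked jobs would need to be queued or retried later so the index still becomes consistent once the cluster recovers.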
Real Time Failures
1) Auto Retry Failed Indexing Jobs
2) Indexing Queue for Failures
3) Scrolling Queries to Find Bad Docs
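
The first two points can be sketched as a retry wrapper that falls back to a failure queue. This is an illustration under assumptions, not the actual job system: `index_fn`, the in-memory `deque`, and the attempt count are all hypothetical, and a real system would likely back off between attempts and persist the queue.

```python
from collections import deque


def index_with_retry(doc, index_fn, failure_queue, max_attempts=3):
    """Try to index `doc`; after repeated failure, queue it for later.

    Returns True if indexing succeeded, False if the doc was queued.
    """
    for _ in range(max_attempts):
        try:
            index_fn(doc)
            return True
        except Exception:
            # Broad catch on purpose for this sketch: any failure is retried.
            continue
    # All attempts failed: remember the doc so a later job can re-index it.
    failure_queue.append(doc)
    return False
```

A periodic job would then drain `failure_queue`, which is what keeps the index eventually consistent without a full bulk re-index.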

Cluster Restarts
Indexing across replicas is
non-deterministic
Segments diverge
Slows Restart Time

Simplistic Example
[Diagram: Docs indexed into Shard 1; merges happen on both the Primary and the Replica. Looking for segments with identical checksums: only the first segment is identical.]

After Bulk Index
Every segment is out of sync!

Our Bulk Indexing Procedure
1) Bulk Index All Docs
2) Optimize the index
3) Rolling Restart (sync segments)
4) Future restarts will be much faster.
- Play with recovery settings
- SSDs? => use the noop Linux I/O scheduler

Indexing
It’s all about handling Failures

Overview of Related Posts
Our “10X Improvements”
- Indexing
- Querying
Our Open Issues

Querying
Test and Iterate

Related Posts Query
Started with MoreLikeThis API.
Did not scale well enough.

MLT API
1) Get Document
2) Analyze Document
3) Search for Similar Docs

MLT API vs MLT Query

Metric            MLT API                 MLT Query
Requests          147 req/sec             1062 req/sec
CPU               40%                     30%
Median latency    306 ms                  49.5 ms
Processing        All processing by ES    Build query in PHP

Related Posts Relevancy
Great With Long Content
{ "more_like_this":{
"fields":["mlt_content"],
"like_text":"Scaling Elasticsearch Part 1: Overview
ElasticSearch scaling Search We recently launched
Related Posts across WordPress.com, so its time to
pop the hood and take a look at what ended up in
our engine... ",
"percent_terms_to_match":0.08,
"boost_terms":5,
"analyzer": "en_analyzer"
}}
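
Building this query on the client side could look like the sketch below. The talk built it in PHP; Python is used here only for illustration. The parameter names and values are the ones on the slide (Elasticsearch 1.x-era `more_like_this` options), and the function name is hypothetical.

```python
def build_mlt_query(text, analyzer="en_analyzer"):
    """Build a more_like_this query from text the client already has.

    Because the client supplies the text, Elasticsearch does not have to
    fetch and analyze the source document first, as the MLT API does.
    """
    return {
        "more_like_this": {
            "fields": ["mlt_content"],
            "like_text": text,
            "percent_terms_to_match": 0.08,
            "boost_terms": 5,
            "analyzer": analyzer,
        }
    }
```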
MLT Query Relevancy
Use match or multi_match for
short content.

Average Related Posts CTR
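
Switching query type by content length might be sketched like this. The word-count threshold, the `title` and `content` field names used for `multi_match`, and the function name are assumptions for illustration; the slides do not say where the cutoff between short and long content was drawn.

```python
def build_related_query(title, content, min_words_for_mlt=100):
    """Pick a query type based on how much text the post has.

    min_words_for_mlt is a hypothetical cutoff, not a value from the talk.
    """
    text = f"{title} {content}".strip()
    if len(text.split()) >= min_words_for_mlt:
        # Long content: more_like_this has enough terms to work with.
        return {"more_like_this": {"fields": ["mlt_content"], "like_text": text}}
    # Short content: match the few available words directly.
    return {"multi_match": {"query": text, "fields": ["title", "content"]}}
```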
Language Analyzers
arabic, armenian, basque, brazilian, bulgarian,
catalan, chinese, czech, danish, dutch, english,
finnish, french, galician, german, greek, hindi,
hungarian, indonesian, italian, japanese, korean,
norwegian, persian, portuguese, romanian,
russian, spanish, swedish, turkish, thai

Related Posts Relevancy
How Important is using the
correct Language Analyzer?

Doubled Click Through Rate
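
Choosing the analyzer per post language could be as simple as a lookup with a fallback. In this sketch only `en_analyzer` appears in the slides; the other analyzer names, the language codes, and the function name are hypothetical. `standard` is Elasticsearch's built-in default analyzer.

```python
# Hypothetical mapping from a post's language code to a custom analyzer.
# Only "en_analyzer" is taken from the slides.
ANALYZERS = {
    "en": "en_analyzer",
    "fr": "fr_analyzer",
    "de": "de_analyzer",
}


def analyzer_for(lang_code, default="standard"):
    """Return the analyzer for a language, or the default if unknown."""
    code = (lang_code or "").lower()
    return ANALYZERS.get(code, default)
```

The same analyzer has to be used when the post is indexed and when the query text is analyzed for the terms to line up.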
Unfortunately
Increased Slow Queries
(>1 second)
by 10x
still worth it.

Global Query Performance
search.wordpress.com

Parent-Child Filtering
Blog Doc
public: true|false
Post Doc
title: “...”
content: “...”
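
A parent-child filter over these two document types might look like the sketch below, written in Elasticsearch 1.x-era syntax (`filtered` query with a `has_parent` filter). The parent type name `blog`, the function names, and the `blog_public` field in the second variant are assumptions. The second variant shows the usual alternative of copying the flag onto each post, which avoids the join at query time but means posts must be re-indexed when a blog's flag changes.

```python
def public_posts_query(text):
    """Search posts, keeping only those whose parent blog is public."""
    return {
        "filtered": {
            "query": {"multi_match": {"query": text, "fields": ["title", "content"]}},
            "filter": {
                "has_parent": {
                    "parent_type": "blog",  # assumed type name for the Blog Doc
                    "query": {"term": {"public": True}},
                }
            },
        }
    }


def public_posts_query_denormalized(text):
    """Same search, assuming a hypothetical `blog_public` flag on each post."""
    return {
        "filtered": {
            "query": {"multi_match": {"query": text, "fields": ["title", "content"]}},
            "filter": {"term": {"blog_public": True}},
        }
    }
```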

has_parent Filter
Querying Across All Shards

Metric            With has_parent    Without has_parent
Requests          7.6 req/sec        17.5 req/sec
CPU               75%                50%
Median latency    503 ms             207 ms

Requires more Indexing

Indexing:
Optimize to Handle Failures
Querying:
Test and Iterate

Overview of Related Posts
Our “10X Improvements”
- Indexing
- Querying
Our Open Issues

Open Issues
Slow Queries (> 1 second)

Getting Better. Shards are too big.

Open Issues
What does it take to scale?
3x Data
5x Queries

Open Issues
Elasticsearch for Natural
Language Processing?
At Scale.
On Live Data.

http://gibrown.wordpress.com
@gregibrown

Feeling Inspired?
http://automattic.com/work-with-us/data-wrangler/

Elasticsearch at Automattic