Until 2014, gutefrage.net was drifting towards chaos, and that chaos had many faces:
* Bad answers were shown to users, hiding the good and helpful ones
* Spam was overlooked in the huge amount of generated content
* Tags did not always represent the intended topic of the question
* Long page load times caused users to abandon the site before it was fully rendered
These problems hurt the user experience, the manageability of our content, the image of our platform, and ultimately our revenue.
To tackle these problems, we decided to leverage the full power of our data. We put great effort into automatically rating answers so we can hide the really bad ones from every user, we alert our Community Management to spammers in real time, we have improved page load time tremendously, and we are currently testing prototypes for (semi-)automatically inferring the topic of a question.
Using several examples, I will show how we discovered the problems by making the data visible to everyone in the company, fixed them either with advanced machine learning techniques or by relying on the "collective brain" of our community, and improved the user experience step by step until the chaos was finally defeated.
A Hadoop User Group (HUG) Ireland talk on Data Science production environments and their online set up using #ExpertModels by Cronan McNamara, CEO @CremeGlobal
Thinking in Graphs - GraphQL problems and more - Maciej Rybaniec (23.06.2017) - Grand Parade Poland
The GraphQL specification is intentionally silent on a handful of important issues facing APIs such as dealing with the network, authorization, errors and other things. During my presentation I want to describe each of these problems and present sample solutions for them.
Presentation from Lunch&Learn prepared by Maciej Rybaniec, Senior Frontend Developer at Grand Parade.
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes - MongoDB
With so much talk of how Big Data is revolutionizing the world and how a data lake with Hadoop and/or Spark will solve all your data problems, it is hard to tell what is hype, reality, or somewhere in-between.
In working with dozens of enterprises in varying stages of their enterprise data management (EDM) strategy, MongoDB enterprise architect Matt Kalan sees the same challenges and misunderstandings arise again and again.
In this session, he will explain common challenges in data management, what capabilities are necessary, and what the future state of architecture looks like. MongoDB is uniquely capable of filling common gaps in the data lake strategy.
This session also includes a live Q&A portion during which you are encouraged to ask questions of our team.
Detecting Anomalous Behavior with Surveillance Analytics - Databricks
Surveillance feeds were essentially monitored manually until recent years. Video analytics as a technology has made great strides; it leverages video surveillance networks to derive searchable, actionable, and quantifiable intelligence from live or recorded video content.
Driven by artificial intelligence and deep learning, video intelligence solutions detect and extract objects in a video. These solutions identify target objects based on trained Deep Neural Networks and then classify each object to enable intelligent video analysis, including search & filtering, alerting, data aggregation and visualization.
In our session, we will:
* Discuss the current state of surveillance and popular Python libraries used in video analytics
* Elucidate various approaches deployed, using a myriad of pre-trained models, from MobileNet SSD to the state-of-the-art YOLO model
* Describe the many pre-processing techniques we have used, such as the generation of a time-averaged frame, erosion, dilation, and many others
With the basics covered, it's LIGHTS! CAMERA! ACTION! Let us show you how this works: we will present a live demo that explains the performance-computing trade-offs between different models and techniques, and their limitations.
What you can expect to take away from our session:
* A deeper understanding of advanced video analytics techniques
* How to utilize pre-trained models for video analytics solutions
* The hardware requirements, limitations, and challenges posed while devising a video analytics solution
* Lessons learnt from deployment in a real-life scenario
* The future direction and possibilities of the solution we have developed
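The time-averaged-frame pre-processing step listed above can be sketched in a few lines of NumPy: average the frames to estimate a static background, then flag pixels that deviate from it. This is a minimal illustration under assumptions of my own (grayscale frames, a fixed intensity threshold), not the presenters' pipeline, which uses trained deep networks on top of such pre-processing.

```python
import numpy as np

def detect_motion(frames, threshold=50.0):
    """Toy background subtraction: compare each frame against a
    time-averaged background frame and flag pixels that differ by
    more than `threshold` intensity levels."""
    stack = np.stack(frames).astype(np.float32)   # (T, H, W) grayscale
    background = stack.mean(axis=0)               # the time-averaged frame
    masks = np.abs(stack - background) > threshold
    return background, masks

# Synthetic 8x8 grayscale clip: a static scene plus one bright moving blob.
rng = np.random.default_rng(0)
scene = rng.integers(40, 60, size=(8, 8)).astype(np.float32)
frames = []
for t in range(5):
    f = scene.copy()
    f[t, t] = 255.0  # the "moving object", sliding along the diagonal
    frames.append(f)

background, masks = detect_motion(frames)
print(masks[0].sum())  # pixels flagged as moving in the first frame
```

In a real system the background would be a running average over recent frames, and the binary mask would typically be cleaned up with the erosion and dilation operations the session also covers.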
DataMass Summit - Machine Learning for Big Data in SQL Server - Łukasz Grala
A session showcasing Machine Learning Server (machine learning algorithms in R and Python) as well as the ability to work with JSON data in SQL Server, and to connect to data residing on HDFS, Hadoop, or Spark via PolyBase in SQL Server, so that this data can be used for analysis and prediction through models written in R or Python.
A session on sentiment analysis, and also on the machine learning algorithms in Microsoft's R libraries. The session was presented at the WhyR? conference in Warsaw.
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro... - Databricks
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Talk on Data Discovery and Metadata by Mark Grover from July 2019.
Goes into detail of the problem, build/buy/adopt analysis and Lyft's solution - Amundsen, along with thoughts on the future.
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da... - Databricks
PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more.
Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook!
We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.
Warehousing Your Hits - The Why and How of Owning Your Data - Scott Arbeitman
These are the slides from my recent presentation at Melbourne's Web Analytics Wednesdays. I talk about transitioning from collecting your data in primary digital analytics systems to storing it in a data warehouse or data lake.
Importance of ML Reproducibility & Applications with MLflow - Databricks
With data as a valuable currency and the architecture of reliable, scalable Data Lakes and Lakehouses continuing to mature, it is crucial that machine learning training and deployment techniques keep up to realize value. Reproducibility, efficiency, and governance in training and production environments rest on the shoulders of both point in time snapshots of the data and a governing mechanism to regulate, track, and make best use of associated metadata.
This talk will outline the challenges and importance of building and maintaining reproducible, efficient, and governed machine learning solutions as well as posing solutions built on open source technologies – namely Delta Lake for data versioning and MLflow for efficiency and governance.
Although you may not have heard of JavaScript Object Notation Linked Data (JSON-LD), it is already impacting your business. Search engine giants such as Google have mandated JSON-LD as a preferred means of adding structured data to web pages to make them considerably easier to parse for more accurate search engine results. The Google use case is indicative of the larger capacity for JSON-LD to increase web traffic for sites and better guide users to the results they want.
Expectations are high for (JSON-LD), and with good reason. JSON-LD effectively delivers the many benefits of JSON, a lightweight data interchange format, into the linked data world. Linked data is the technological approach supporting the World Wide Web and one of the most effective means of sharing data ever devised.
In addition, the growing number of enterprise knowledge graphs fully exploit the potential of JSON-LD as it enables organizations to readily access data stored in document formats and a variety of semi-structured and unstructured data as well. By using this technology to link internal and external data, knowledge graphs exemplify the linked data approach underpinning the growing adoption of JSON-LD—and the demonstrable, recurring business value that linked data consistently provides.
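The structured-data use case described above can be made concrete with a minimal JSON-LD document, built here as a Python dict for easy inspection. The vocabulary (`@context`, `@type`) is standard JSON-LD/schema.org; the specific values are illustrative and not taken from the talk.

```python
import json

# A minimal JSON-LD snippet of the kind search engines consume for rich
# results: @context maps the plain keys onto the schema.org vocabulary,
# and @type identifies what the object describes.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Enterprise Knowledge Graphs with JSON-LD",
    "startDate": "2019-05-01",
    "location": {
        "@type": "Place",
        "name": "Example Conference Center",
    },
}
print(json.dumps(event, indent=2))
```

On a web page, a snippet like this is typically embedded inside a `<script type="application/ld+json">` tag, which is how crawlers pick it up without it affecting rendering.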
Join us to learn more about optimizing the unique Document and Graph Database capabilities provided by AllegroGraph to develop or enhance your Enterprise Knowledge Graph using JSON-LD.
The Query Store, part of Azure SQL Database and SQL Server 2016, changes the way query performance tuning will be done. Learn about this new technology, how it works and how to apply it.
Today’s highly connected world is flooding businesses with big and fast-moving data. The ability to trawl this data ocean and identify actionable insights can deliver a competitive advantage to any organization. The WSO2 Analytics Platform enables businesses to do just that by providing batch, real-time, interactive and predictive analysis capabilities all in one place.
In this tutorial we will
* Plug in the WSO2 Analytics Platform to some common business use cases
* Showcase the numerous capabilities of the platform
* Demonstrate how to collect data, analyze, predict and communicate effectively
* Demonstrate how it can analyze integration, security and IoT scenarios
Stick around till the end and you will walk away with the necessary skills to create a winning data strategy for your organization to stay ahead of its competition.
Neo4j-Databridge: Enterprise-scale ETL for Neo4j - GraphAware
Neo4j - London User Group Meetup - 28th March, 2018
If your data ingestion requirements have grown beyond importing occasional CSV files, then this talk is for you. Neo4j-Databridge from GraphAware is a comprehensive ETL tool specifically built for Neo4j. It has been designed for usability, expressive power, and high performance to address the most common issues faced when importing data into Neo4j: multiple data sources and types, very large data sets, bespoke data conversions, non-tabular formats, filtering, merging and de-duplication, as well as bulk imports and incremental updates.
In this talk, we'll take a quick tour of some of the main features, loading data from Kafka, Redis, JDBC, and various other data sources along the way, to understand how Neo4j-Databridge solves these problems and how it can help you import your data quickly and easily into Neo4j.
Vince Bickers is a Principal Consultant at GraphAware and the main author of Spring Data Neo4j (v4). He has been writing software and leading software development teams for over 30 years at organisations like Vodafone, Deutsche Bank, HSBC, Network Rail, UBS, VMWare, ConocoPhillips, Aviva and British Gas.
Building a data pipeline to ingest data into Hadoop in minutes using Streamse... - Guglielmo Iozzia
Slides from my talk at the Hadoop User Group Ireland meetup on June 13th, 2016: building a data pipeline to ingest data from sources of very different natures into Hadoop in minutes (with no coding at all) using the open-source StreamSets Data Collector tool.
This presentation demonstrates how Incorta supports data security requirements. It describes how to define session variables and security filters.
HBase from the Trenches - Phoenix Data Conference 2015 - Avinash Ramineni
Apache HBase has been widely adopted at many enterprises. In this talk we will cover a few war stories about troubleshooting, tuning, and fixing problems with an HBase cluster. We will also cover some of the best practices, tools, utilities, and lessons learnt from evaluating deployments at different organizations.
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch... - sparktc
At the sold-out Spark & Machine Learning Meetup in Brussels on October 27, 2016, Nick Pentreath of the Spark Technology Center teamed up with Jean-François Puget of IBM Analytics to deliver a talk called Creating an end-to-end Recommender System with Apache Spark and Elasticsearch.
Jean-François and Nick started with a look at the workflow for recommender systems and machine learning, then moved on to data modeling and using Spark ML for collaborative filtering. They closed with a discussion of deploying and scoring the recommender models, including a demo.
You’re Solr powered, and needing to customize its capabilities. Apache Solr is flexibly architected, with practically everything pluggable. Under the hood, Solr is driven by the well-known Apache Lucene. Lucene for Solr Developers will guide you through the various ways in which Solr can be extended, customized, and enhanced with a bit of Lucene API know-how. We’ll delve into improving analysis with custom character mapping, tokenizing, and token filtering extensions; show why and how to implement specialized query parsing, and how to add your own search and update request handling.
We went over what Big Data is and its value. This talk will cover the details of Elasticsearch, a Big Data solution. Elasticsearch is a NoSQL-backed search engine using an HDFS-based filesystem.
We'll cover:
• Elasticsearch basics
• Setting up a development environment
• Loading data
• Searching data using REST
• Searching data using NEST, the .NET interface
• Understanding Scores
Finally, I show a use-case for data mining using Elasticsearch.
You'll walk away from this armed with the knowledge to add Elasticsearch to your data analysis toolkit and your applications.
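The "loading data" and "searching data using REST" steps above boil down to a couple of HTTP calls against an Elasticsearch node. Here is a minimal sketch of those calls, assuming a local node at `http://localhost:9200` and a hypothetical `talks` index; the request bodies are standard Elasticsearch query DSL, and actually sending them (e.g. with the `requests` library) is left out so the sketch stays self-contained.

```python
import json

BASE = "http://localhost:9200"

# Indexing: PUT a JSON document at /<index>/_doc/<id>.
index_url = f"{BASE}/talks/_doc/1"
document = {"title": "Intro to Elasticsearch", "tags": ["big data", "search"]}

# Searching: POST a query-DSL body to /<index>/_search.
search_url = f"{BASE}/talks/_search"
query = {
    "query": {"match": {"title": "elasticsearch"}},  # full-text match query
    "size": 10,                                      # return at most 10 hits
}

print(index_url)
print(search_url)
print(json.dumps(query))
```

The .NET route the talk mentions (NEST) wraps these same endpoints in a typed client, so the DSL shown here carries over directly.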
Building a Real Time Dashboard with Amazon Kinesis, Amazon Lambda and Amazon ... - Amazon Web Services
Organisations today need a way to manage the ever-increasing volume of data from numerous sources such as log systems, click streams or connected devices and be able to analyse this data in real-time. In this session we will walk through an architecture demonstration of how to leverage AWS services to meet these needs.
Speaker: Ganesh Raja, Solutions Architect, Amazon Web Services
The right architecture is key for any IT project. This is especially the case for big data projects, where there are no standard architectures which have proven their suitability over years. This session discusses the different Big Data Architectures which have evolved over time, including traditional Big Data Architecture, Streaming Analytics architecture as well as Lambda and Kappa architecture and presents the mapping of components from both Open Source as well as the Oracle stack onto these architectures.
Presentation from full-stack agile on how you can scale your agile teams as your company grows. As your company grows your teams need to be able to adapt to change quickly.
New feature overview of Cubes 1.0 – lightweight Python OLAP and pluggable data warehouse. Video: https://www.youtube.com/watch?v=-FDTK80zsXc Github sources: https://github.com/databrewery/cubes
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine - Trey Grainger
Search engines frequently miss the mark when it comes to understanding user intent. This talk will describe how to overcome this by leveraging Lucene/Solr to power a knowledge graph that can extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships. For example, if a user types in (Senior Java Developer Portland, OR Hadoop), you or I know that the term “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary, and that “hadoop” is a technology related to terms like “hbase”, “hive”, and “map/reduce”. Out of the box, however, most search engines just parse this query as text:((senior AND java AND developer AND portland) OR (hadoop)), which is not at all what the user intended. We will discuss how to train the search engine to parse the query into this intended understanding, and how to reflect this understanding to the end user to provide an insightful, augmented search experience. Topics: Semantic Search, Finite State Transducers, Probabilistic Parsing, Bayes Theorem, Augmented Search, Recommendations, NLP, Knowledge Graphs
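The parse described above can be illustrated with a toy greedy tagger: a phrase dictionary maps known multi-word entities to types, and the query is consumed longest-match-first. The dictionary here is hand-built and hypothetical; in the talk's system these phrases and their relationships come from a Lucene/Solr-backed knowledge graph, not a static lookup.

```python
# Hypothetical entity dictionary standing in for the knowledge graph.
ENTITIES = {
    "senior": "experience_level",
    "java developer": "job_title",
    "portland, or": "city",
    "hadoop": "technology",
}

def parse(query):
    """Greedy longest-match-first phrase tagging of a query string."""
    tokens = query.lower().split()
    parsed, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at i, then shrink.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in ENTITIES:
                parsed.append((phrase, ENTITIES[phrase]))
                i = j
                break
        else:
            parsed.append((tokens[i], "keyword"))  # unknown term
            i += 1
    return parsed

print(parse("Senior Java Developer Portland, OR Hadoop"))
```

Each typed span can then be rewritten into a field-scoped or expanded sub-query (e.g. a geo filter for the city, synonym expansion for the technology), instead of the naive all-terms AND/OR parse the abstract criticizes.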
In the age of digital transformation and disruption, your ability to thrive depends on how you adapt to the constantly changing environment. MongoDB 3.4 is the latest release of the leading database for modern applications, a culmination of native database features and enhancements that will allow you to easily evolve your solutions to address emerging challenges and use cases.
In this webinar, we introduce you to what’s new, including:
- Multimodel Done Right. Native graph computation, faceted navigation, rich real-time analytics, and powerful connectors for BI and Apache Spark bring additional multimodel database support right into MongoDB.
- Mission-Critical Applications. Geo-distributed MongoDB zones, elastic clustering, tunable consistency, and enhanced security controls bring state-of-the-art database technology to your most mission-critical applications.
- Modernized Tooling. Enhanced DBA and DevOps tooling for schema management, fine-grained monitoring, and cloud-native integration allow engineering teams to ship applications faster, with less overhead and higher quality.
Building a real-time big data analytics platform with Solr - Trey Grainger
Having “big data” is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.
At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You'll also get a sneak peek at some new faceting capabilities just wrapping up development, including distributed pivot facets and percentile/stats faceting, which will be open-sourced.
The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you'll never see Solr as just a text search engine again.
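Several of the facet types named above map directly onto standard Solr request parameters. The sketch below assembles such a request; the parameters (`facet`, `facet.field`, `facet.pivot`, `facet.range` and friends) are standard Solr, while the collection and field names (`jobs`, `job_title`, `state`, `city`, `posted_date`) are hypothetical placeholders of mine.

```python
from urllib.parse import urlencode

# A facet-only Solr query combining a field facet, a pivot facet, and a
# time-series (range) facet over a date field.
params = {
    "q": "*:*",
    "rows": 0,                        # facet counts only, no documents
    "facet": "true",
    "facet.field": "job_title",       # plain field facet
    "facet.pivot": "state,city",      # pivot facet: counts per state, then city
    "facet.range": "posted_date",     # time-series facet
    "facet.range.start": "NOW-1YEAR",
    "facet.range.end": "NOW",
    "facet.range.gap": "+1MONTH",     # one bucket per month
}
url = "http://localhost:8983/solr/jobs/select?" + urlencode(params)
print(url)
```

A single request like this returns all three facet breakdowns at once, which is what makes Solr workable as a real-time analytics engine rather than just a text search box.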
How sitecore depends on mongo db for scalability and performance, and what it... - Antonios Giannopoulos
Percona Live 2017 - How sitecore depends on mongo db for scalability and performance, and what it can teach you by Antonios Giannopoulos and Grant Killian
MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDBMongoDB
Data analytics can offer insights into your business and help take it to the next level. In this talk you'll learn about MongoDB tools for building visualizations, dashboards and interacting with your data. We'll start with exploratory data analysis using MongoDB Compass. Then, in a matter of minutes, we'll take you from 0 to 1 - connecting to your Atlas cluster via BI Connector and running analytical queries against it in Microsoft Excel. We'll also showcase the new MongoDB Charts product and you'll see how quick, easy and intuitive analytics can be on the MongoDB platform without flattening the data or spending time and effort on complicated and fragile ETL.
• Explored and cleaned a huge amount of user activity logs (JSON) from a movies website using MapReduce jobs in Python.
• Classified user accounts into adults and children for targeted advertising by implementing a similarity-ranking algorithm.
• Grouped user sessions by user behavior using K-means clustering to observe outliers and find distinctive groups.
• Predicted movie ratings using user-user and item-item recommendation algorithms in Mahout.
Eagle6 is a product that uses system artifacts to create a replica model representing a near-real-time view of the system architecture. Eagle6 was built to collect system data (log files, application source code, etc.) and to link system behaviors in such a way that the user can quickly identify risks associated with unknown or unwanted behavioral events that may have unknown impacts on seemingly unrelated downstream systems. This session presents the capabilities of the Eagle6 modeling product and how we are using MongoDB to support near-real-time analysis of large, disparate datasets.
[This work was presented at SIGMOD'13.]
The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.
Navigating the Metaverse: A Journey into Virtual EvolutionDonna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. This is where custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Field Employee Tracking System | MiTrack App | Best Employee Tracking Solution |...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us: https://informapuae.com/field-staff-tracking/
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfJay Das
With the advent of artificial intelligence (AI) tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT and Bard, organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden, India, and Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
14. Learnings
+ Reads are fast
+ Spark helps build a Lambda Architecture
- Still duplicated code and complexity
- Each change needs an update of the batch view
20. Overall ranking with MySQL
SELECT
user_id,
SUM(points) as score
FROM event_log
WHERE created_at BETWEEN NOW() - INTERVAL 90 DAY AND NOW()
GROUP BY user_id
ORDER BY score DESC
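The same 90-day ranking can be reproduced end to end with an in-memory database. A minimal runnable sketch using SQLite, with hypothetical sample events (table and column names follow the slide):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (user_id TEXT, points INTEGER, created_at TEXT)")

# hypothetical sample events: three within the window, one far too old
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?, datetime('now', ?))",
    [("alice", 10, "-1 days"),
     ("alice", 5, "-2 days"),
     ("bob", 7, "-3 days"),
     ("bob", 100, "-200 days")],  # outside the 90-day window, must not count
)

# SQLite spells the date arithmetic differently than MySQL, but the
# GROUP BY / ORDER BY structure is identical to the slide's query
rows = conn.execute("""
    SELECT user_id, SUM(points) AS score
    FROM event_log
    WHERE created_at BETWEEN datetime('now', '-90 days') AND datetime('now')
    GROUP BY user_id
    ORDER BY score DESC
""").fetchall()

print(rows)  # [('alice', 15), ('bob', 7)]
```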
21. First results of performance test
● Some queries were fast enough
● BUT: 17-20 second queries in the worst-case scenario
23. Aggregations in Elasticsearch
The aggregations framework helps provide aggregated data based on a search
query. It is based on simple building blocks called aggregations, that can be
composed in order to build complex summaries of the data.
(Elasticsearch documentation)
24. Aggregation for Top User List
"aggregations": {
  "top_users": {
    "terms": {
      "field": "user_id",
      "size": 100,
      "shard_size": 2000,
      "order": {
        "total_score": "desc"
      }
    },
    "aggregations": {
      "total_score": {
        "sum": {
          "field": "score"
        }
      }
    }
  }
}
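The response to such a request nests one total_score value inside every user_id bucket. A small sketch of pulling the ranked list out of a hypothetical response body (user IDs and scores are made up; field names follow the aggregation above):

```python
# hypothetical Elasticsearch response, trimmed to the aggregation part
response = {
    "aggregations": {
        "top_users": {
            "buckets": [
                {"key": "user-1", "doc_count": 3, "total_score": {"value": 42.0}},
                {"key": "user-2", "doc_count": 5, "total_score": {"value": 17.0}},
            ]
        }
    }
}

def ranked_users(resp):
    """Return (user_id, total_score) pairs in the order Elasticsearch ranked them."""
    buckets = resp["aggregations"]["top_users"]["buckets"]
    return [(b["key"], b["total_score"]["value"]) for b in buckets]

print(ranked_users(response))  # [('user-1', 42.0), ('user-2', 17.0)]
```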
25. Aggregation for Top User List
"aggregations": {
  "top_users": {
    "terms": {
      "field": "user_id",
      "size": 100,
      "shard_size": 2000,
      "order": {
        "total_score": "desc"
      }
    },
    "aggregations": {
      "total_score": {
        "sum": {
          "field": "score"
        }
      }
    }
  }
}
groupBy: the terms aggregation on user_id plays the role of SQL's GROUP BY user_id
26. Aggregation for Top User List
"aggregations": {
  "top_users": {
    "terms": {
      "field": "user_id",
      "size": 100,
      "shard_size": 2000,
      "order": {
        "total_score": "desc"
      }
    },
    "aggregations": {
      "total_score": {
        "sum": {
          "field": "score"
        }
      }
    }
  }
}
order by: sorting the buckets by the total_score sub-aggregation plays the role of ORDER BY score DESC
27. Aggregation for Top User List
"aggregations": {
  "top_users": {
    "terms": {
      "field": "user_id",
      "size": 100,
      "shard_size": 2000,
      "order": {
        "total_score": "desc"
      }
    },
    "aggregations": {
      "total_score": {
        "sum": {
          "field": "score"
        }
      }
    }
  }
}
tune accuracy: shard_size controls how many terms each shard returns; larger values improve accuracy at the cost of memory and latency
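Why shard_size tunes accuracy: each shard returns only its local top terms, so a user whose score is spread evenly across shards can be cut off on every shard despite being the global winner. A toy simulation of that effect (the shard contents are made up for illustration):

```python
from collections import Counter

def shard_top(shard_scores, shard_size):
    # each shard returns only its local top `shard_size` terms
    return dict(Counter(shard_scores).most_common(shard_size))

def global_top(shards, shard_size, size=3):
    # the coordinating node merges the per-shard partial results
    combined = Counter()
    for shard in shards:
        combined.update(shard_top(shard, shard_size))
    return combined.most_common(size)

# user B has the highest global score (18) but is never a local #1
shards = [{"A": 10, "B": 9}, {"C": 10, "B": 9}]

# shard_size=1: B is cut off on both shards and missing from the result
print(global_top(shards, shard_size=1))  # [('A', 10), ('C', 10)]

# shard_size=2: both shards report B, and B correctly wins with 18
print(global_top(shards, shard_size=2))  # [('B', 18), ('A', 10), ('C', 10)]
```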
29. Request cache
● Search on local shards
● Cache local
● Invalidated on changes
● Hits.total, aggregations and suggestions
30. Request cache
● Search on local shards
● Cache local
● Invalidated on changes
● Hits.total, aggregations and suggestions
➔ Too many updates
➔ Lots of cache misses
31. Split data:
● Data of today: use an index template to create today's index with the first event
● Historical data: index without changes
[Diagram: an incoming event is written to the "data of today" index; older events live in the historical-data indices]
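Creating today's index from a template (as on this slide) could look like the sketch below; the template name, index pattern, and field names are assumptions, and the exact endpoint (legacy _template vs. newer _index_template) depends on your Elasticsearch version:

```json
PUT _index_template/events_daily
{
  "index_patterns": ["events-*"],
  "template": {
    "mappings": {
      "properties": {
        "user_id":    { "type": "keyword" },
        "score":      { "type": "long" },
        "created_at": { "type": "date" }
      }
    }
  }
}
```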
32. Use filtered aliases to select the data of a time range
[Diagram: a filtered alias covering the last 90 days spans the "data of today" index and the historical-data indices]
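A filtered alias like the one on this slide could be defined as below; the alias name, index pattern, and date field are assumptions:

```json
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "events-*",
        "alias": "events-last-90d",
        "filter": {
          "range": {
            "created_at": { "gte": "now-90d/d" }
          }
        }
      }
    }
  ]
}
```

Because queries go through the alias rather than the raw index names, the 90-day window moves forward without any change on the service side.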
33. Use cached results from historical data
[Diagram: the service queries the 90-day filtered alias with _search?request_cache=true; results from the unchanging historical indices are served from the request cache]
34. The next day
[Diagram: the next day, yesterday's index becomes part of the historical data, a new "data of today" index is created, and the filtered alias still covers the last 90 days]
35. Merge the old indices
[Diagram: a merge job consolidates the old daily indices into the historical-data indices behind the filtered alias]
36. Warm cache already in merge job
[Diagram: the merge job issues _search?request_cache=true itself, so the request cache is already warm before the service queries the alias]
38. Learnings:
● Improved internal reindex framework
● Aliases are always your friends
● Request cache FTW
● Cache misses when you use the index name instead of the alias (?)
● Results may not be 100% accurate (but that's no problem for us)
40. We’re hiring…
● Web-Developer
● We are looking for experts in the area of Search and NLP who are interested in supporting us for a couple of days!
Please get in touch. :)