Full scan frenzy at Amadeus
Unpredictable & interactive
analysis of terabytes of data
MongoDB World, June 1 2015
Laurent Dollé
Attila Tozser
Nicolas Motte

1. Amadeus today

Amadeus
In a few words
Amadeus is a technology company dedicated to the
global travel industry.
We are present in 195 countries
with a worldwide team of more than 11,000 people.
Our solutions help improve the
business performance
of travel agencies, corporations, airlines,
airports, hotels, railways and more.
© 2015 Amadeus IT Group SA

Connecting
The travel industry
Cruiselines
Hotels
Car rental
Ground handlers
Ferry operators
Ground transportation
Airports
Travel agencies
Insurance companies
Airlines

Supporting
The traveler life cycle
Post-trip
On trip
Pre-trip
Buy/Purchase
Search
Inspire

Robust
Global operations
We designed & own our Data Processing Centres
_ Central DC @ Erding, Germany
_ Remote DCs all over the globe
_ Recovery DC on standby in case of natural disasters
_ 1.6+ billion transactions processed per day
_ 526+ million travel agency bookings processed in 2014
_ 695+ million passengers boarded in 2014
_ 95% of the world’s scheduled network airline seats

Close
To our customers

Our commitment
To innovation
_ Amadeus has invested €3.5bn in
Research & Development
since 2004.
_ Nominated within “top 3” software
companies in the 2014 European Union
Industrial R&D Investment Scoreboard.

2. Amadeus Revenue Accounting Search

Passenger Revenue Accounting
What for?
Revenue of a flight ticket is shared
_ Travel agent
_ Governments
_ Airlines: many can be involved (marketing & operating)
Amadeus Revenue Accounting handles cash flows on behalf of airlines
_ Tracking
_ Error handling & optimisation
_ Reporting: analysis & audit

Business needs
Gathered from a Revenue Accounting launch partner
One of our launch partners is a large European airline
_ transporting 35m+ passengers a year
_ key player in the revenue accounting industry
Migrating
_ from their in-house data warehouse
_ to our cloud-based solution
They requested a user-friendly way to query any data in our main operational database
_ Unpredictable ad-hoc search
_ Many advanced reporting requirements

The main promises
_ Graphical user interface
_ based on the SQL paradigm
_ to edit, import, save & share queries

The main promises
_ Data warehouse fed in real time
4 years of history (1.5bn documents, versioned)
_ Interactive response times

Expecting fast answers
to unpredictable queries
No index, no hint (almost)
_ Fields to be scanned unknown
_ In-memory full scans to decrease response time
Need to use all the available hardware power
& scale out for sustainable performance
Support mainstream SQL DML statements
_ Aggregation
_ Cross-column comparison, Boolean logic
_ Sort

3. Technical architecture

Production cluster setup
Facts & figures
6 physical data servers
_ Server: HP ProLiant DL580 Gen8, 4 sockets, x86, rack
_ 4x CPU: Intel Xeon E7-4850 v2, 2.30 GHz, 12 physical cores
_ RAM: 512GB, 40GB/s scanning speed
_ 2x flash cards: Fusion-io ioScale 3.2TB, 1.5GB/s read
3 virtual config servers
_ RAM 8GB
Overall cluster
_ 288 cores, 3TB RAM, 38.4TB flash card storage
Currently 1 year of production data (4 expected)
_ 310m+ docs (1bn)
_ Data size 3.6TB (11TB)
_ Average object size 12.5KB
_ File size 4.8TB (16TB)
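The overall cluster figures follow from the per-server figures by simple multiplication. The sketch below redoes that arithmetic and adds a rough lower bound for an in-memory full scan. It assumes the 40GB/s scanning speed is a per-server figure and that all six servers scan in parallel; the function names are illustrative only.

```python
# Per-server figures taken from the slide.
SERVERS = 6
SOCKETS_PER_SERVER = 4
CORES_PER_SOCKET = 12
RAM_GB_PER_SERVER = 512
FLASH_CARDS_PER_SERVER = 2
FLASH_TB_PER_CARD = 3.2
SCAN_GB_PER_S_PER_SERVER = 40  # assumption: the 40GB/s figure is per server


def cluster_totals():
    """Roll the per-server figures up to the whole cluster."""
    cores = SERVERS * SOCKETS_PER_SERVER * CORES_PER_SOCKET
    ram_tb = SERVERS * RAM_GB_PER_SERVER / 1024
    flash_tb = SERVERS * FLASH_CARDS_PER_SERVER * FLASH_TB_PER_CARD
    return {"cores": cores, "ram_tb": ram_tb, "flash_tb": round(flash_tb, 1)}


def ideal_scan_seconds(data_tb):
    """Lower bound for a full scan if every server scans memory at full speed."""
    aggregate_gb_per_s = SERVERS * SCAN_GB_PER_S_PER_SERVER
    return data_tb * 1024 / aggregate_gb_per_s


print(cluster_totals())
print(ideal_scan_seconds(3.6))  # current data size, fully in memory
```

This is only a best case: it ignores CPU cost, aggregation and any data that has to be read from flash.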

Microsharding coming from Microservices?
Enforce parallel processing
A MongoDB daemon (mongod) processes each incoming query on a single thread.
_ It is not recommended to:
• Collocate many mongod processes on a single box
Our online analytical processing use-case implies:
_ full scans (ad-hoc queries)
_ limited concurrency for queries (requests come from a queue)
We have many cores but only 6 boxes. If we followed all the recommendations, we would end up with:
[Diagram: SHARD1 and SHARD2 spread over Node 1 to Node 6, with 2 primaries and 4 secondaries]
_ 2 cores running, 286 idling
_ 2/3 of the memory idling
_ 4 flash cards working at around 6% each and 8 idling
We need to go against some of the recommendations!
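The idle-resource figures can be reproduced with a small model. It is a sketch under two assumptions that are not stated on the slide: each shard's primary uses exactly one core per query, and primaries are placed on distinct nodes until every node hosts one. The per-node figures (48 cores, 2 flash cards) come from the cluster setup.

```python
def utilisation(shards, nodes=6, cores_per_node=48, flash_cards_per_node=2):
    """Resources touched by a single query when each shard has one
    single-threaded primary (assumed to sit on distinct nodes first)."""
    total_cores = nodes * cores_per_node
    busy_cores = min(shards, total_cores)
    busy_nodes = min(shards, nodes)
    return {
        "busy_cores": busy_cores,
        "idle_cores": total_cores - busy_cores,
        "idle_memory_fraction": 1 - busy_nodes / nodes,
        "busy_flash_cards": busy_nodes * flash_cards_per_node,
        "idle_flash_cards": (nodes - busy_nodes) * flash_cards_per_node,
    }


print(utilisation(shards=2))   # the recommended two-shard layout
print(utilisation(shards=48))  # a microsharded layout
```

With 2 shards the model gives 2 busy cores, 286 idle cores, two thirds of the memory idle, and 4 busy versus 8 idle flash cards, which matches the slide.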

Benchmarking
_ Queries are bound by either CPU or memory scanning speed
_ With a fixed number of shards, the response time scales linearly with the data size
[Figure: two plots of TIME versus DATA SIZE, panels "FULL SCAN" and "FULL SCAN WITH AGGREGATION"]
Behaviour reproduced for 2 shard distributions
24 & 48 shards on 6 physical servers, 100% in-memory

Microsharding coming from Microservices?
Enforce parallel processing
Problem | Reason | Solution
2 cores running, 286 idling | 2 primaries processing the requests | We need more primaries processing the requests (to use all the 288 CPUs)
2/3 of the memory idling | Primaries only on 2 nodes | We need to run primaries on all the available nodes
4 flash cards working at around 6% each and 8 idling | Only 2 threads used, on 2 nodes | We need many threads working on the cards (ideally 64 per box)

Microsharding
Measure the benefit
Validation, from 6 to 48 shards on 6 physical servers, for 2 selected fairly complex queries
The behavior is logarithmic as the assigned proportion of the data per shard changes
[Figure: two plots of TIME versus SHARDS, panels "FULL SCAN" and "FULL SCAN WITH AGGREGATION"]
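To see why the proportion of data per shard matters, the sketch below computes the slice each shard holds and an idealised speed-up over the 6-shard baseline. It assumes that scan time is proportional to the slice a shard holds and that every shard gets its own core. This is a simplification for illustration; the measured curves on the slide include overheads that the model ignores.

```python
def per_shard_share(shards):
    """Fraction of the data set held by each shard when data is spread evenly."""
    return 1 / shards


def ideal_speedup(shards, baseline_shards=6, cores=288):
    """Idealised speed-up over the baseline: one core per shard, no overhead."""
    effective = min(shards, cores)
    baseline = min(baseline_shards, cores)
    return effective / baseline


for n in (6, 12, 24, 48):
    print(n, round(per_shard_share(n), 4), ideal_speedup(n))
```

Going from 6 to 48 shards shrinks each slice by a factor of 8, so 8x is the upper bound this model allows for a full scan.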

Microsharding (how to align the services?)
Shard, replicate & stripe
[Diagram: Node 1 to Node 6, each hosting several daemons labelled 1st, 2nd and arb. Legend: Primary, Secondary, Arbiter]
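One way to stripe replica sets over the nodes is a round-robin placement, sketched below. The exact mapping used in production is not recoverable from the diagram, so the rule here (primary on node i, secondary on the next node, arbiter on the one after) is an assumption chosen only to show the idea: every node hosts primaries, and no replica set keeps two members on the same node.

```python
def striped_layout(shards, nodes=6):
    """Round-robin placement of replica-set members (hypothetical rule)."""
    layout = []
    for i in range(shards):
        layout.append({
            "shard": i,
            "primary": i % nodes,
            "secondary": (i + 1) % nodes,
            "arbiter": (i + 2) % nodes,
        })
    return layout


def primaries_per_node(layout, nodes=6):
    """Count how many primaries each node hosts."""
    counts = [0] * nodes
    for members in layout:
        counts[members["primary"]] += 1
    return counts


layout = striped_layout(shards=24)
print(primaries_per_node(layout))
```

With 24 shards on 6 nodes, each node ends up with 4 primaries, so a query that fans out to all shards keeps every box busy.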

NUMA: Non-uniform memory access
Fit your workload to modern HW
How does modern hardware handle memory?
_ Local memory access: a thread accesses the memory of its own socket
_ Remote memory access: a thread accesses the memory of a different socket
_ Interleaving: force the HW to mimic UMA
_ Binding: force the tasks to use only given resources
Interleaving has serious penalties on performance
_ Never do this unless you do not care about performance
_ Depends on the HW but can be up to 50-60%
[Diagram: a server with four sockets (Socket 0 to Socket 3) linked by QPI; each socket holds cores 1 to n with their L1 and L2 caches, an L3 cache and its own main memory]

NUMA: Non-uniform memory access
The recommendation is to interleave, but:
Use node & memory binding!
numactl --physcpubind xx --localalloc mongod -f …
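With many mongod processes per box, each one needs its own binding. The sketch below generates one numactl command line per daemon, using the `--physcpubind` and `--localalloc` options from the slide. It assumes core IDs are numbered contiguously per socket, which should be checked with `numactl --hardware` on the real machine. The config file path is a hypothetical placeholder.

```python
def numactl_commands(mongods_per_socket, sockets=4, cores_per_socket=12,
                     config_template="/etc/mongod-{index}.conf"):
    """Build one numactl command per mongod, splitting each socket's cores
    evenly. Assumes contiguous core IDs per socket; the config path is a
    hypothetical placeholder."""
    cores_each = cores_per_socket // mongods_per_socket
    if cores_each < 1:
        raise ValueError("more mongod processes than cores on a socket")
    commands = []
    index = 0
    for socket in range(sockets):
        for slot in range(mongods_per_socket):
            first = socket * cores_per_socket + slot * cores_each
            last = first + cores_each - 1
            cpus = f"{first}-{last}" if last > first else str(first)
            config = config_template.format(index=index)
            commands.append(
                f"numactl --physcpubind={cpus} --localalloc mongod -f {config}"
            )
            index += 1
    return commands


for line in numactl_commands(mongods_per_socket=2):
    print(line)
```

Because `--localalloc` allocates memory on the node where the process runs, keeping each daemon's cores within one socket keeps its memory accesses local.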
[Figure: "MEMORY LATENCY", latency (ns) versus dataset size (MB), series NUMA and UMA]
[Figure: "MEMORY BANDWIDTH (STREAM TRIAD)", bandwidth in MB/s for 1, 2 and 3 DIMMs per channel, series NUMA and UMA. Values as extracted: 186,943, 229,000, 191,303 and 49,378, 61,919, 43,124]

Kernel tuning
How Linux schedules the CPU workload
IO-intensive workload scheduling
_ Default in Linux
_ Small slices on the cores
_ Frequent migrations between cores
CPU-intensive workload scheduling (MongoDB)
_ Needs tuning/experimenting
_ Longer slices on the cores
_ Rare migrations between cores
Tuning for better CPU utilization
Can be achieved with a couple of small changes using sysctl:
_ kernel.sched_min_granularity_ns: set 2-10 times bigger
_ kernel.sched_migration_cost: set 2-10 times bigger
Use /proc/sched_debug, Intel PCM or any similar tool to find the optimal settings.
Tip: look for guidelines from your HW vendor on how to tune your BIOS settings for latency
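The "2-10 times bigger" rule can be turned into candidate settings to experiment with. The sketch below scales a set of current values by a chosen factor and prints them in sysctl.conf format. The input values are example readings, not recommendations: read your own with `sysctl -n`. The exact key names can also differ between kernel versions.

```python
def scaled_sysctl(current, factor):
    """Return sysctl.conf lines with each current value multiplied by factor."""
    if not 2 <= factor <= 10:
        raise ValueError("the slide suggests a factor between 2 and 10")
    return [f"{name} = {value * factor}" for name, value in sorted(current.items())]


# Example readings only (hypothetical); replace with the values of your host.
current_values = {
    "kernel.sched_min_granularity_ns": 3000000,
    "kernel.sched_migration_cost": 500000,
}

for line in scaled_sysctl(current_values, factor=4):
    print(line)
```

Trying a few factors and comparing the scheduler statistics after each run is in line with the slide's advice that this setting needs experimenting.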

Cgroups
Lightweight resource management
Mongod processes running on the same hardware compete for resources
_ Memory
One big pool -> competition for free pages
_ CPU
• Aggregation is really CPU intensive in our case
• Frequent context switching
Above a certain size of memory we had serious issues
Resource management for the services
_ Memory
Fine-grained memory allocation limits
_ CPUset
CPU binding like in NUMA
_ CPU
Resource sharing between tasks (reserve some resources for the operating system)

Cgroups
Tiered storage concept with resource management
_ MongoDB uses mmap to cache data in memory (<3.0)
• No good way to influence the caching
• With full scans, LRU eviction behaves like a FIFO queue
_ Example:
• 1. We have 200GB of data and 100GB of memory
• or
• 2. 200GB of data and 1GB of memory
• The scanning speed is the same
_ With cgroups the first case could be 40-50% faster.
_ With 50% of the data fitting in memory and 2 subsequent queries, 100% of the data is paged in and out
[Diagram: three steps of a scan moving through the data (in cache, paging, gap), for example "Query 2: progress at 70%"]

Cgroups
Tiered storage concept with resource management
_ Using many shards instead of one divides the work into smaller chunks
_ Define a high memory and a low memory cgroup and assign the shards to them
_ 40% served from memory, 60% from disk
_ The analogy can be applied to many tiers
• Memory -> SSD -> spinning disk
• High memory cgroup: all served from memory
• Low memory cgroup: all served from disk
[Diagram: query Q 1 split across the shards; the shards in the high memory cgroup are in cache]
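The claim that LRU degenerates into a FIFO queue under repeated full scans can be checked with a small simulation. The sketch below is a toy page cache, not MongoDB's or the kernel's actual implementation. It counts cache hits for two consecutive scans, first with one shared cache half the size of the data, then with the data split between a "high memory" group that fits entirely in its cache and a "low memory" group with almost no cache. The page counts are arbitrary.

```python
from collections import OrderedDict


def lru_hits(pages, capacity, scans):
    """Count cache hits when pages 0..pages-1 are scanned in order,
    `scans` times, through an LRU cache holding `capacity` pages."""
    cache = OrderedDict()
    hits = 0
    for _ in range(scans):
        for page in range(pages):
            if page in cache:
                hits += 1
                cache.move_to_end(page)  # mark as most recently used
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict least recently used
                cache[page] = True
    return hits


# One big pool: 200 pages of data, cache for 100 pages, two full scans.
shared = lru_hits(pages=200, capacity=100, scans=2)

# Two cgroups: half of the data fits fully in memory, the other half has
# (almost) no cache and is always read from disk.
high_memory = lru_hits(pages=100, capacity=100, scans=2)
low_memory = lru_hits(pages=100, capacity=1, scans=2)

print(shared, high_memory + low_memory)
```

In the shared case every page is evicted just before it is needed again, so the second scan gets no hits at all. With the split, the second scan finds half of the pages in memory for roughly the same amount of RAM.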

Database customization
And its results
Microsharding is a powerful way to reduce response times. What else can bring value?
Cgroups
Prevent shards from competing for memory when data
does not fit into RAM, especially with microsharding.
Low-memory cgroups may be compressed with zRAM/WiredTiger.
Kernel tuning
Optimize Linux for CPU-bound work (vs. IO-bound):
small readahead, THP off, larger task scheduler granularity and migration cost.
NUMA
Restrict access to CPU & memory for secondary daemons.
Striped replica set
Span shards on all the available hardware, with secondary
daemons replicated on different nodes for smooth failover.

4. Production benchmarks

Production response times
And their lessons learnt
Full scan aggregation is CPU-bound,
with a fixed entry cost for unwinds.
_ no unwind: 3s
_ 1, 2 or 3 unwinds: 70s
_ additional cost if more unwinds
The promise of interactive response times is met
on basic use-cases.
In the absence of concurrency, response times are
consistent across all tests.
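To illustrate what an unwind does to the data volume, the sketch below emulates the default behaviour of MongoDB's `$unwind` aggregation stage in plain Python: one output document per array element, with documents whose field is missing or empty dropped. The document shape and the field names (`coupons`, `taxes`) are hypothetical, and the field is assumed to hold a list when present. The sketch only shows the fan-out; it does not model the fixed entry cost reported on the slide.

```python
def unwind(documents, field):
    """Emulate $unwind: emit one copy of each document per element of the
    array in `field`; documents without the field or with an empty array
    are dropped."""
    for doc in documents:
        values = doc.get(field)
        if not values:
            continue
        for value in values:
            out = dict(doc)
            out[field] = value
            yield out


# Hypothetical documents: 2 tickets, each with 2 coupons and 3 taxes.
tickets = [
    {"ticket": 1, "coupons": ["A", "B"], "taxes": [10, 20, 30]},
    {"ticket": 2, "coupons": ["C", "D"], "taxes": [5, 15, 25]},
]

after_one = list(unwind(tickets, "coupons"))
after_two = list(unwind(after_one, "taxes"))
print(len(tickets), len(after_one), len(after_two))
```

Each additional unwind multiplies the number of documents the later pipeline stages have to process, which fits the observation that full scan aggregation is CPU-bound.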

5. Operability & Monitoring Tools

Operability & Monitoring
Tooling Architecture
Software Upgrade
Topology
Operability
Orchestrator
Alerting
Monitoring Data Store
Internal Tools

Orchestrator
1. Mount Servers
2.1 Network Setup
2.2 Create VM
2.3 Puppet Setup
3. Assign DNS names
4. Install OS and NoSQL store
5. Monitoring Setup
6. Ticket Tracker Setup
7. Tools Validation
8. Dev Validation
9. Handover to Ops
Phases: System Setup, Application Setup
Legend: Only for Physical Node, Only for VM, Common for all Data Stores

Monitoring

Monitoring
Architecture
[Diagram: components MMS, REST API, Parser, Merger, csv, CGI and Monitoring; implementation languages python and c++]

Monitoring
Demo

Alerting

Alerting
Architecture
[Diagram: components MMS, REST API, Alerting, Ticket, Pinger and Configuration; labels shell and TCP]

Operability

Operability
Architecture
[Diagram: components MMS, Operability Server, Operability Status, MCollective, Manual Action, MongoDB, REST API, Active MQ and SSH; implementation languages java and python]

Operability
Demo

You can follow us on:
AmadeusITGroup
amadeus.com/blog
amadeus.com
Thank you
