Tweaking performance 
on high-load projects
Dmitriy Dumanskiy 
Cogniance, mGage project 
Java Team Lead 
Java blog : habrahabr.ru/users/doom369/topics
Project evolution 
mGage 
mobclix 
XXXX
mGage delivery load 
3 billion req/month. 
~8 c3.xLarge Amazon instances. 
Average load : 3000 req/sec 
Peak : x10
Mobclix delivery load 
14 billion req/month. 
~16 c3.xLarge Amazon instances. 
Average load : 6000 req/sec 
Peak : x6
XXXX delivery Load 
20 billion req/month. 
~14 c3.xLarge Amazon instances. 
Average load : 11000 req/sec 
Peak : x6
Average load : 11000 req/sec 
Is it a lot?
Twitter : new tweets 
15 billion a month 
Average load : 5700 req/sec 
Peak : x30
Delivery load 

| Project | Requests per month | Max load per instance, req/sec | Requirements            | Servers, AWS c3.xLarge |
| mGage   | 3 billion          | 300                            | HTTP, time 95% < 60ms   | 8                      |
| mobclix | 14 billion         | 400                            | HTTP, time 95% < 100ms  | 16                     |
| XXXX    | 20 billion         | 1200                           | HTTPS, time 99% < 100ms | 14                     |
Delivery load 
c3.XLarge - 4 vCPU, 2.8 GHz Intel Xeon E5-2680 
Load average (LA) - ~2-3 
1-2 cores reserved for sudden peaks
BE tech stacks 
mobclix : 
Spring, iBatis, MySql, Solr, Vertica, Cascading, Tomcat 
mGage : 
Spring, Hibernate, Postgres, Distributed ehCache, Hadoop, Voldemort, Jboss 
XXXX : 
Spring, Hibernate, MySQL, Solr, Cascading, Redis, Tomcat
Initial problem 
● ~1000 req/sec 
● Peaks 6x 
● 99% HTTPS with response time < 100ms
Real problem 
● ~85 mln active users, ~115 mln registered users 
● 11.5 messages per user per day 
● ~11000 req/sec 
● Peaks 6x 
● 99% HTTPS with response time < 100ms 
● Reliable and scalable for future growth up to 80k
Architecture 
AdServer Console (UI) 
Reporting
Architecture 
SOLR Slave SOLR Slave SOLR Slave 
SOLR Master 
MySql 
Console (UI)
SOLR? Why? 
● Pros: 
○ Quick search on complex queries 
○ Has a lot of built-in features 
(master-slave replication, RDBMS 
integration) 
● Cons: 
○ Only HTTP; embedded mode performs 
worse 
○ Not easy for beginners 
○ Max load is ~100 req/sec
“Simple” query 
"-(-connectionTypes:"+"""+getConnectionType()+"""+" AND connectionTypes:[* TO 
*]) AND "+"-connectionTypeExcludes:"+"""+getConnectionType()+"""+" AND " + "-(- 
OSes:"+"(""+osQuery+"" OR ""+getOS()+"")"+" AND OSes:[* TO *]) AND " + "- 
osExcludes:"+"(""+osQuery+"" OR ""+getOS()+"")" "AND (runOfNetwork:T OR 
appIncludes:"+getAppId()+" OR pubIncludes:"+getPubId()+" OR categories: 
("+categoryList+"))" +" AND -appExcludes:"+getAppId()+" AND -pubExcludes:" 
+getPubId()+" AND -categoryExcludes:("+categoryList+") AND " + keywordQuery+" AND 
" + "-(-devices:"+"""+getHandsetNormalized()+"""+" AND devices:[* TO *]) AND " + 
"-deviceExcludes:"+"""+getHandsetNormalized()+"""+" AND " + "-(-carriers:"+""" 
+getCarrier()+"""+" AND carriers:[* TO *]) AND " + "-carrierExcludes:"+""" 
+getCarrier()+"""+" AND " + "-(-locales:"+"(""+locale+"" OR ""+langOnly+"")" 
+" AND locales:[* TO *]) AND " + "-localeExcludes:"+"(""+locale+"" OR "" 
+langOnly+"") AND " + "-(-segments:("+segmentQuery+") AND segments:[* TO *]) AND 
" + "-segmentExcludes:("+segmentQuery+")" + " AND -(-geos:"+geoQuery+" AND geos:[* 
TO *]) AND " + "-geosExcludes:"+geoQuery
Solr 
Index size < 1 GB - response time 20-30 ms 
Index size < 100 GB - response time 1-2 sec 
Index size < 400 GB - response time from 10 secs
AdServer - Solr Slave 
Delivery: 
volatile DeliveryData cache; 
Cron Job: 
DeliveryData tempCache = loadData(); 
cache = tempCache;
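The cron-job swap above can be sketched as a full class (a minimal sketch; `DeliveryData` and `loadData()` are hypothetical stand-ins for the real delivery model):

```java
public class AdServerCache {
    // volatile guarantees delivery threads see the fully built replacement
    private volatile DeliveryData cache = loadData();

    // Delivery path: lock-free read of the current snapshot
    public DeliveryData currentData() {
        return cache;
    }

    // Cron job: build the new snapshot off to the side, then publish it
    // with a single volatile write - no locks on the hot path.
    public void refresh() {
        DeliveryData tempCache = loadData();
        cache = tempCache;
    }

    static DeliveryData loadData() {
        // hypothetical: would pull fresh campaign data from the SOLR slave
        return new DeliveryData(System.currentTimeMillis());
    }

    // hypothetical stand-in for the real delivery model
    static class DeliveryData {
        final long loadedAt;
        DeliveryData(long loadedAt) { this.loadedAt = loadedAt; }
    }
}
```

Readers never block and never see a half-built cache; the old snapshot stays reachable until the last request using it finishes.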
Architecture 
AdServer 
SOLR Slave 
Solr Master 
MySql 
AdServer 
SOLR Slave 
AdServer 
SOLR Slave 
No-SQL
Why no-sql? 
● Realtime data 
● Quick response time 
● Simple queries by key 
● 1-2 queries to no-sql on every request. Average load 
10-20k req/sec and >120k req/sec in peaks. 
● Cheap solution
Why Redis? Pros 
● Easy and light-weight 
● Low latency. 99% is < 1ms 
● Average latency is ~0.2ms 
● Up to 100k 'get' commands per second 
on c1.X-Large 
● Cool features (atomic increments, sets, 
hashes) 
● Ready AWS service — ElastiCache
Why Redis? Cons 
● Single-threaded out of the box 
● Utilizing all cores requires sharding/clustering 
● Scaling/failover is not easy 
● Limited by max instance memory (240GB largest 
on AWS) 
● Persistence/swapping may delay response 
● Cluster solution not production ready 
● Possible data loss
DynamoDB vs Redis 

| Store                   | Price per month | Put, 95% | Get, 95% | Req/sec |
| DynamoDB                | $58             | 300ms    | 150ms    | 50      |
| DynamoDB                | $580            | 60ms     | 8ms      | 780     |
| DynamoDB                | $5800           | 16ms     | 8ms      | 1250    |
| Redis (c1.medium)       | $200            | 3ms      | <1ms     | 4000    |
| ElastiCache (c1.xlarge) | $600            | <1ms     | <1ms     | 10000   |
What about others? 
● Cassandra 
● Voldemort 
● Memcached 
● MongoDB
Redis RAM problem 
● 1 user entry ~ from 80 bytes to 3kb 
● ~85 mln users 
● Required RAM ~ from 1 GB to 300 GB
Data compression 
JSON → Kryo binary → 4x less data → 
Gzip → 2x less data == 8x less data 
Now we need < 40 GB 
+ Less load on network stack
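The Gzip half of the pipeline can be sketched with the JDK alone (Kryo is a third-party serializer, so it is left out here; the class name and the sample payload are assumptions):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class UserEntryCodec {
    // Gzip-compress a serialized user entry before storing it in Redis.
    public static byte[] gzip(byte[] raw) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(raw);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen in-memory
        }
    }

    // hypothetical user entry: repetitive field names compress well
    public static byte[] sampleEntry() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append("{\"adId\":").append(i)
              .append(",\"freqNum\":3,\"excluded\":false},");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

The win compounds: smaller values mean less RAM in Redis and fewer bytes per request on the network stack.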
Redis Hashes 
UID : adId 
: freqNum 
: created 
: excluded 
private long adId; 
private short freqNum; 
private int created; 
private boolean excluded;
Redis Hashes 
incrBy UID:freqNum 10; 
instead of 
get UID:freqNum; 
incrBy 10; 
set UID:FreqNum;
Redis Hashes 
incrBy UID:freqNum 10; 
instead of 
get UID:freqNum; 
incrBy 10; 
set UID:FreqNum; 
BUT
Redis Hashes 
incrBy UID:freqNum 10; 
instead of 
get UID:freqNum; 
incrBy 10; 
set UID:FreqNum; 
BUT 
hGetAll UID = O(N), where N - number of fields
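Why the single atomic command matters: get / incrBy / set is a read-modify-write race under concurrent delivery requests. The same hazard can be shown in-JVM with a plain map (an analogy, not Redis itself; `FreqCounter` is a hypothetical name):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class FreqCounter {
    private final ConcurrentHashMap<String, AtomicLong> counters =
        new ConcurrentHashMap<>();

    // Analogue of "incrBy UID:freqNum 10": one atomic operation,
    // safe under concurrent requests.
    public long incrBy(String key, long delta) {
        return counters.computeIfAbsent(key, k -> new AtomicLong())
                       .addAndGet(delta);
    }

    // Analogue of get / incrBy / set: two threads can read the same
    // value, and one of the two updates is silently lost.
    public long racyIncrBy(String key, long delta) {
        long current = counters.getOrDefault(key, new AtomicLong()).get();
        counters.put(key, new AtomicLong(current + delta));
        return current + delta;
    }
}
```

Redis hashes give the same atomicity per field (HINCRBY), at the cost of the O(N) hGetAll noted above.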
AdServer BE 
Average response time — ~1.2 ms 
Load — 1200 req/sec with LA ~4 
c3.XLarge == 4 vCPU
AdServer BE 
● Logging — 12% of time (5% on SSD); 
● Response generation — 15% of time; 
● Redis request — 50% of time; 
● All business logic — 23% of time;
Reporting 

AdServer → delivery logs → S3 → Hadoop ETL → aggregated logs → S3 → MySql → Console 
1 hour batch 
Log structure 
{ "uid":"test", 
"platform":"android", 
"app":"xxx", 
"ts":1375952275223, 
"pid":1, 
"education":"Some-Highschool-or-less", 
"type":"new", 
"sh":1280, 
"appver":"6.4.34", 
"country":"AU", 
"time":"Sat, 03 August 2013 10:30:39 +0200", 
"deviceGroup":7, 
"rid":"fc389d966438478e9554ed15d27713f51", 
"responseCode":200, 
"event":"ad", 
"device":"N95", 
"sw":768, 
"ageGroup":"18-24", 
"preferences":["beer","girls"] }
Log structure 
● 1 mln. records == 0.6 GB. 
● ~900 mln records a day == ~0.55 TB. 
● 1 month up to 20 TB of data. 
● Zipped data is 10 times less.
Reporting 
Customer: "And we need fancy reporting." 
But 20 TB of data per month is huge. So what 
can we do?
Reporting 
Dimensions: 
device, os, osVer, screenWidth, screenHeight, 
country, region, city, carrier, advertisingId, 
preferences, gender, age, income, sector, 
company, language, etc... 
Use case: 
I want to know how many users saw my ad in 
San Francisco.
Reporting 
Geo table: 
Country, City, Region, CampaignId, Date, counters; 
Device table: 
Device, Carrier, Platform, CampaignId, Date, counters; 
Uniques table: 
CampaignId, UID
Predefined report types → aggregation by 
predefined dimensions → 500-1000 times less 
data 
20 TB per month → 40 GB per month
Of course - hadoop 
● Pros: 
○ Unlimited (depends) horizontal scaling 
○ Amazon support 
● Cons: 
○ Not real-time 
○ Processing time depends directly on code quality 
and on infrastructure cost. 
○ Not all input can be scaled 
○ Cluster startup is so... long
Elastic MapReduce 
● Easy setup 
● Easy to extend 
● Easy to monitor
Alternatives? 
● Storm 
● Redshift 
● Vertica 
● Spark
Timing 
● Hadoop (cascading) : 
○ 25 GB in peak hour takes ~40min (-10 min). CSV 
output 300MB. With cluster of 4 c3.xLarge. 
● MySQL: 
○ Put 300MB in DB with insert statements ~40 min.
Timing 
● Hadoop (cascading) : 
○ 25 GB in peak hour takes ~40min (-10 min). CSV 
output 300MB. With cluster of 4 c3.xLarge. 
● MySQL: 
○ Put 300MB in DB with insert statements ~40 min. 
● MySQL: 
○ Put 300MB in DB with optimizations ~5 min.
Optimizations 
● No "insert into". Only "load data" - ~10 times faster 
● ENGINE=MyISAM vs INNODB when possible - ~5 
times faster 
● For "upsert" - temp table with ENGINE=MEMORY - IO 
savings
Why cascading? 
Hadoop Job 3 
Hadoop Job 2 
Hadoop Job 1 
Result of one job should be processed by another job
Lessons Learned
Facts 
● HTTP is ~2x faster than HTTPS 
● HTTPS keep-alive: +80% performance 
● Java 7 is ~40% faster than Java 6 (our case) 
● All IO operations minimized 
● Less OOP - better performance
Cost of IO 
L1 cache 3 cycles 
L2 cache 14 cycles 
RAM 250 cycles 
Disk 41 000 000 cycles 
Network 240 000 000 cycles
Cost of IO 
@Cacheable is everywhere
Java 7. Random 
return items.get(new Random().nextInt(items.size()))
Java 7. Random 
return items.get(ThreadLocalRandom.current().nextInt(items.size())) 
~3x faster
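The two variants side by side, as a runnable sketch (the `RandomPick` class and its method names are assumptions):

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

public class RandomPick {
    // Slow: allocates a new Random per call, and Random's internal
    // seed is updated with a CAS that contends across threads.
    static <T> T pickSlow(List<T> items) {
        return items.get(new Random().nextInt(items.size()));
    }

    // Fast: ThreadLocalRandom keeps one generator per thread - no
    // allocation and no seed contention (~3x in this deck's numbers).
    static <T> T pickFast(List<T> items) {
        return items.get(ThreadLocalRandom.current().nextInt(items.size()));
    }
}
```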
Java 7. Less garbage 
new ArrayList(): 
this.elementData = {}; 
insteadOf 
this.elementData = new Object[10]; 

new HashMap(): 
Entry<K,V>[] table = {}; 
insteadOf 
this.table = new Entry[16];
Java 7. Less garbage 
Before: 
class String { 
int offset; 
int count; 
char value[]; 
int hash; 
} 
After: 
class String { 
char value[]; 
int hash; 
}
Java 7. String 
● Substring 
● Split
Java 7. GC 
200 MB per second - 0.5% CPU time 
Smaller heap - better performance
Small tweaks. Date 
new Date() 
vs 
System.currentTimeMillis()
Small tweaks. SimpleDateFormat 
return new SimpleDateFormat("MMM yyyy HH:mm:ss Z").parse(dateString) 
~0.5 kb of garbage per call
Small tweaks. SimpleDateFormat 
● ThreadLocal 
● Joda - threadsafe DateTimeFormat 
● LocalDateTime - Java 8
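The ThreadLocal option from the list above can be sketched as follows (a minimal sketch; it uses Java 8's `ThreadLocal.withInitial` for brevity, and `DateParser` is a hypothetical name):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateParser {
    // SimpleDateFormat is not thread-safe and costs ~0.5 kb to build,
    // so keep one instance per thread instead of one per call.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(
            () -> new SimpleDateFormat("MMM yyyy HH:mm:ss Z", Locale.US));

    public static Date parse(String dateString) {
        try {
            return FORMAT.get().parse(dateString);
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad date: " + dateString, e);
        }
    }
}
```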
Small tweaks. Pattern 
public boolean isValid(String ip) { 
    Pattern pattern = Pattern.compile("xxx"); 
    Matcher matcher = pattern.matcher(ip); 
    return matcher.matches(); 
}
Small tweaks. Pattern 
final Pattern pattern = Pattern.compile("xxx"); 
final Matcher matcher = pattern.matcher(""); 

public boolean isValid(String ip) { 
    matcher.reset(ip); // note: a shared Matcher is not thread-safe 
    return matcher.matches(); 
}
Small tweaks. String.split 
item.getPreferences().split("[_,;,-]");
Small tweaks. String.split 
item.getPreferences().split("[_,;,-]"); 
vs 
static final Pattern PATTERN = Pattern.compile("[_,;,-]"); 
PATTERN.split(item.getPreferences()) - ~2x faster 
vs 
custom code - up to 5x faster
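The precompiled variant as a runnable sketch (the `PreferenceSplitter` class and the sample input are assumptions; the delimiter regex is the one from the slide):

```java
import java.util.regex.Pattern;

public class PreferenceSplitter {
    // Compile the delimiter regex once; String.split recompiles it
    // on every call for multi-character patterns like this one.
    private static final Pattern DELIMITERS = Pattern.compile("[_,;,-]");

    public static String[] split(String preferences) {
        return DELIMITERS.split(preferences);
    }
}
```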
Small tweaks. FOR loop 
for (A a : arrayListA) { 
// do something 
for (B b : arrayListB) { 
// do something 
for (C c : arrayListC) { 
// do something 
} 
} 
}
Small tweaks. FOR loop 
for (Iterator<A> i = arrayListA.iterator(); i.hasNext();) { 
a = i.next(); 
} 
public Iterator<E> iterator() { 
return new Itr(); 
} 
private class Itr implements Iterator<E> { 
int cursor = 0; 
int lastRet = -1; 
int expectedModCount = modCount; 
}
Small tweaks. FOR loop
Small tweaks. Primitives 
double coord = Double.valueOf(textLine);     // boxes, then unboxes 
Double coord2 = Double.parseDouble(textLine); // parses, then boxes
Small tweaks. LinkedList 
Just don’t use it
Small tweaks. Arrays 
Array -> IntArrayList -> List
Small tweaks. JIT 
You never know
Avoid concurrency 
JedisPool.getResource() - sync 
JedisPool.returnResource() - sync 
OutputStreamWriter.write() - sync 
UUID.randomUUID() - sync
Avoid concurrency 
JedisPool.getResource() 
JedisPool.returnResource() 
replace with 
ThreadLocal<JedisConnection>
Avoid concurrency 
ThreadLocal<JedisConnection> - requires 
~1000 open connections for Redis. 
More connections — slower redis response. 
Dead end.
Avoid concurrency 
OutputStreamWriter.write() 
● No flush() on every request; use a big buffered writer 
● Async writer 
No guarantee against data loss. 
Dead end.
Avoid concurrency 
OutputStreamWriter.write() 
Or buy SSD =) 
+30-60% on disk IO
Use latest versions 
Jedis 2.2.3 uses commons-pool 1.6 
Jedis 2.3 uses commons-pool 2.0 
commons-pool 2.0 - 2 times faster
Hadoop 
Map input : 300 MB 
Map output : 80 GB
Hadoop 
● mapreduce.map.output.compress = true 
● codecs: GZip, BZ2 - CPU intensive 
● codecs: LZO, Snappy 
● codecs: JNI 
~x10
Hadoop 
map(T value, ...) { 
Log log = parse(value); 
Data data = dbWrapper.getSomeMissingData(log.getCampId()); 
}
Hadoop 
Missing data: 
map(T value, ...) { 
Log log = parse(value); 
Data data = dbWrapper.getSomeMissingData(log.getCampId()); 
} 
Wrong
Hadoop 
map(T value, ...) { 
Log log = parse(value); 
Key resultKey = makeKey(log.getCampName(), ...); 
output.collect(resultKey, resultValue); 
}
Hadoop 
Unnecessary data: 
map(T value, ...) { 
Log log = parse(value); 
Key resultKey = makeKey(log.getCampName(), ...); 
output.collect(resultKey, resultValue); 
} 
Wrong
Hadoop 
RecordWriter.write(K key, V value) { 
Entity entity = makeEntity(key, value); 
dbWrapper.save(entity); 
}
Hadoop 
Minimize IO: 
RecordWriter.write(K key, V value) { 
Entity entity = makeEntity(key, value); 
dbWrapper.save(entity); 
} 
Wrong
Hadoop 
public boolean equals(Object obj) { 
EqualsBuilder equalsBuilder = new EqualsBuilder(); 
equalsBuilder.append(id, otherKey.getId()); 
... 
} 
public int hashCode() { 
HashCodeBuilder hashCodeBuilder = new HashCodeBuilder(); 
hashCodeBuilder.append(id); 
... 
}
Hadoop 
public boolean equals(Object obj) { 
EqualsBuilder equalsBuilder = new EqualsBuilder(); 
equalsBuilder.append(id, otherKey.getId()); 
... 
} 
public int hashCode() { 
HashCodeBuilder hashCodeBuilder = new HashCodeBuilder(); 
hashCodeBuilder.append(id); 
... 
} 
Wrong
Hadoop 
public void map(...) { 
… 
for (String word : words) { 
output.collect(new Text(word), new IntVal(1)); 
} 
}
Hadoop 
public void map(...) { 
… 
for (String word : words) { 
output.collect(new Text(word), new IntVal(1)); 
} 
} 
Wrong
Hadoop 
class MyMapper extends Mapper { 
    Text word = new Text(); 
    IntVal one = new IntVal(1); 

    public void map(...) { 
        for (String w : words) { 
            word.set(w); 
            output.collect(word, one); 
        } 
    } 
}
Network 
Per 1 AdServer instance: 
Incoming traffic: ~100 Mb/sec 
Outgoing traffic: ~50 Mb/sec 
LB total traffic: 
almost 10 Gb/sec
Amazon
AWS ElastiCache 
SLOWLOG GET 
1) 1) (integer) 35 
2) (integer) 1391709950 
3) (integer) 34155 
4) 1) "GET" 
2) "2ads10percent_rmywqesssitmfksetzvj" 
2) 1) (integer) 34 
2) (integer) 1391709830 
3) (integer) 34863 
4) 1) "GET" 
2) "2ads10percent_tteeoomiimcgdzcocuqs"
AWS ElastiCache 
35ms for a GET? WTF? 
Even Java is faster
AWS ElastiCache 
● Strange timeouts (with SO_TIMEOUT 50ms) 
● No replication for another cluster 
● «Cluster» is not a cluster 
● Cluster uses usual instances, so pay for 4 
cores while using 1
AWS Limits. You never know where 
● Network limit 
● PPS rate limit 
● LB limit 
● Cluster start time up to 20 mins 
● Scalability limits 
● S3 is slow for many files
