Tweaking performance 
on high-load projects
Dmitriy Dumanskiy 
Cogniance, mGage project 
Java Team Lead 
Java blog : habrahabr.ru/users/doom369/topics
Project evolution 
mGage 
mobclix 
XXXX
mGage delivery load 
3 billion req/month. 
~8 c3.xLarge Amazon instances. 
Average load : 3000 req/sec 
Peak : x10
Mobclix delivery load 
14 billion req/month. 
~16 c3.xLarge Amazon instances. 
Average load : 6000 req/sec 
Peak : x6
XXXX delivery Load 
20 billion req/month. 
~14 c3.xLarge Amazon instances. 
Average load : 11000 req/sec 
Peak : x6
Average load : 11000 req/sec 
Is it a lot?
Twitter : new tweets 
15 billion a month 
Average load : 5700 req/sec 
Peak : x30
Delivery load 

| Project | Requests per month | Max load per instance, req/sec | Requirements            | Servers, AWS c3.xLarge |
| mGage   | 3 billion          | 300                            | HTTP, time 95% < 60ms   | 8                      |
| mobclix | 14 billion         | 400                            | HTTP, time 95% < 100ms  | 16                     |
| XXXX    | 20 billion         | 1200                           | HTTPS, time 99% < 100ms | 14                     |
Delivery load 
c3.XLarge - 4 vCPU, 2.8 GHz Intel Xeon E5-2680 
Load average (LA) - ~2-3 
1-2 cores reserved for sudden peaks
BE tech stacks 
mobclix : 
Spring, iBatis, MySql, Solr, Vertica, Cascading, Tomcat 
mGage : 
Spring, Hibernate, Postgres, Distributed ehCache, Hadoop, Voldemort, Jboss 
XXXX : 
Spring, Hibernate, MySQL, Solr, Cascading, Redis, Tomcat
Initial problem 
● ~1000 req/sec 
● Peaks 6x 
● 99% HTTPS with response time < 100ms
Real problem 
● ~85 mln active users, ~115 mln registered users 
● 11.5 messages per user per day 
● ~11000 req/sec 
● Peaks 6x 
● 99% HTTPS with response time < 100ms 
● Reliable and scalable for future growth up to 80k
Architecture 
AdServer Console (UI) 
Reporting
Architecture 
SOLR Slave SOLR Slave SOLR Slave 
SOLR Master 
MySql 
Console (UI)
SOLR? Why? 
● Pros: 
○ Quick search on complex queries 
○ Has a lot of built-in features 
(master-slave replication, RDBMS 
integration) 
● Cons: 
○ Only HTTP; embedded mode performs 
worse 
○ Not easy for beginners 
○ Max load is ~100 req/sec
“Simple” query 
"-(-connectionTypes:"+"""+getConnectionType()+"""+" AND connectionTypes:[* TO 
*]) AND "+"-connectionTypeExcludes:"+"""+getConnectionType()+"""+" AND " + "-(- 
OSes:"+"(""+osQuery+"" OR ""+getOS()+"")"+" AND OSes:[* TO *]) AND " + "- 
osExcludes:"+"(""+osQuery+"" OR ""+getOS()+"")" "AND (runOfNetwork:T OR 
appIncludes:"+getAppId()+" OR pubIncludes:"+getPubId()+" OR categories: 
("+categoryList+"))" +" AND -appExcludes:"+getAppId()+" AND -pubExcludes:" 
+getPubId()+" AND -categoryExcludes:("+categoryList+") AND " + keywordQuery+" AND 
" + "-(-devices:"+"""+getHandsetNormalized()+"""+" AND devices:[* TO *]) AND " + 
"-deviceExcludes:"+"""+getHandsetNormalized()+"""+" AND " + "-(-carriers:"+""" 
+getCarrier()+"""+" AND carriers:[* TO *]) AND " + "-carrierExcludes:"+""" 
+getCarrier()+"""+" AND " + "-(-locales:"+"(""+locale+"" OR ""+langOnly+"")" 
+" AND locales:[* TO *]) AND " + "-localeExcludes:"+"(""+locale+"" OR "" 
+langOnly+"") AND " + "-(-segments:("+segmentQuery+") AND segments:[* TO *]) AND 
" + "-segmentExcludes:("+segmentQuery+")" + " AND -(-geos:"+geoQuery+" AND geos:[* 
TO *]) AND " + "-geosExcludes:"+geoQuery
Solr 
Index size < 1 GB - response time 20-30 ms 
Index size < 100 GB - response time 1-2 sec 
Index size < 400 GB - response time from 10 secs
AdServer - Solr Slave 
Delivery: 
volatile DeliveryData cache; 
Cron Job: 
DeliveryData tempCache = loadData(); 
cache = tempCache;
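The cron-job swap above can be sketched as a full class (a minimal sketch; `DeliveryData` and `loadData()` are hypothetical stand-ins for the real delivery model):

```java
public class AdServerCache {
    // volatile guarantees delivery threads see the fully built replacement
    private volatile DeliveryData cache = loadData();

    // Delivery path: lock-free read of the current snapshot
    public DeliveryData currentData() {
        return cache;
    }

    // Cron job: build the new snapshot off to the side, then publish it
    // with a single volatile write - no locks on the hot path.
    public void refresh() {
        DeliveryData tempCache = loadData();
        cache = tempCache;
    }

    static DeliveryData loadData() {
        // hypothetical: would pull fresh campaign data from the SOLR slave
        return new DeliveryData(System.currentTimeMillis());
    }

    // hypothetical stand-in for the real delivery model
    static class DeliveryData {
        final long loadedAt;
        DeliveryData(long loadedAt) { this.loadedAt = loadedAt; }
    }
}
```

Readers never block and never see a half-built cache; the old snapshot stays reachable until the last request using it finishes.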
Architecture 
AdServer 
SOLR Slave 
Solr Master 
MySql 
AdServer 
SOLR Slave 
AdServer 
SOLR Slave 
No-SQL
Why no-sql? 
● Realtime data 
● Quick response time 
● Simple queries by key 
● 1-2 queries to no-sql on every request. Average load 
10-20k req/sec and >120k req/sec in peaks. 
● Cheap solution
Why Redis? Pros 
● Easy and light-weight 
● Low latency. 99% is < 1ms 
● Average latency is ~0.2ms 
● Up to 100k 'get' commands per second 
on c1.X-Large 
● Cool features (atomic increments, sets, 
hashes) 
● Ready AWS service — ElastiCache
Why Redis? Cons 
● Single-threaded out of the box 
● Utilizing all cores requires sharding/clustering 
● Scaling/failover is not easy 
● Limited by max instance memory (240GB largest 
on AWS) 
● Persistence/swapping may delay response 
● Cluster solution not production ready 
● Possible data loss
DynamoDB vs Redis 

| Store                   | Price per month | Put, 95% | Get, 95% | Req/sec |
| DynamoDB                | $58             | 300ms    | 150ms    | 50      |
| DynamoDB                | $580            | 60ms     | 8ms      | 780     |
| DynamoDB                | $5800           | 16ms     | 8ms      | 1250    |
| Redis (c1.medium)       | $200            | 3ms      | <1ms     | 4000    |
| ElastiCache (c1.xlarge) | $600            | <1ms     | <1ms     | 10000   |
What about others? 
● Cassandra 
● Voldemort 
● Memcached 
● MongoDB
Redis RAM problem 
● 1 user entry ~ from 80 bytes to 3kb 
● ~85 mln users 
● Required RAM ~ from 1 GB to 300 GB
Data compression 
JSON → Kryo binary → 4x less data → 
Gzip → 2x less data == 8x less data 
Now we need < 40 GB 
+ Less load on network stack
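The Gzip half of the pipeline can be sketched with the JDK alone (Kryo is a third-party serializer, so it is left out here; the class name and the sample payload are assumptions):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class UserEntryCodec {
    // Gzip-compress a serialized user entry before storing it in Redis.
    public static byte[] gzip(byte[] raw) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(raw);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen in-memory
        }
    }

    // hypothetical user entry: repetitive field names compress well
    public static byte[] sampleEntry() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append("{\"adId\":").append(i)
              .append(",\"freqNum\":3,\"excluded\":false},");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

The win compounds: smaller values mean less RAM in Redis and fewer bytes per request on the network stack.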
Redis Hashes 
UID : adId 
: freqNum 
: created 
: excluded 
private long adId; 
private short freqNum; 
private int created; 
private boolean excluded;
Redis Hashes 
incrBy UID:freqNum 10; 
instead of 
get UID:freqNum; 
incrBy 10; 
set UID:FreqNum;
Redis Hashes 
incrBy UID:freqNum 10; 
instead of 
get UID:freqNum; 
incrBy 10; 
set UID:FreqNum; 
BUT
Redis Hashes 
incrBy UID:freqNum 10; 
instead of 
get UID:freqNum; 
incrBy 10; 
set UID:FreqNum; 
BUT 
hGetAll UID = O(N), where N - number of fields
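Why the single atomic command matters: get / incrBy / set is a read-modify-write race under concurrent delivery requests. The same hazard can be shown in-JVM with a plain map (an analogy, not Redis itself; `FreqCounter` is a hypothetical name):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class FreqCounter {
    private final ConcurrentHashMap<String, AtomicLong> counters =
        new ConcurrentHashMap<>();

    // Analogue of "incrBy UID:freqNum 10": one atomic operation,
    // safe under concurrent requests.
    public long incrBy(String key, long delta) {
        return counters.computeIfAbsent(key, k -> new AtomicLong())
                       .addAndGet(delta);
    }

    // Analogue of get / incrBy / set: two threads can read the same
    // value, and one of the two updates is silently lost.
    public long racyIncrBy(String key, long delta) {
        long current = counters.getOrDefault(key, new AtomicLong()).get();
        counters.put(key, new AtomicLong(current + delta));
        return current + delta;
    }
}
```

Redis hashes give the same atomicity per field (HINCRBY), at the cost of the O(N) hGetAll noted above.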
AdServer BE 
Average response time — ~1.2 ms 
Load — 1200 req/sec with LA ~4 
c3.XLarge == 4 vCPU
AdServer BE 
● Logging — 12% of time (5% on SSD); 
● Response generation — 15% of time; 
● Redis request — 50% of time; 
● All business logic — 23% of time;
Reporting 

AdServer → delivery logs → S3 → Hadoop ETL → aggregated logs → S3 → MySql → Console 
1 hour batch 
Log structure 
{ "uid":"test", 
"platform":"android", 
"app":"xxx", 
"ts":1375952275223, 
"pid":1, 
"education":"Some-Highschool-or-less", 
"type":"new", 
"sh":1280, 
"appver":"6.4.34", 
"country":"AU", 
"time":"Sat, 03 August 2013 10:30:39 +0200", 
"deviceGroup":7, 
"rid":"fc389d966438478e9554ed15d27713f51", 
"responseCode":200, 
"event":"ad", 
"device":"N95", 
"sw":768, 
"ageGroup":"18-24", 
"preferences":["beer","girls"] }
Log structure 
● 1 mln. records == 0.6 GB. 
● ~900 mln records a day == ~0.55 TB. 
● 1 month up to 20 TB of data. 
● Zipped data is 10 times less.
Reporting 
Customer: "And we need fancy reporting." 
But 20 TB of data per month is huge. So what 
can we do?
Reporting 
Dimensions: 
device, os, osVer, screenWidth, screenHeight, 
country, region, city, carrier, advertisingId, 
preferences, gender, age, income, sector, 
company, language, etc... 
Use case: 
I want to know how many users saw my ad in 
San Francisco.
Reporting 
Geo table: 
Country, City, Region, CampaignId, Date, counters; 
Device table: 
Device, Carrier, Platform, CampaignId, Date, counters; 
Uniques table: 
CampaignId, UID
Predefined report types → aggregation by 
predefined dimensions → 500-1000 times less 
data 
20 TB per month → 40 GB per month
Of course - hadoop 
● Pros: 
○ Unlimited (depends) horizontal scaling 
○ Amazon support 
● Cons: 
○ Not real-time 
○ Processing time depends directly on code quality 
and on infrastructure cost. 
○ Not all input can be scaled 
○ Cluster startup is so... long
Elastic MapReduce 
● Easy setup 
● Easy to extend 
● Easy to monitor
Alternatives? 
● Storm 
● Redshift 
● Vertica 
● Spark
Timing 
● Hadoop (cascading) : 
○ 25 GB in peak hour takes ~40min (-10 min). CSV 
output 300MB. With cluster of 4 c3.xLarge. 
● MySQL: 
○ Put 300MB in DB with insert statements ~40 min.
Timing 
● Hadoop (cascading) : 
○ 25 GB in peak hour takes ~40min (-10 min). CSV 
output 300MB. With cluster of 4 c3.xLarge. 
● MySQL: 
○ Put 300MB in DB with insert statements ~40 min. 
● MySQL: 
○ Put 300MB in DB with optimizations ~5 min.
Optimizations 
● No "insert into". Only "load data" - ~10 times faster 
● ENGINE=MyISAM vs INNODB when possible - ~5 
times faster 
● For "upsert" - temp table with ENGINE=MEMORY - IO 
savings
Why cascading? 
Hadoop Job 3 
Hadoop Job 2 
Hadoop Job 1 
Result of one job should be processed by another job
Lessons Learned
Facts 
● HTTP is ~2x faster than HTTPS 
● HTTPS keep-alive: +80% performance 
● Java 7 is ~40% faster than Java 6 (our case) 
● All IO operations minimized 
● Less OOP - better performance
Cost of IO 
L1 cache 3 cycles 
L2 cache 14 cycles 
RAM 250 cycles 
Disk 41 000 000 cycles 
Network 240 000 000 cycles
Cost of IO 
@Cacheable is everywhere
Java 7. Random 
return items.get(new Random().nextInt(items.size()))
Java 7. Random 
return items.get(ThreadLocalRandom.current().nextInt(items.size())) 
~3x faster
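The two variants side by side, as a runnable sketch (the `RandomPick` class and its method names are assumptions):

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

public class RandomPick {
    // Slow: allocates a new Random per call, and Random's internal
    // seed is updated with a CAS that contends across threads.
    static <T> T pickSlow(List<T> items) {
        return items.get(new Random().nextInt(items.size()));
    }

    // Fast: ThreadLocalRandom keeps one generator per thread - no
    // allocation and no seed contention (~3x in this deck's numbers).
    static <T> T pickFast(List<T> items) {
        return items.get(ThreadLocalRandom.current().nextInt(items.size()));
    }
}
```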
Java 7. Less garbage 
new ArrayList(): 
this.elementData = {}; 
insteadOf 
this.elementData = new Object[10]; 

new HashMap(): 
Entry<K,V>[] table = {}; 
insteadOf 
this.table = new Entry[16];
Java 7. Less garbage 
Before: 
class String { 
int offset; 
int count; 
char value[]; 
int hash; 
} 
After: 
class String { 
char value[]; 
int hash; 
}
Java 7. String 
● Substring 
● Split
Java 7. GC 
200 MB per second - 0.5% CPU time 
Smaller heap - better performance
Small tweaks. Date 
new Date() 
vs 
System.currentTimeMillis()
Small tweaks. SimpleDateFormat 
return new SimpleDateFormat("MMM yyyy HH:mm:ss Z").parse(dateString) 
~0.5 kb of garbage per call
Small tweaks. SimpleDateFormat 
● ThreadLocal 
● Joda - threadsafe DateTimeFormat 
● LocalDateTime - Java 8
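The ThreadLocal option from the list above can be sketched as follows (a minimal sketch; it uses Java 8's `ThreadLocal.withInitial` for brevity, and `DateParser` is a hypothetical name):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateParser {
    // SimpleDateFormat is not thread-safe and costs ~0.5 kb to build,
    // so keep one instance per thread instead of one per call.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(
            () -> new SimpleDateFormat("MMM yyyy HH:mm:ss Z", Locale.US));

    public static Date parse(String dateString) {
        try {
            return FORMAT.get().parse(dateString);
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad date: " + dateString, e);
        }
    }
}
```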
Small tweaks. Pattern 
public boolean isValid(String ip) { 
    Pattern pattern = Pattern.compile("xxx"); 
    Matcher matcher = pattern.matcher(ip); 
    return matcher.matches(); 
}
Small tweaks. Pattern 
final Pattern pattern = Pattern.compile("xxx"); 
final Matcher matcher = pattern.matcher(""); 

public boolean isValid(String ip) { 
    matcher.reset(ip); // note: a shared Matcher is not thread-safe 
    return matcher.matches(); 
}
Small tweaks. String.split 
item.getPreferences().split("[_,;,-]");
Small tweaks. String.split 
item.getPreferences().split("[_,;,-]"); 
vs 
static final Pattern PATTERN = Pattern.compile("[_,;,-]"); 
PATTERN.split(item.getPreferences()) - ~2x faster 
vs 
custom code - up to 5x faster
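The precompiled variant as a runnable sketch (the `PreferenceSplitter` class and the sample input are assumptions; the delimiter regex is the one from the slide):

```java
import java.util.regex.Pattern;

public class PreferenceSplitter {
    // Compile the delimiter regex once; String.split recompiles it
    // on every call for multi-character patterns like this one.
    private static final Pattern DELIMITERS = Pattern.compile("[_,;,-]");

    public static String[] split(String preferences) {
        return DELIMITERS.split(preferences);
    }
}
```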
Small tweaks. FOR loop 
for (A a : arrayListA) { 
// do something 
for (B b : arrayListB) { 
// do something 
for (C c : arrayListC) { 
// do something 
} 
} 
}
Small tweaks. FOR loop 
for (Iterator<A> i = arrayListA.iterator(); i.hasNext();) { 
a = i.next(); 
} 
public Iterator<E> iterator() { 
return new Itr(); 
} 
private class Itr implements Iterator<E> { 
int cursor = 0; 
int lastRet = -1; 
int expectedModCount = modCount; 
}
Small tweaks. FOR loop
Small tweaks. Primitives 
double coord = Double.valueOf(textLine);     // boxes, then unboxes 
Double coord2 = Double.parseDouble(textLine); // parses, then boxes
Small tweaks. LinkedList 
Just don’t use it
Small tweaks. Arrays 
Array -> IntArrayList -> List
Small tweaks. JIT 
You never know
Avoid concurrency 
JedisPool.getResource() - sync 
JedisPool.returnResource() - sync 
OutputStreamWriter.write() - sync 
UUID.randomUUID() - sync
Avoid concurrency 
JedisPool.getResource() 
JedisPool.returnResource() 
replace with 
ThreadLocal<JedisConnection>
Avoid concurrency 
ThreadLocal<JedisConnection> - requires 
~1000 open connections for Redis. 
More connections — slower redis response. 
Dead end.
Avoid concurrency 
OutputStreamWriter.write() 
● No flush() on every request; use a big buffered writer 
● Async writer 
No guarantee against data loss. 
Dead end.
Avoid concurrency 
OutputStreamWriter.write() 
Or buy SSD =) 
+30-60% on disk IO
Use latest versions 
Jedis 2.2.3 uses commons-pool 1.6 
Jedis 2.3 uses commons-pool 2.0 
commons-pool 2.0 - 2 times faster
Hadoop 
Map input : 300 MB 
Map output : 80 GB
Hadoop 
● mapreduce.map.output.compress = true 
● codecs: GZip, BZ2 - CPU intensive 
● codecs: LZO, Snappy 
● codecs: JNI 
~x10
Hadoop 
map(T value, ...) { 
Log log = parse(value); 
Data data = dbWrapper.getSomeMissingData(log.getCampId()); 
}
Hadoop 
Missing data: 
map(T value, ...) { 
Log log = parse(value); 
Data data = dbWrapper.getSomeMissingData(log.getCampId()); 
} 
Wrong
Hadoop 
map(T value, ...) { 
Log log = parse(value); 
Key resultKey = makeKey(log.getCampName(), ...); 
output.collect(resultKey, resultValue); 
}
Hadoop 
Unnecessary data: 
map(T value, ...) { 
Log log = parse(value); 
Key resultKey = makeKey(log.getCampName(), ...); 
output.collect(resultKey, resultValue); 
} 
Wrong
Hadoop 
RecordWriter.write(K key, V value) { 
Entity entity = makeEntity(key, value); 
dbWrapper.save(entity); 
}
Hadoop 
Minimize IO: 
RecordWriter.write(K key, V value) { 
Entity entity = makeEntity(key, value); 
dbWrapper.save(entity); 
} 
Wrong
Hadoop 
public boolean equals(Object obj) { 
EqualsBuilder equalsBuilder = new EqualsBuilder(); 
equalsBuilder.append(id, otherKey.getId()); 
... 
} 
public int hashCode() { 
HashCodeBuilder hashCodeBuilder = new HashCodeBuilder(); 
hashCodeBuilder.append(id); 
... 
}
Hadoop 
public boolean equals(Object obj) { 
EqualsBuilder equalsBuilder = new EqualsBuilder(); 
equalsBuilder.append(id, otherKey.getId()); 
... 
} 
public int hashCode() { 
HashCodeBuilder hashCodeBuilder = new HashCodeBuilder(); 
hashCodeBuilder.append(id); 
... 
} 
Wrong
Hadoop 
public void map(...) { 
… 
for (String word : words) { 
output.collect(new Text(word), new IntVal(1)); 
} 
}
Hadoop 
public void map(...) { 
… 
for (String word : words) { 
output.collect(new Text(word), new IntVal(1)); 
} 
} 
Wrong
Hadoop 
class MyMapper extends Mapper { 
    Text word = new Text(); 
    IntVal one = new IntVal(1); 

    public void map(...) { 
        for (String w : words) { 
            word.set(w); 
            output.collect(word, one); 
        } 
    } 
}
Network 
Per 1 AdServer instance: 
Incoming traffic: ~100 Mb/sec 
Outgoing traffic: ~50 Mb/sec 
LB total traffic: 
almost 10 Gb/sec
Amazon
AWS ElastiCache 
SLOWLOG GET 
1) 1) (integer) 35 
2) (integer) 1391709950 
3) (integer) 34155 
4) 1) "GET" 
2) "2ads10percent_rmywqesssitmfksetzvj" 
2) 1) (integer) 34 
2) (integer) 1391709830 
3) (integer) 34863 
4) 1) "GET" 
2) "2ads10percent_tteeoomiimcgdzcocuqs"
AWS ElastiCache 
35ms for a GET? WTF? 
Even Java is faster
AWS ElastiCache 
● Strange timeouts (with SO_TIMEOUT 50ms) 
● No replication for another cluster 
● «Cluster» is not a cluster 
● Cluster uses usual instances, so pay for 4 
cores while using 1
AWS Limits. You never know where 
● Network limit 
● PPS rate limit 
● LB limit 
● Cluster start time up to 20 mins 
● Scalability limits 
● S3 is slow for many files
