3. Problem
• Flow: incoming POI -> Deduping -> Update or Add
• Daily incremental: 1 million POI
• Master DB: 23 million POI
4. Problem
• Process, for each target POI:
1) Get candidates {POI: distance < 100 meters} from the Master DB
a. Use the grid index
2) Compare the target with the candidates
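The two-step process above can be sketched as follows. This is a hedged illustration only: `Poi`, the ~100 m cell size, and the flat-earth distance math are assumptions, not the actual Telenav classes.

```java
import java.util.*;

/** Illustrative grid-index dedup sketch; not the real DAL classes. */
public class GridDedupSketch {
    // ~100 m expressed in degrees of latitude (rough; ignores longitude shrinkage)
    static final double CELL = 100.0 / 111_000.0;

    static class Poi {
        final long id; final double lat, lon;
        Poi(long id, double lat, double lon) { this.id = id; this.lat = lat; this.lon = lon; }
    }

    final Map<Long, List<Poi>> grid = new HashMap<>();

    static long key(long cx, long cy) { return (cx << 32) ^ (cy & 0xffffffffL); }

    void add(Poi p) {
        long cx = (long) Math.floor(p.lon / CELL), cy = (long) Math.floor(p.lat / CELL);
        grid.computeIfAbsent(key(cx, cy), k -> new ArrayList<>()).add(p);
    }

    /** Step 1: candidates = POIs in the 3x3 cells around the target (covers distance < 100 m). */
    List<Poi> candidates(Poi t) {
        long cx = (long) Math.floor(t.lon / CELL), cy = (long) Math.floor(t.lat / CELL);
        List<Poi> out = new ArrayList<>();
        for (long dx = -1; dx <= 1; dx++)
            for (long dy = -1; dy <= 1; dy++)
                out.addAll(grid.getOrDefault(key(cx + dx, cy + dy), Collections.emptyList()));
        return out;
    }

    /** Step 2: compare the target with each candidate (flat-earth distance is fine at 100 m). */
    List<Poi> duplicates(Poi t) {
        List<Poi> dups = new ArrayList<>();
        for (Poi c : candidates(t)) {
            double dLat = (t.lat - c.lat) * 111_000.0;
            double dLon = (t.lon - c.lon) * 111_000.0 * Math.cos(Math.toRadians(t.lat));
            if (Math.hypot(dLat, dLon) < 100.0) dups.add(c);
        }
        return dups;
    }
}
```

The grid keeps the candidate set small, so the expensive pairwise comparison only runs against nearby POIs.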
5. Problem
• DB queries are time-consuming, per the Content Team's experience:
– At 10ms/POI, 1 million POI needs about 2.7 hours (DB query alone)
– At 100ms/POI, 1 million POI needs about 27 hours (DB query alone) – and this is a daily update!
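A trivial sketch (the helper is hypothetical) confirms the arithmetic behind those estimates:

```java
// Back-of-envelope check of the per-POI cost numbers above.
public class Envelope {
    /** Hours of pure DB time for `pois` queries at `msPerPoi` milliseconds each. */
    static double hours(double msPerPoi, long pois) {
        return msPerPoi * pois / 1000.0 / 3600.0;
    }

    public static void main(String[] args) {
        System.out.printf("10ms/POI:  %.1f hours%n", hours(10, 1_000_000));
        System.out.printf("100ms/POI: %.1f hours%n", hours(100, 1_000_000));
    }
}
```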
6. Proposal
• Build a local cache
• Multi-threading (multiple boxes, Map-Reduce)
• Separate DB query and dedup computation
• Single SQL tuning
7. Single SQL Running: DAL VS JDBC
//DAL
CpPoiWorkDao dao = CpPoiWorkDao.getInstance();
List<IPoi> poi = dao.getAllPois(PoiDataSet.CS, Configs.isActive);
//JDBC
Statement statement = connect.createStatement();
ResultSet rs = statement.executeQuery("select * from cs_1");
//running: loader class, elapsed time, records loaded
com.telenav.content.impl.JdbcPoiLoader 0:00:04.062 42985
com.telenav.content.impl.PoiLoader 0:00:10.969 42985
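The timing method is not shown on the slide; a minimal harness along these lines (names assumed, not the actual test code) is enough to reproduce the comparison and to expose the warm-up effect discussed next:

```java
import java.util.concurrent.Callable;

/** Hedged sketch: time the same query several times so the first (warm-up) run
 *  can be compared with the steady state; `query` stands in for the DAL or JDBC call. */
public class QueryTimer {
    static long[] timeRunsMillis(Callable<?> query, int runs) throws Exception {
        long[] out = new long[runs];
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            query.call();                       // the query under test
            out[i] = (System.nanoTime() - t0) / 1_000_000;
        }
        return out; // out[0] includes one-time costs: class loading, metadata query, cache fill
    }
}
```

Reporting all runs, not just the first, is what separates a fixed warm-up cost from a genuine per-query overhead.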
9. The truth
• DAL needs a 'warm-up' (one extra query):
select id as id, table_set_name as table_set_name, current_work_suffix as current_work_suffix,
current_live_suffix as current_live_suffix, table_set_size as table_set_size, update_time as update_time,
create_time as create_time from active_table where table_set_name=?
Run        JDBC         DAL
First run  0:00:04.125  0:00:09.360
2          3187 ms      4797 ms
3          3297 ms      4672 ms
4          3265 ms      4828 ms
5          3297 ms      4828 ms
6          3344 ms      4891 ms
10. Second SQL Running
select POI_RECORD_ID, POI_ID, LATITUDE, …, locality, locale
from us_ta_1
where node_index in ( ?, ?, … ? )
Run        JDBC    DAL
First run  375 ms  1156 ms
2          406 ms  313 ms
3          375 ms  281 ms
4          391 ms  375 ms
5          375 ms  266 ms
6          406 ms  297 ms
First declaration: DAL is slower than JDBC; there is some performance loss in DAL.
11. Benchmark Data
• It's slow, but how slow is it?
– A single SQL is only a smoke test; we want real data
12. Benchmark Data
• Test Case
– Run 10k POI; for each POI:
– DedupWorkPoiDao.getAdjacentDedupPois to get candidate POIs for matching
– IDeDuper.getDuplications(target, candidate) to find matches among the candidates
– 100-meter distance
– Repeat the test 3 times
• Test Result
Process Time        DB Time  Dedup Time  Dedup candidate POI size  Matched POI percent
0:01:46 (10ms/POI)  4ms      6ms         51                        0.63
6387 POI were matched.
14. The truth
• The DB is fast because of caching
#  Distance  Process Time         DB Parameters  DB Time  Dedup Time  Candidate size  Matched percent
   100       2min30s total, 14ms  4              4ms      9ms         80              0.68
1  500       30m total, 180ms     37             128ms    51ms        474             0.79
2  500       11m38s total, 69ms   37             18ms     51ms        474
With 37 nodes in a single query, each POI must be compared with 474 candidates.
The second (later) run is much faster than the first run.
15. The truth
• Cleaned the MySQL cache & restarted MySQL
– key_buffer_size: 500m -> 8 bytes
– query_cache_size: 64m -> 0
• No effect; the DB query is still fast.
– The first-run time cannot be reproduced for the same data set.
16. The truth
• Clean the OS (Linux) file cache:
– echo 3 > /proc/sys/vm/drop_caches
• Test Result
Process Time           DB Time  Dedup Time   Dedup candidate POI size  Matched POI percent
0:01:46 (10ms/POI)     4ms      6ms          51                        0.63
0:22:58.844 (db only)  137ms    / (removed)  51                        /
30 times slower when the OS file cache is cleared.
Second declaration:
Dedup is the most important factor in the process; the DB is not the bottleneck.
17. Mysql Index Preloading
• MySQL index preloading:
– key_buffer_size 4096m
– load index into cache us_ta_1 (INDEX_NODEX_INDEX);
• Nearly no effect; the DB query time is almost the same.
18. Data file is bottleneck
• The key index does not seem to help; is the bottleneck in data-file reading (an assumption)?
• Verify:
– 1) Reorder the 23 million records along a Hilbert curve, so that neighboring POIs are also adjacent on disk, reducing disk seeks
– 2) Build a new table in which each row is <node, POIs in the node>, reducing I/O operations when reading one node's POIs
19. Data file is bottleneck
• Re-order POI in the DB:
insert into us_ta_2 (select * from us_ta_1 order by node_index)
• Test Result
Run                    Process Time           DB Time  Dedup Time   Candidate size  Matched percent
Baseline (warm cache)  0:01:46 (10ms/POI)     4ms      6ms          51              0.63
First run (original)   0:22:58.844 (db only)  137ms    / (removed)  51              /
First run (reordered)  0:03:10.985 (db only)  19ms     /            51              /
Later run (reordered)  0:00:46.360 (db only)  4ms      / (removed)  51              /
20. Multiple-Thread
• DB only
Threads    Process Time (db only)  DB Time  Dedup Time  Candidate size  Matched percent
1 thread   0:03:10.985             19ms     /           51              /
4 threads  0:01:05.406             24ms     /           51              /
8 threads  0:00:38.328             29ms     /           51              /
• DB & Dedup
Threads                        Process Time  DB Time  Dedup Time  Candidate size  Matched percent
1 thread                       0:04:07.125   18ms     5ms         51              0.6387
4 threads db, 2 threads dedup  0:01:11.328   25ms     9ms         51              0.6387
4 threads db, 1 thread dedup   0:01:22.953   28ms     7ms         51              0.6387
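The "N db threads, M dedup threads" split can be sketched with a bounded queue between the two stages. `fetchCandidates` and `countMatches` below are stand-ins for the real DAO and IDeDuper calls, not the actual implementation:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

/** Hedged sketch: DB-query threads produce (poi, candidates) pairs into a queue;
 *  dedup threads consume them, so neither stage blocks the other. */
public class DbDedupPipeline {
    static List<Long> fetchCandidates(long poi) {          // pretend DB query
        return Arrays.asList(poi, poi + 1, poi + 2);
    }
    static long countMatches(long poi, List<Long> cand) {  // pretend dedup matcher
        return cand.stream().filter(c -> c != poi).count();
    }

    static long run(List<Long> pois, int dbThreads, int dedupThreads) throws Exception {
        BlockingQueue<Map.Entry<Long, List<Long>>> queue = new ArrayBlockingQueue<>(1024);
        AtomicLong matched = new AtomicLong();
        ExecutorService db = Executors.newFixedThreadPool(dbThreads);
        ExecutorService dd = Executors.newFixedThreadPool(dedupThreads);

        for (int i = 0; i < dedupThreads; i++)
            dd.submit(() -> {
                try {
                    while (true) {
                        Map.Entry<Long, List<Long>> e = queue.take();
                        if (e.getKey() < 0) break;           // poison pill ends the consumer
                        matched.addAndGet(countMatches(e.getKey(), e.getValue()));
                    }
                } catch (InterruptedException ignored) { }
            });

        for (long poi : pois)
            db.submit(() -> {
                try { queue.put(Map.entry(poi, fetchCandidates(poi))); }
                catch (InterruptedException ignored) { }
            });
        db.shutdown();
        db.awaitTermination(1, TimeUnit.MINUTES);
        for (int i = 0; i < dedupThreads; i++)               // one pill per dedup thread
            queue.put(Map.entry(-1L, Collections.emptyList()));
        dd.shutdown();
        dd.awaitTermination(1, TimeUnit.MINUTES);
        return matched.get();
    }
}
```

The bounded queue also gives back-pressure: if dedup falls behind, the DB threads block instead of piling up results in memory.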
21. Another assumption
Assumption:
Building a local cache and processing POIs in Hilbert-curve order would help greatly.
Cache:
<node, POIs in the node>
DB Query:
Get the POIs in the given nodes
Query:
– Pick the nodes that are already in the local cache
– DB query for the nodes that are not in the local cache
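That query logic might look like the following sketch; `NodeCache` and its fake `queryDb` payload are illustrative assumptions, not the project's code:

```java
import java.util.*;

/** Sketch of the proposed node-level cache in front of the DB. */
public class NodeCache {
    final Map<Long, List<Long>> cache = new HashMap<>();
    int dbCalls = 0; // counts DB round trips, so the cache's effect is visible

    /** Stands in for "select ... from us_ta_1 where node_index in (...)". */
    List<Long> queryDb(List<Long> nodes) {
        dbCalls++;
        List<Long> out = new ArrayList<>();
        for (long n : nodes) {
            List<Long> pois = Arrays.asList(n * 10, n * 10 + 1); // two fake POIs per node
            cache.put(n, pois);                                  // fill the cache on the way
            out.addAll(pois);
        }
        return out;
    }

    /** Serve cached nodes locally; go to the DB only for the misses. */
    List<Long> poisIn(List<Long> nodes) {
        List<Long> result = new ArrayList<>();
        List<Long> misses = new ArrayList<>();
        for (long n : nodes) {
            if (cache.containsKey(n)) result.addAll(cache.get(n));
            else misses.add(n);
        }
        if (!misses.isEmpty()) result.addAll(queryDb(misses));
        return result;
    }
}
```

Processing POIs in Hilbert-curve order should make consecutive lookups hit the same nodes, raising the cache hit ratio.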
22. Hilbert Curve
Gives a mapping between 1-D and 2-D space that preserves locality fairly well.
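For reference, the textbook bitwise form of that mapping, for an n-by-n grid with n a power of two; this is the standard algorithm, not code from the project:

```java
/** Textbook Hilbert-curve mapping: (x, y) in an n-by-n grid (n a power of two)
 *  to the 1-D curve index. Sorting POIs by this index keeps spatial neighbors close. */
public class Hilbert {
    static long xy2d(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            // rotate/flip the quadrant so each sub-square is traversed in the right orientation
            if (ry == 0) {
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                int t = x; x = y; y = t;
            }
        }
        return d;
    }
}
```

Reordering the 23 million POIs by `xy2d` of their grid cell is what puts spatially adjacent POIs on adjacent disk pages.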
23. Hilbert Curve
[Figure: 5k POI, DB ordering vs. Hilbert-curve ordering]
24. The truth
(OS file cache not cleaned)
#          Distance    Total Time  DB Parameters  DB Time  Candidate size  Cache hit ratio
first run  100         47s         4              4.7ms    80
           100, cache  41s         4              4.1ms    80              5% (1679/40986)
first run  100, cache  48s         4              4.8ms    80              5%
           100         41s         4              4.1ms    80
           500, cache              37             11ms     474             11%
           500                     37             18ms     474
Revised assumption:
Building a local cache and processing POIs in Hilbert-curve order does some (not great) good, when the data is not too sparse.
25. Summary
• The SQL itself is very simple – no tuning point?
select * from us_ta_1 where node_index in ( ?, ?, ?... )
• Multi-threading is necessary to increase throughput
– Separate dedup and DB query (dedup is also time-consuming when the candidate size is big)
26. Jump out of the box
• A new <node, POI> table
• NoSQL storage with spatial support for <node, POI>
• CoSE to search candidates
• Hadoop (Map-Reduce)
27. Performance Tuning Tips
• Test to verify assumptions
• Make the environment as close to production as possible
– Do not mock
– Do not talk to the US DB from CN
• Repeat tests to get consistent, reproducible results
• Do not dismiss any anomalous case (e.g. the first run being slower than later runs)
• Consider both the (MySQL) client side and the server side