MySQL Story in POI Dedup
A SQL tuning case

1. MySQL Story in POI Dedup
2. Outline
   • Problem
   • Proposal
   • Test & Verify
3. Problem
   Pipeline: Update -> Deduping -> Add
   • Daily incremental: 1 million POIs
   • Master DB: 23 million POIs
4. Problem
   • Process each target POI:
     1) Get the candidates {POI: distance < 100 meters} from the Master DB
        a. Use the grid index
     2) Compare the target with the candidates
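The grid-index candidate fetch in step 1 can be sketched as follows. This is a minimal illustration, assuming square lat/lon cells of roughly 100 m and a 3x3 neighborhood scan; the cell size, the `COLS` constant, and the `GridIndex` class and method names are assumptions for this sketch, not the deck's actual `node_index` scheme.

```java
import java.util.ArrayList;
import java.util.List;

public class GridIndex {
    static final double CELL_DEG = 0.001; // ~111 m of latitude per cell (assumed cell size)
    static final int COLS = 360_000;      // cells per latitude row at this cell size

    // Map a coordinate to a single cell id (a stand-in for node_index).
    static long cellId(double lat, double lon) {
        long row = (long) Math.floor((lat + 90.0) / CELL_DEG);
        long col = (long) Math.floor((lon + 180.0) / CELL_DEG);
        return row * COLS + col;
    }

    // Candidate cells for a <100 m search: the target's cell plus its 8 neighbors.
    // These cell ids would then go into the "node_index in (...)" query.
    static List<Long> candidateCells(double lat, double lon) {
        List<Long> cells = new ArrayList<>();
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++)
                cells.add(cellId(lat + dr * CELL_DEG, lon + dc * CELL_DEG));
        return cells;
    }
}
```

The 3x3 neighborhood is enough only because the search radius is smaller than one cell; a larger radius would need a wider ring of cells.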
5. Problem
   • DB queries are time-consuming, per the Content Team's experience:
     - At 10 ms/POI, 1 million POIs need 2.7 hours of DB query time
     - At 100 ms/POI, 1 million POIs need 27 hours of DB query time
   • And it's a daily update!
6. Proposal
   • Build a local cache
   • Multi-threading (multiple boxes, Map-Reduce)
   • Separate the DB query from the dedup computation
   • Single-SQL tuning
7. Single SQL Run: DAL vs. JDBC

   // DAL
   CpPoiWorkDao dao = CpPoiWorkDao.getInstance();
   List<IPoi> poi = dao.getAllPois(PoiDataSet.CS, Configs.isActive);

   // JDBC
   Statement statement = connect.createStatement();
   ResultSet rs = statement.executeQuery("select * from cs_1");

   Run times (42,985 rows each):
   com.telenav.content.impl.JdbcPoiLoader   0:00:04.062
   com.telenav.content.impl.PoiLoader       0:00:10.969
8. First Declaration
   First declaration: DAL is slower than JDBC; there is a performance loss in DAL.
9. The Truth
   • DAL needs a 'warm-up' (one extra query):

     select id as id, table_set_name as table_set_name,
            current_work_suffix as current_work_suffix,
            current_live_suffix as current_live_suffix,
            table_set_size as table_set_size,
            update_time as update_time, create_time as create_time
     from active_table where table_set_name = ?

   Run           JDBC          DAL
   First run     0:00:04.125   0:00:09.360
   2             3187 ms       4797 ms
   3             3297 ms       4672 ms
   4             3265 ms       4828 ms
   5             3297 ms       4828 ms
   6             3344 ms       4891 ms
10. Second SQL Run

    select POI_RECORD_ID, POI_ID, LATITUDE, …, locality, locale
    from us_ta_1
    where node_index in ( ?, ?, … ? )

    Run           JDBC     DAL
    First run     375 ms   1156 ms
    2             406 ms   313 ms
    3             375 ms   281 ms
    4             391 ms   375 ms
    5             375 ms   266 ms
    6             406 ms   297 ms

    First declaration, "DAL is slower than JDBC", does not hold: after warm-up,
    DAL is on par with or faster than raw JDBC.
11. Benchmark Data
    • It's slow, but how slow?
      - A single SQL run is only a smoke test; we want real data.
12. Benchmark Data
    • Test case: run 10k POIs; for each POI:
      • DedupWorkPoiDao.getAdjacentDedupPois to get candidate POIs for matching
      • IDeDuper.getDuplications(target, candidate) to find matches among the candidates
      • 100-meter radius
      • Repeat the test 3 times
    • Test result:
      Process time: 0:01:46 (10 ms/POI)   DB time: 4 ms   Dedup time: 6 ms
      Candidate set size: 51              Matched ratio: 0.63 (6,387 POIs matched)
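The "repeat the test 3 times" step above can be sketched as a tiny timing harness; `Bench` and `time` are hypothetical names, not from the deck's codebase. The point is that the first element of the result carries warm-up effects (OS and MySQL caches, lazy initialization), which is exactly what the following slides dig into.

```java
import java.util.concurrent.TimeUnit;

public class Bench {
    // Run the workload `repeats` times and return per-run wall-clock millis.
    // millis[0] includes warm-up effects; later entries reflect cached behavior.
    static long[] time(Runnable work, int repeats) {
        long[] millis = new long[repeats];
        for (int i = 0; i < repeats; i++) {
            long t0 = System.nanoTime();
            work.run();
            millis[i] = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
        }
        return millis;
    }
}
```

Reporting first-run and later-run times separately, as the deck's tables do, avoids averaging away the cold-cache case.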
13. Second Declaration
    Second declaration: dedup is the dominant cost in the process; the DB is not the bottleneck.
14. The Truth
    • The DB is fast because of caching.

      Distance  Process time             DB params            DB time  Dedup time  Candidate size  Matched ratio
      100 m     total 2min30s, 14 ms     4                    4 ms     9 ms        80              0.681
      500 m     total 30m, 180 ms        37                   128 ms   51 ms       474             0.792
      500 m     total 11m38s, 69 ms      37 nodes per query   18 ms    51 ms       474             -

    • At 500 m, each POI must be compared with 474 candidates.
    • The second (later) run is much faster than the first run.
15. The Truth
    • Clean the MySQL cache and restart MySQL:
      - key_buffer_size: 500 MB -> 8 bytes
      - query_cache_size: 64 MB -> 0
    • No effect; the DB query is still fast.
      - The first-run time cannot be reproduced for the same data set.
16. The Truth
    • Clean the OS (Linux) file cache:
      - echo 3 > /proc/sys/vm/drop_caches
    • Test result:
      Process time             DB time   Dedup time  Candidate size  Matched ratio
      0:01:46 (10 ms/POI)      4 ms      6 ms        51              0.63
      0:22:58.844 (DB only)    137 ms    (removed)   51              -
    • 30x slower when the OS file cache is cleaned.
    • So the second declaration ("dedup is the dominant cost; the DB is not the
      bottleneck") does not hold.
17. MySQL Index Preloading
    • MySQL index preloading:
      - key_buffer_size 4096m
      - load index into cache us_ta_1 (INDEX_NODEX_INDEX);
    • Nearly no effect; the DB query time is nearly the same.
18. The Data File Is the Bottleneck
    • The key index does not seem to help, so the bottleneck is in data-file
      reading (an assumption).
    • Verify:
      1) Reorder the 23 million records (using a Hilbert-curve ordering) so that
         neighboring POIs are also adjacent on disk, reducing disk seeks
      2) Build a new table where each row is <node, POIs in the node>, reducing
         the number of I/O operations needed to read one node's POIs
19. The Data File Is the Bottleneck
    • Re-order the POIs in the DB:
      insert into us_ta_2 (select * from us_ta_1 order by node_index)
    • Test result:
      Run                                  Process time          DB time  Dedup time  Candidate size
      Baseline (warm cache)                0:01:46 (10 ms/POI)   4 ms     6 ms        51
      First run, original order (DB only)  0:22:58.844           137 ms   (removed)   51
      First run, reordered (DB only)       0:03:10.985           19 ms    -           51
      Later run, reordered (DB only)       0:00:46.360           4 ms     (removed)   51
20. Multiple Threads
    • DB only:
      Threads   Process time (DB only)   DB time
      1         0:03:10.985              19 ms
      4         0:01:05.406              24 ms
      8         0:00:38.328              29 ms
    • DB & dedup:
      Configuration                  Process time   DB time  Dedup time  Candidate size  Matched ratio
      1 thread                       0:04:07.125    18 ms    5 ms        51              0.6387
      4 DB threads, 2 dedup threads  0:01:11.328    25 ms    9 ms        51              0.6387
      4 DB threads, 1 dedup thread   0:01:22.953    28 ms    7 ms        51              0.6387
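The "separate DB query and dedup" threading measured above can be sketched with two fixed-size pools, one fetching candidates and one comparing them. `Pipeline` and the stubbed fetch/dedup bodies are illustrative assumptions, not the deck's actual code; the real DB task would run the `node_index in (...)` query.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class Pipeline {
    // Process each POI with a DB pool that fetches candidates and a dedup pool
    // that compares them; both pool sizes are tunable, as in the table above.
    public static int run(List<Integer> pois, int dbThreads, int dedupThreads) {
        ExecutorService db = Executors.newFixedThreadPool(dbThreads);
        ExecutorService dedup = Executors.newFixedThreadPool(dedupThreads);
        AtomicInteger matched = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(pois.size());
        for (int poi : pois) {
            db.submit(() -> {
                // Stub for "select ... where node_index in (...)".
                List<Integer> candidates = List.of(poi);
                dedup.submit(() -> {
                    if (!candidates.isEmpty()) matched.incrementAndGet(); // stub dedup compare
                    done.countDown();
                });
            });
        }
        try { done.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        db.shutdown();
        dedup.shutdown();
        return matched.get();
    }
}
```

Splitting the pools lets the I/O-bound DB side and the CPU-bound dedup side scale independently, which matches the table's best result (4 DB threads, 2 dedup threads).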
21. Another Assumption
    Assumption: building a local cache and processing POIs in Hilbert-curve
    order would help greatly.
    • Cache: <node, POIs in the node>
    • DB query: get the POIs in the given nodes
    • Query flow:
      - Serve nodes that have a local cache entry from the cache
      - DB query: only the nodes that have no local cache entry
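The cache lookup described above, split into hits and misses, might look like this. `NodeCache` and its method names are illustrative; the miss list is what would go into the `node_index in (...)` DB query.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeCache {
    private final Map<Long, List<String>> cache = new HashMap<>();

    public void put(long node, List<String> pois) {
        cache.put(node, pois);
    }

    // Returns the POIs for all cached nodes; 'misses' collects the nodes that
    // still need a DB query.
    public List<String> lookup(Collection<Long> nodes, List<Long> misses) {
        List<String> hits = new ArrayList<>();
        for (long node : nodes) {
            List<String> pois = cache.get(node);
            if (pois != null) hits.addAll(pois);
            else misses.add(node);
        }
        return hits;
    }
}
```

Processing POIs in Hilbert-curve order is what makes this cache pay off: consecutive targets share grid nodes, so the miss list shrinks as the run proceeds through a dense area.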
22. Hilbert Curve
    The Hilbert curve gives a mapping between 1D and 2D space that preserves
    locality fairly well.
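For reference, the standard iterative (x, y) to Hilbert-distance mapping looks like this (the classic formulation from the Wikipedia article on Hilbert curves); `n` is the grid side length, a power of two. This is the generic algorithm, not code from the deck.

```java
public class Hilbert {
    // Convert grid cell (x, y) to its distance d along the Hilbert curve
    // covering an n-by-n grid (n must be a power of two).
    static long xy2d(long n, long x, long y) {
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = (x & s) > 0 ? 1 : 0;
            long ry = (y & s) > 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            // Rotate the quadrant so recursion sees a canonical orientation.
            if (ry == 0) {
                if (rx == 1) {
                    x = s - 1 - x;
                    y = s - 1 - y;
                }
                long t = x;
                x = y;
                y = t;
            }
        }
        return d;
    }
}
```

Sorting POIs by `xy2d` of their grid cell gives the "Hilbert Curve Ordering" compared on the next slide: cells that are near each other in 2D mostly end up near each other in the 1D order, and therefore near each other on disk.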
23. Hilbert Curve
    [Figure: 5k POIs plotted in DB ordering vs. Hilbert-curve ordering]
24. The Truth
    (OS file cache is not cleaned)

    Run                        Distance  Total time  DB params  DB time  Candidate size  Cache hit ratio
    First run                  100 m     47s         4          4.7 ms   80              -
    With cache                 100 m     41s         4          4.1 ms   80              5% (1679/40986)
    First run, with cache      100 m     48s         4          4.8 ms   80              5%
    No cache                   100 m     41s         4          4.1 ms   80              -
    With cache                 500 m     -           37         11 ms    474             11%
    No cache                   500 m     -           37         18 ms    474             -

    Revised assumption: a local cache plus Hilbert-curve processing order gives
    some help (not a great deal) when the data is not too sparse.
25. Summary
    • The SQL itself is very simple; is there really no tuning point?
      select * from us_ta_1 where node_index in ( ?, ?, ?... )
    • Multi-threading is necessary to increase throughput.
      - Separate dedup from the DB query (dedup is also time-consuming when the
        candidate set is big).
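One small tuning point the earlier tables hint at (the 500 m run got much faster with 37 nodes per query) is batching the IN-list. A sketch of building such a query with a fixed placeholder count, padding with the last node id so one PreparedStatement shape can be reused across batches; `NodeQuery` and the padding trick are assumptions of this sketch, not something the deck prescribes.

```java
import java.util.ArrayList;
import java.util.List;

public class NodeQuery {
    // Build the IN-list SQL with a fixed number of placeholders.
    static String sql(int placeholders) {
        StringBuilder sb = new StringBuilder("select * from us_ta_1 where node_index in (");
        for (int i = 0; i < placeholders; i++) sb.append(i == 0 ? "?" : ", ?");
        return sb.append(")").toString();
    }

    // Pad the node list up to the placeholder count; duplicate values in an
    // IN-list are harmless, so the padded query returns the same rows.
    static List<Long> pad(List<Long> nodes, int placeholders) {
        List<Long> padded = new ArrayList<>(nodes);
        while (padded.size() < placeholders) padded.add(nodes.get(nodes.size() - 1));
        return padded;
    }
}
```

A fixed shape avoids generating a new statement string (and a new server-side parse) for every distinct batch size.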
26. Jump Out of the Box
    • A new <node, POI> table
    • NoSQL storage with spatial support for <node, POI>
    • CoSE to search candidates
    • Hadoop (Map-Reduce)
27. Performance Tuning Tips
    • Test to verify every assumption.
    • Make the environment as close to real as possible:
      - Do not mock.
      - Do not talk to the US DB from CN.
    • Repeat tests to get a coherent result (results must be reproducible).
    • Do not miss any exceptional case (e.g., the first run is slower than later runs).
    • Consider both the (MySQL) client and server sides.
