MySQL story in POI dedup

Outline
• Problem
• Proposal
• Test & Verify

Problem

[Diagram labels: Deduping, Add, Update]

• Daily incremental: 1 million POI
• Master DB: 23 million POI

Problem
• Process
For each POI (target):
1) Get candidates {POI: distance < 100 meters} from the Master DB
a. Use the grid index
2) Compare the target with the candidates
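The two steps above can be sketched in memory as follows. This is a minimal illustration, not the Master DB's actual grid index: the class name, the 0.001-degree cell size, the key packing, and the spherical-earth distance are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory grid index; cell size and key layout are illustrative. */
class GridIndexSketch {
    static final double CELL_DEG = 0.001;            // assumed cell size in degrees
    static final double EARTH_RADIUS_M = 6_371_000;  // spherical-earth approximation
    static final double METERS_PER_DEG = Math.toRadians(1) * EARTH_RADIUS_M;

    static final class Poi {
        final long id; final double lat, lon;
        Poi(long id, double lat, double lon) { this.id = id; this.lat = lat; this.lon = lon; }
    }

    private final Map<Long, List<Poi>> cells = new HashMap<>();

    static long row(double lat) { return (long) Math.floor(lat / CELL_DEG); }
    static long col(double lon) { return (long) Math.floor(lon / CELL_DEG); }
    // Pack row and column into one key; both fit comfortably in 32 bits.
    static long key(long row, long col) { return (row << 32) | (col & 0xFFFFFFFFL); }

    void add(Poi p) {
        cells.computeIfAbsent(key(row(p.lat), col(p.lon)), k -> new ArrayList<>()).add(p);
    }

    /** Step 1: collect POI from nearby cells, then keep those closer than maxMeters. */
    List<Poi> candidates(Poi target, double maxMeters) {
        // How many cells the search radius can span (longitude cells shrink with latitude;
        // this approximation is not meant for use near the poles).
        long rowSpan = (long) Math.ceil(maxMeters / (METERS_PER_DEG * CELL_DEG));
        double cosLat = Math.cos(Math.toRadians(target.lat));
        long colSpan = (long) Math.ceil(maxMeters / (METERS_PER_DEG * cosLat * CELL_DEG));
        List<Poi> out = new ArrayList<>();
        long r = row(target.lat), c = col(target.lon);
        for (long dr = -rowSpan; dr <= rowSpan; dr++) {
            for (long dc = -colSpan; dc <= colSpan; dc++) {
                for (Poi p : cells.getOrDefault(key(r + dr, c + dc), List.of())) {
                    if (p.id != target.id && distanceMeters(target, p) < maxMeters) {
                        out.add(p);
                    }
                }
            }
        }
        return out;
    }

    /** Haversine great-circle distance in meters. */
    static double distanceMeters(Poi a, Poi b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
    }
}
```

Step 2 (comparing the target with each candidate) would then run only on the returned list, which is what keeps the comparison cost bounded.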

Problem
• DB queries are time-consuming, according to the Content Team's experience

10 ms/POI: 1 million POI need about 2.8 hours (DB query)
100 ms/POI: 1 million POI need about 27.8 hours (DB query). And it's a daily update!
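The arithmetic behind these two figures is just POI count times per-POI query time. A small sketch (the class and method names are made up for illustration):

```java
import java.time.Duration;

class DailyQueryBudget {
    /** Total sequential DB time for poiCount lookups at millisPerPoi each. */
    static Duration totalQueryTime(long poiCount, long millisPerPoi) {
        return Duration.ofMillis(poiCount * millisPerPoi);
    }

    public static void main(String[] args) {
        System.out.println(totalQueryTime(1_000_000, 10));   // 10,000 s, under 3 hours
        System.out.println(totalQueryTime(1_000_000, 100));  // 100,000 s, more than a day
    }
}
```

At 100 ms per POI a single sequential pass no longer fits inside the 24-hour window of a daily update.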

Proposal
• Build a local cache
• Multi-threading (multiple boxes, Map-Reduce)
• Separate DB query from dedup computation
• Single SQL tuning

Single SQL Run: DAL vs. JDBC
//DAL
CpPoiWorkDao dao = CpPoiWorkDao.getInstance();
List<IPoi> poi = dao.getAllPois(PoiDataSet.CS, Configs.isActive);

//JDBC
Statement statement = connect.createStatement();
ResultSet rs = statement.executeQuery("select * from cs_1");

//running
com.telenav.content.impl.JdbcPoiLoader 0:00:04.062 42985
com.telenav.content.impl.PoiLoader 0:00:10.969 42985

First Declaration

First Declaration: DAL is slower than JDBC; there is a performance loss in DAL

The truth
• DAL needs a 'warm-up' (it issues one extra query on first use)
select id as id, table_set_name as table_set_name, current_work_suffix as current_work_suffix,
current_live_suffix as current_live_suffix, table_set_size as table_set_size, update_time as update_time,
create_time as create_time from active_table where table_set_name=?

Run         JDBC          DAL
First run   0:00:04.125   0:00:09.360
2           3187          4797
3           3297          4672
4           3265          4828
5           3297          4828
6           3344          4891
(runs 2-6 presumably in ms)
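A timing harness that reports the first run separately makes the warm-up effect visible instead of folding it into an average. A minimal sketch, with hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.List;

class WarmupTimer {
    /** Runs the task `runs` times and returns each run's elapsed milliseconds. */
    static List<Long> timeRuns(Runnable task, int runs) {
        List<Long> millis = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis.add((System.nanoTime() - start) / 1_000_000);
        }
        return millis;
    }

    /** Average of all runs except the first, which carries the warm-up cost. */
    static double steadyStateAverage(List<Long> millis) {
        return millis.stream().skip(1).mapToLong(Long::longValue).average().orElse(Double.NaN);
    }
}
```

Applied to the DAL column above, the steady-state average excludes the 9.36 s first run and reflects only runs 2 to 6.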

Second SQL Running
select POI_RECORD_ID, POI_ID, LATITUDE, …, locality, locale
from us_ta_1
where node_index in ( ?, ?, … ? )

Run         JDBC   DAL
First run   375    1156
2           406    313
3           375    281
4           391    375
5           375    266
6           406    297
(presumably in ms)
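With plain JDBC, a query of the form `where node_index in ( ?, ?, … ? )` needs one placeholder per node. The sketch below builds that statement. It uses `select *` because the column list on the slide is elided, and it assumes `node_index` is an integer column bound with `setLong`; the class and method names are illustrative.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

class NodeQuerySketch {
    /** Builds the SQL text with one "?" per node. */
    static String buildSql(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("need at least one node");
        }
        String placeholders = String.join(", ", Collections.nCopies(nodeCount, "?"));
        return "select * from us_ta_1 where node_index in (" + placeholders + ")";
    }

    /** Prepares the statement and binds every node index. */
    static PreparedStatement prepare(Connection conn, List<Long> nodes) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(buildSql(nodes.size()));
        for (int i = 0; i < nodes.size(); i++) {
            ps.setLong(i + 1, nodes.get(i)); // JDBC parameter indexes are 1-based
        }
        return ps;
    }
}
```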

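First Declaration (questioned): DAL is slower than JDBC; there is a performance loss in DAL. After warm-up the gap shrinks, and on the second SQL the DAL is no slower than JDBC.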

Benchmark Data
• It's slow, but how slow?
– A single SQL is only a smoke test; we want real data

Benchmark Data
• Test Case
• Run 10k POI; for each POI:
• DedupWorkPoiDao.getAdjacentDedupPois gets the candidate POI for matching
• IDeDuper.getDuplications(target, candidate) finds the matches among the candidates
• Distance: 100 meters
• Repeat the test 3 times

• Test Result

Process Time (total, per POI)   DB Time   Dedup Time   Dedup candidate POI size   Matched POI ratio
0:01:46, 10ms                   4ms       6ms          51                         0.63

6387 POI were matched

Second Declaration

Second Declaration:
Dedup is the most important factor in the process; the DB is not the bottleneck

The truth
• DB is fast because of cache

#   Distance   Process Time (total, per POI)   DB Parameters   DB Time   Dedup Time   Dedup candidate POI size   Matched POI ratio
    100        2min30s, 14ms                   4               4ms       9ms          80                         0.68
1   500        30m, 180ms                      37              128ms     51ms         474                        0.79
2   500        11m38s, 69ms                                    18ms      51ms

At 500 meters: 37 nodes in a single query, and each POI needs to be compared with 474 candidates

The second (later) run is much faster than the first run

The truth
• Clean the MySQL caches & restart MySQL
– key_buffer_size 500m -> 8 byte
– query_cache_size 64m -> 0

• No effect; the DB query is still fast.
– The first-run time cannot be reproduced for the same data set.

The truth
• Clean the OS (Linux) file cache
– echo 3 > /proc/sys/vm/drop_caches

• Test Result

Process Time (total, per POI)   DB Time   Dedup Time    Dedup candidate POI size   Matched POI ratio
0:01:46, 10ms                   4ms       6ms           51                         0.63
0:22:58.844 (db only)           137ms     / (removed)   51                         /

About 30 times slower when the OS file cache is cleaned
Second Declaration (questioned):
Dedup is the most important factor in the process; the DB is not the bottleneck. This holds only while the OS file cache is warm.

MySQL Index Preloading
• MySQL index preloading
– key_buffer_size 4096m
– load index into cache us_ta_1 index (INDEX_NODEX_INDEX);

• Almost no effect; the DB query time is nearly the same.

Data file is the bottleneck
• The key index does not seem to help, so the bottleneck is probably in reading the data file (an assumption)
• Verify
– 1) Reorder the 23 million records along a Hilbert curve, so that neighboring POI are also adjacent on disk, reducing disk seeks
– 2) Build a new table where each row is <node, POI in the node>, reducing the number of I/O operations needed to read one node's POI

Data file is the bottleneck
• Re-order POI in DB
insert into us_ta_2 (select * from us_ta_1 order by node_index)

• Test Result

Run         Process Time (total, per POI)   DB Time   Dedup Time    Dedup candidate POI size   Matched POI ratio
            0:01:46, 10ms                   4ms       6ms           51                         0.63
First run   0:22:58.844 (db only)           137ms     / (removed)   51                         /
First run   0:03:10.985 (db only)           19ms      /             51                         /
            0:00:46.360 (db only)           4ms       / (removed)   51                         /

Multi-threading
• DB

Threads     Process Time (db only)   DB Time   Dedup Time   Dedup candidate POI size   Matched POI ratio
1 thread    0:03:10.985              19ms      /            51                         /
4 threads   0:01:05.406              24ms      /            51                         /
8 threads   0:00:38.328              29ms      /            51                         /

• DB & Dedup

Threads                         Process Time   DB Time   Dedup Time   Dedup candidate POI size   Matched POI ratio
1 thread                        0:04:07.125    18ms      5ms          51                         0.6387
4 threads DB, 2 threads dedup   0:01:11.328    25ms      9ms          51                         0.6387
4 threads DB, 1 thread dedup    0:01:22.953    28ms      7ms          51                         0.6387
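One way to run DB queries and dedup on separately sized thread pools, as in the "DB & Dedup" rows above, is to chain two asynchronous stages. This is a sketch under assumptions: the `fetch` and `match` functions stand in for the candidate query and the duplicate matcher, and the class name and generic signature are invented for the example rather than taken from the project code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiFunction;
import java.util.function.Function;

class DedupPipelineSketch {
    /**
     * fetch: target -> candidates (DB-bound stage, runs on dbThreads threads).
     * match: (target, candidates) -> result (CPU-bound stage, runs on dedupThreads threads).
     * Results are returned in the same order as the targets.
     */
    static <T, C, R> List<R> run(List<T> targets,
                                 Function<T, C> fetch,
                                 BiFunction<T, C, R> match,
                                 int dbThreads, int dedupThreads) {
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);
        ExecutorService dedupPool = Executors.newFixedThreadPool(dedupThreads);
        try {
            List<CompletableFuture<R>> futures = new ArrayList<>();
            for (T target : targets) {
                futures.add(CompletableFuture
                        .supplyAsync(() -> fetch.apply(target), dbPool)
                        .thenApplyAsync(candidates -> match.apply(target, candidates), dedupPool));
            }
            List<R> results = new ArrayList<>();
            for (CompletableFuture<R> future : futures) {
                results.add(future.join()); // waits for both stages of this target
            }
            return results;
        } finally {
            dbPool.shutdown();
            dedupPool.shutdown();
        }
    }
}
```

Sizing the two pools independently allows combinations such as 4 DB threads with 2 dedup threads to be compared directly. The sketch submits every target up front, so a real run over a million POI would need batching or a bounded queue.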

Another assumption

Assumption:
Building a local cache and processing POI in Hilbert curve order would help a great deal

Cache:
<node, POI in the node>

DB Query:
Get the POI in the given nodes

Query:
- Serve nodes that are in the local cache from the cache
- DB query: only the nodes that are not in the local cache
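The query strategy above (cache first, DB only for the missing nodes) can be sketched as follows. The bulk loader function stands in for the real DB query, and the class name, the unbounded `HashMap`, and the hit counters are assumptions for illustration; the sketch is not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class NodeCacheSketch<P> {
    private final Map<Long, List<P>> cache = new HashMap<>();
    // Hypothetical bulk loader: given node indexes, returns the POI of each node.
    private final Function<List<Long>, Map<Long, List<P>>> dbQuery;
    int hits;
    int lookups;

    NodeCacheSketch(Function<List<Long>, Map<Long, List<P>>> dbQuery) {
        this.dbQuery = dbQuery;
    }

    /** Returns the POI in the given nodes, querying the DB only for uncached nodes. */
    List<P> poisIn(List<Long> nodes) {
        List<P> result = new ArrayList<>();
        List<Long> missing = new ArrayList<>();
        for (Long node : nodes) {
            lookups++;
            List<P> cached = cache.get(node);
            if (cached != null) {
                hits++;
                result.addAll(cached);
            } else {
                missing.add(node);
            }
        }
        if (!missing.isEmpty()) {
            Map<Long, List<P>> loaded = dbQuery.apply(missing); // one query for all misses
            for (Long node : missing) {
                List<P> pois = loaded.getOrDefault(node, List.of());
                cache.put(node, pois); // empty nodes are cached too
                result.addAll(pois);
            }
        }
        return result;
    }
}
```

The ratio `hits / lookups` corresponds to a cache hit ratio. Processing targets in an order that keeps consecutive POI spatially close is what gives the cache a chance to hit.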

Hilbert Curve

Gives a mapping between 1D and 2D space that preserves locality fairly well.
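For reference, the standard iterative conversion from a 2D grid cell to its 1D position on the Hilbert curve looks like this. The class name is illustrative, and how POI coordinates are mapped to grid cells is left open here because the slides do not specify it.

```java
class HilbertSketch {
    /**
     * Maps cell (x, y) of an n-by-n grid to its distance along the Hilbert curve.
     * n must be a power of two; x and y must be in [0, n).
     */
    static long xy2d(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            // Rotate or flip the quadrant so the sub-curve has the right orientation.
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                int t = x;
                x = y;
                y = t;
            }
        }
        return d;
    }
}
```

Sorting POI by `xy2d` of their grid cell yields a processing order in which consecutive POI tend to be spatial neighbors.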

Hilbert Curve
[Figure: 5k POI, DB ordering vs. Hilbert curve ordering]

The truth
OS file cache was not cleaned

#           Distance     Total Time   DB Parameters   DB Time   Dedup candidate POI size   Cache hit ratio
first run   100          47s          4               4.7ms     80
            100, cache   41s          4               4.1ms     80                         5% (1679/40986)
first run   100, cache   48s          4               4.8ms     80                         5%
            100          41s          4               4.1ms     80
            500, cache                37              11ms      474                        11%
            500                       37              18ms      474

Assumption (revised):
Building a local cache and processing POI in Hilbert curve order helps somewhat, not greatly, and only when the data is not too sparse

Summary
• The SQL itself is very simple; there is little to tune
select * from us_ta_1 where node_index in ( ?, ? , ?...)

• Multi-threading is necessary to increase throughput
– Separate dedup and DB query (dedup is also time-consuming when the candidate set is large)

Think outside the box
• A new <node, POI> table
• NoSQL storage with spatial support <node, POI>
• CoSE to search candidates
• Hadoop (Map-Reduce)

Performance Tuning Tips
• Test to verify every assumption
• Make the environment as close to the real one as possible
– Do not mock
– Do not talk to a US DB from CN
• Repeat each test until the result is consistent (the result can be reproduced)
• Do not miss any exceptional case (the first run is slower than later runs)
• Consider both the (MySQL) client side and the server side
