Mysql Story in POI Dedup
Outline
• Problem
• Proposal
• Test & Verify
Problem


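[Diagram: the daily incremental feed is deduped against the MasterDB; each incoming POI ends up as an Update or an Add.]

Daily Incremental: 1 million POI        MasterDB: 23 million POI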
Problem
• Process
  For each POI (target):
      1) Get candidates {POI: distance < 100 meters} from Master DB
           a. Use grid index
      2) Compare target with candidates
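
The two steps above can be sketched roughly as follows. This is only an illustration: the cell size, the node index encoding, and the names GridCandidates, Poi, coveringNodes and withinRadius are assumptions, since the deck does not show how the real grid index is built.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of grid-based candidate lookup; not the deck's actual DAO. */
final class GridCandidates {
    static final double CELL_DEG = 0.002;           // assumed cell size in degrees
    static final long COLS = 1_000_000L;            // assumed row stride, larger than any column number
    static final double EARTH_RADIUS_M = 6_371_000.0;

    static final class Poi {
        final long id; final double lat, lon;
        Poi(long id, double lat, double lon) { this.id = id; this.lat = lat; this.lon = lon; }
    }

    static long row(double lat) { return (long) Math.floor((lat + 90.0) / CELL_DEG); }
    static long col(double lon) { return (long) Math.floor((lon + 180.0) / CELL_DEG); }

    /** Packs grid row and column into a single node index (assumed encoding). */
    static long nodeIndex(double lat, double lon) { return row(lat) * COLS + col(lon); }

    /** Step 1a: all grid nodes overlapping the bounding box of the search radius. */
    static List<Long> coveringNodes(double lat, double lon, double radiusM) {
        double dLat = Math.toDegrees(radiusM / EARTH_RADIUS_M);
        double dLon = dLat / Math.cos(Math.toRadians(lat));
        List<Long> nodes = new ArrayList<>();
        for (long r = row(lat - dLat); r <= row(lat + dLat); r++)
            for (long c = col(lon - dLon); c <= col(lon + dLon); c++)
                nodes.add(r * COLS + c);
        return nodes;
    }

    /** Haversine great-circle distance in meters. */
    static double distanceMeters(Poi a, Poi b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                 * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
    }

    /** Step 1: keep only fetched POI that lie within the radius of the target. */
    static List<Poi> withinRadius(Poi target, List<Poi> fetched, double radiusM) {
        List<Poi> out = new ArrayList<>();
        for (Poi p : fetched)
            if (p.id != target.id && distanceMeters(target, p) < radiusM) out.add(p);
        return out;
    }
}
```

The node list from coveringNodes would be the values bound to the "node_index in (...)" query; step 2 (the actual comparison) then runs only on the POI returned by withinRadius.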
Problem
• DB access is time-consuming, according to the Content
  Team's experience


10 ms/POI: 1 million POI need about 2.8 hours (DB query)
100 ms/POI: 1 million POI need about 28 hours (DB query), and it's a daily update!
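
The estimates above are just POI count times per-POI latency for a single sequential worker. A minimal sketch of that arithmetic (the class and method names are made up for illustration):

```java
import java.time.Duration;

/** Hypothetical helper: wall-clock budget of a sequential run. */
final class DedupTimeBudget {
    /** Total time = number of POI * per-POI latency, with overflow checking. */
    static Duration sequentialRun(long poiCount, long millisPerPoi) {
        return Duration.ofMillis(Math.multiplyExact(poiCount, millisPerPoi));
    }

    public static void main(String[] args) {
        for (long ms : new long[] {10, 100}) {
            Duration d = sequentialRun(1_000_000L, ms);
            System.out.printf("%d ms/POI -> %.1f hours%n", ms, d.toMillis() / 3_600_000.0);
        }
    }
}
```

At 100 ms/POI the run takes longer than a day, so a daily update cannot keep up without either a faster query or parallelism.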
Proposal
• Build Local Cache
• Multiple-Thread (Multiple-Boxes, Map-Reduce)
• DB Query and Dedup computation separation
• Single SQL Tuning
Single SQL Running: DAL VS JDBC
//DAL
CpPoiWorkDao dao = CpPoiWorkDao.getInstance();
List<IPoi> poi = dao.getAllPois(PoiDataSet.CS, Configs.isActive);

//JDBC ("connect" is an already-open java.sql.Connection)
Statement statement = connect.createStatement();
ResultSet rs = statement.executeQuery("select * from cs_1");

//running
com.telenav.content.impl.JdbcPoiLoader   0:00:04.062 42985
com.telenav.content.impl.PoiLoader       0:00:10.969 42985
First Declaration

First Declaration: DAL is slower than JDBC; there is a performance loss in DAL
The truth
   • DAL needs a 'warm-up': its first call issues one more query
select id as id, table_set_name as table_set_name, current_work_suffix as current_work_suffix,
current_live_suffix as current_live_suffix, table_set_size as table_set_size, update_time as update_time,
create_time as create_time from active_table where table_set_name=?

     Run          JDBC           DAL
     First run    0:00:04.125    0:00:09.360
     2            3187           4797
     3            3297           4672
     4            3265           4828
     5            3297           4828
     6            3344           4891
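
A timing harness that keeps the first run separate makes this kind of warm-up effect visible. The sketch below is an assumption about how such a measurement could be done, not the harness used for the numbers above; WarmupBenchmark and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical harness: time repeated runs and report the first run separately. */
final class WarmupBenchmark {
    /** Runs the task `runs` times and returns each elapsed time in milliseconds. */
    static List<Long> timeRuns(Runnable task, int runs) {
        List<Long> elapsed = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed.add((System.nanoTime() - start) / 1_000_000L);
        }
        return elapsed;
    }

    /** Mean of all runs except the first, which is treated as warm-up. */
    static double steadyStateMean(List<Long> elapsedMs) {
        if (elapsedMs.size() < 2) throw new IllegalArgumentException("need at least 2 runs");
        long sum = 0;
        for (long ms : elapsedMs.subList(1, elapsedMs.size())) sum += ms;
        return (double) sum / (elapsedMs.size() - 1);
    }
}
```

Applied to the figures above (assuming runs 2 to 6 are in milliseconds), the steady-state mean is 3278 for JDBC and 4803.2 for DAL, so the gap is much smaller than the first run suggests.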
Second SQL Running
select POI_RECORD_ID, POI_ID, LATITUDE, …, locality, locale
from   us_ta_1
where  node_index in ( ?, ?, … ? )

     Run          JDBC    DAL
     First run    375     1156
     2            406     313
     3            375     281
     4            391     375
     5            375     266
     6            406     297

First Declaration: DAL is slower than JDBC; there is a performance loss in DAL
(Not supported by this query: after the first run, DAL is as fast as JDBC or faster.)
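
For reference, the "node_index in (?, ?, … ?)" query above needs one placeholder per node, so the SQL text depends on how many nodes are passed. A minimal JDBC sketch, assuming node_index is an integer column and using "select *" in place of the elided column list (NodeQuery is a hypothetical name):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for the variable-length IN query. */
final class NodeQuery {
    /** Builds "... where node_index in (?, ?, ...)" with one placeholder per node. */
    static String buildSql(int nodeCount) {
        if (nodeCount < 1) throw new IllegalArgumentException("need at least one node");
        String marks = String.join(", ", Collections.nCopies(nodeCount, "?"));
        return "select * from us_ta_1 where node_index in (" + marks + ")";
    }

    /** Prepares the statement and binds each node index (JDBC parameters are 1-based). */
    static PreparedStatement prepare(Connection conn, List<Long> nodes) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(buildSql(nodes.size()));
        for (int i = 0; i < nodes.size(); i++) ps.setLong(i + 1, nodes.get(i));
        return ps;
    }
}
```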
Benchmark Data
• It's slow, but how slow?
   – A single SQL is only a smoke test; we want real data
Benchmark Data
• Test Case
   • Run 10k POI; for each POI:
        • DedupWorkPoiDao.getAdjacentDedupPois to get candidate POI for matching
        • IDeDuper.getDuplications(target, candidate) to find matches among the candidates
   • Distance: 100 meters
   • Repeat the test 3 times

• Test Result

      Process Time (total, per POI)   DB Time   Dedup Time   Dedup candidate POI size   Matched POI percent
      0:01:46, 10ms                   4ms       6ms          51                         0.63

      6387 POI were matched
Second Declaration


Second Declaration:
Dedup is the most important factor in the process; the DB is not the bottleneck
The truth
    • DB is fast because of cache

#    Distance   Process Time (total, per POI)   DB Parameters   DB Time   Dedup Time   Dedup candidate POI size   Matched POI percent
     100        2min30s, 14ms                   4               4ms       9ms          80                         0.68
1    500        30m, 180ms                      37              128ms     51ms         474                        0.79
2    500        11m38s, 69ms                                    18ms      51ms

DB Parameters = 37: 37 nodes in a single query
Dedup candidate POI size = 474: each POI needs to be compared with 474 candidates

The second (later) run is much faster than the first run
The truth
• Clear the MySQL caches & restart MySQL
   – key_buffer_size 500m -> 8 bytes
   – query_cache_size 64m -> 0

• No effect: the DB query is still fast.
   – The first-run time cannot be reproduced for the
     same data set.
The truth
• Clear the OS (Linux) file cache
  – echo 3 > /proc/sys/vm/drop_caches

• Test Result
    Process Time            DB Time   Dedup Time    Dedup candidate POI size   Matched POI percent
    0:01:46, 10ms           4ms       6ms           51                         0.63
    0:22:58.844 (db only)   137ms     / (removed)   51                         /

    About 30 times slower when the OS file cache is cleared
    Second Declaration:
    Dedup is the most important factor in the process; the DB is not the bottleneck
    (No longer holds: with a cold file cache, DB time is 137ms against 6ms for dedup.)
Mysql Index Preloading
• MySQL index preloading
  – key_buffer_size 4096m
  – load index into cache us_ta_1 index (INDEX_NODEX_INDEX);

• Nearly no effect: the DB query time is nearly the same.
Data file is bottleneck
• The key index does not seem to help; the
  bottleneck appears to be data file reading (an
  assumption)
• Verify
  – 1) Reorder the 23 million records along a Hilbert curve so that
    neighboring POI are also adjacent on disk, reducing
    disk seeks
  – 2) Build a new table in which each row is <node, POI in the
    node>, reducing the I/O needed to read one node's POI
Data file is bottleneck
• Re-order POI in DB
   insert into us_ta_2 (select * from us_ta_1 order by node_index)

• Test Result
                  Process Time             DB Time   Dedup Time    Dedup candidate POI size   Matched POI percent
                  0:01:46, 10ms            4ms       6ms           51                         0.63
     First run    0:22:58.844 (db only)    137ms     / (removed)   51                         /
     First run    0:03:10.985 (db only)    19ms      /             51                         /
                  0:00:46.360 (db only)    4ms       / (removed)   51                         /
Multiple-Thread
• DB
                Process Time (db only)   DB Time   Dedup Time   Dedup candidate POI size   Matched POI percent
   1 Thread     0:03:10.985              19ms      /            51                         /
   4 Threads    0:01:05.406              24ms      /            51                         /
   8 Threads    0:00:38.328              29ms      /            51                         /

• DB & Dedup
                                   Process Time   DB Time   Dedup Time   Dedup candidate POI size   Matched POI percent
   1 Thread                        0:04:07.125    18ms      5ms          51                         0.6387
   4 threads DB, 2 threads dedup   0:01:11.328    25ms      9ms          51                         0.6387
   4 threads DB, 1 thread dedup    0:01:22.953    28ms      7ms          51                         0.6387
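
A split such as "4 threads for DB, 2 threads for dedup" can be sketched with two fixed thread pools chained through CompletableFuture. This is a minimal illustration under assumptions, not the code that produced the timings above; DedupPipeline and its parameters are hypothetical, and the two functions stand in for the DAO call and the deduper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical two-stage pipeline: DB fetch and dedup run on separate pools. */
final class DedupPipeline {
    /**
     * Runs fetchCandidates on dbThreads workers and dedup on dedupThreads workers.
     * Results are returned in the same order as the input targets.
     */
    static <T, C, R> List<R> run(List<T> targets,
                                 Function<T, C> fetchCandidates,
                                 Function<C, R> dedup,
                                 int dbThreads, int dedupThreads) {
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);
        ExecutorService dedupPool = Executors.newFixedThreadPool(dedupThreads);
        try {
            List<CompletableFuture<R>> futures = new ArrayList<>();
            for (T target : targets) {
                futures.add(CompletableFuture
                        .supplyAsync(() -> fetchCandidates.apply(target), dbPool)
                        .thenApplyAsync(dedup, dedupPool));
            }
            List<R> results = new ArrayList<>();
            for (CompletableFuture<R> f : futures) results.add(f.join());
            return results;
        } finally {
            dbPool.shutdown();
            dedupPool.shutdown();
        }
    }
}
```

Keeping the stages on separate pools lets the I/O-bound queries overlap with the CPU-bound comparison, and the two thread counts can be tuned independently.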
Another assumption

Assumption:
Building a local cache and processing POI in Hilbert curve order would help greatly



Cache:
<node, POI in the node>

DB Query:
Get POI in given nodes

Query:
- Pick up nodes that are in the local cache
- DB query: nodes that are not in the local
cache
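
The lookup described above (serve cached nodes locally, query the DB only for the rest) could look roughly like this. It is a sketch under assumptions: NodeCache is a hypothetical name, the loader function stands in for the "get POI in given nodes" query, and the cache is unbounded, which a real 23-million-POI data set would probably not allow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical <node, POI in the node> cache in front of the DB query. */
final class NodeCache<P> {
    private final Map<Long, List<P>> cache = new HashMap<>();
    private final Function<Set<Long>, Map<Long, List<P>>> dbLoader;
    private long hits, lookups;

    NodeCache(Function<Set<Long>, Map<Long, List<P>>> dbLoader) { this.dbLoader = dbLoader; }

    /** Returns the POI of all requested nodes, querying the DB only for uncached nodes. */
    List<P> poisIn(List<Long> nodes) {
        Set<Long> missing = new LinkedHashSet<>();
        for (Long node : nodes) {
            lookups++;
            if (cache.containsKey(node)) hits++; else missing.add(node);
        }
        if (!missing.isEmpty()) {
            Map<Long, List<P>> loaded = dbLoader.apply(missing);   // one query for all misses
            for (Long node : missing)
                cache.put(node, loaded.getOrDefault(node, new ArrayList<>()));
        }
        List<P> result = new ArrayList<>();
        for (Long node : nodes) result.addAll(cache.get(node));
        return result;
    }

    /** Fraction of node lookups served from the cache. */
    double hitRatio() { return lookups == 0 ? 0.0 : (double) hits / lookups; }
}
```

Processing targets in a locality-preserving order is what would make consecutive lookups hit the same nodes; with random order the hit ratio is expected to stay low.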
Hilbert Curve




A Hilbert curve gives a mapping between 1D and 2D space that preserves locality fairly well.
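
For illustration, the 2D-to-1D direction of that mapping can be computed with the well-known iterative bit-manipulation algorithm. The sketch below assumes POI coordinates have already been quantized to integer cells of an n-by-n grid, with n a power of two; how the deck's author quantized the data is not stated.

```java
/** Hilbert curve index of a grid cell (standard iterative algorithm). */
final class Hilbert {
    /**
     * Distance along the Hilbert curve of cell (x, y) in an n-by-n grid,
     * where n is a power of two and 0 <= x, y < n.
     */
    static long xy2d(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            // Rotate/flip the quadrant so the next level sees a canonical orientation.
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                int t = x; x = y; y = t;
            }
        }
        return d;
    }
}
```

Sorting POI by xy2d of their cell yields a processing order in which consecutive POI tend to be spatial neighbours, which is the property the local cache relies on.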
Hilbert Curve
[Figure: 5k POI plotted in DB ordering (left) and in Hilbert curve ordering (right)]
The truth
The OS file cache is not cleared

#            Distance      Total Time   DB Parameters   DB Time   Dedup candidate POI size   Cache hit ratio
first run    100           47s          4               4.7ms     80
             100, cache    41s          4               4.1ms     80                         5% (1679/40986)
first run    100, cache    48s          4               4.8ms     80                         5%
             100           41s          4               4.1ms     80
             500, cache                 37              11ms      474                        11%
             500                        37              18ms      474

Assumption (revised):
Building a local cache and processing POI in Hilbert curve order helps somewhat, not greatly,
when the data is not too sparse
Summary
• The SQL itself is very simple; is there anything left to tune?
              select * from us_ta_1 where node_index in ( ?, ? , ?...)

• Multiple threads are necessary to increase
  throughput
  – Separate Dedup and DB Query (Dedup is also
    time-consuming when the candidate set is large)
Think outside the box
•   A new <node, POI> table
•   NoSQL storage with spatial support <node, POI>
•   CoSE to search candidates
•   Hadoop (Map-Reduce)
Performance Tuning Tips
• Test to verify assumptions
• Make the environment as close to real as
  possible
   – Do not mock
   – Do not talk to the US DB from CN
• Repeat tests to get a consistent result (the result can be
  reproduced)
• Do not miss any exceptional case (the first run is
  slower than later runs)
• Consider both the (MySQL) client side and the server side

Mysql story in poi dedup

  • 1. Mysql Story in POI Dedup
  • 3. Problem Update Deduping Add Daily Incremental: 1 million POI MasterDB: 23 million POI
  • 4. Problem • Process POI (target) 1) Get Candidate {POI: distance < 100 meter} from Master DB a. Use Grid index 2) Compare target with Candidates
  • 5. Problem • DB is time-consuming according to Content Team experience 10ms/POI, 1 million POI need 2.7 hour (DB Query) 100ms/POI, 1 million POI need 27 hour (DB Query) – It’s daily update!
  • 6. Proposal • Build Local Cache • Multiple-Thread (Multiple-Boxes, Map- Reduce) • DB Query and Dedup computation separation • Single SQL Tuning
  • 7. Single SQL Running: DAL VS JDBC //DAL CpPoiWorkDao dao = CpPoiWorkDao.getInstance(); List<IPoi> poi = dao.getAllPois(PoiDataSet.CS, Configs.isActive); //JDBC Statement statement = connect.createStatement(); ResultSet rs = statement.executeQuery("select * from cs_1"); //running com.telenav.content.impl.JdbcPoiLoader 0:00:04.062 42985 com.telenav.content.impl.PoiLoader 0:00:10.969 42985
  • 8. First Declaration First Declaration: DAL is slower than JDBC, there are performance loss in DAL
  • 9. The truth • DAL need ‘warm up’ (one more query) select id as id, table_set_name as table_set_name, current_work_suffix as current_work_suffix, current_live_suffix as current_live_suffix, table_set_size as table_set_size, update_time as update_time, create_time as create_time from active_table where table_set_name=? JDBC DAL First run 0:00:04.125 0:00:09.360 2 3187 4797 3 3297 4672 4 3265 4828 5 3297 4828 6 3344 4891
  • 10. Second SQL Running
      select POI_RECORD_ID, POI_ID, LATITUDE, …, locality, locale from us_ta_1 where node_index in ( ?, ?, … ? )
      Run         JDBC   DAL
      First run   375    1156
      2           406    313
      3           375    281
      4           391    375
      5           375    266
      6           406    297
      First Declaration: DAL is slower than JDBC; there is a performance loss in DAL
  • 11. Benchmark Data
      • It's slow, but how slow?
          – A single SQL query is only a smoke test; we want real data
  • 12. Benchmark Data
      • Test Case
          • Run 10k POI; for each POI:
          • DedupWorkPoiDao.getAdjacentDedupPois gets the candidate POI for matching
          • IDeDuper.getDuplications(target, candidate) finds the matches among the candidates
          • Distance: 100 meters
          • Repeat the test 3 times
      • Test Result
      Process Time     DB Time   Dedup Time   Dedup candidate POI size   Matched POI Percent
      0:01:46, 10ms    4ms       6ms          51                         0.63
      6387 POI have been matched
  • 13. Second Declaration: Dedup is the most important factor in the process; the DB is not the bottleneck
  • 14. The truth
      • The DB is fast because of cache
      #   Distance   Process Time          DB Parameters   DB Time   Dedup Time   Dedup candidate POI size   Matched POI Percent
          100        total 2min30s, 14ms   4               4ms       9ms          80                         0.68
      1   500        total 30m, 180ms      37              128ms     51ms         474                        0.79
      2   500        total 11m38s, 69ms                    18ms      51ms
      At 500 meters: 37 nodes in a single query; each POI needs to be compared with 474 candidates
      The second (later) run is much faster than the first run
  • 15. The truth
      • Clean the MySQL cache & restart MySQL
          – key_buffer_size 500m -> 8 byte
          – query_cache_size 64m -> 0
      • No effect; the DB query is still fast.
          – The first-run time cannot be reproduced for the same data set.
  • 16. The truth
      • Clean the OS (Linux) file cache
          – echo 3 > /proc/sys/vm/drop_caches
      • Test Result
      Process Time             DB Time   Dedup Time    Dedup candidate POI size   Matched POI Percent
      0:01:46, 10ms            4ms       6ms           51                         0.63
      0:22:58.844 (db only)    137ms     / (removed)   51                         /
      30 times slower when the OS file cache is cleaned
      Second Declaration: Dedup is the most important factor in the process; the DB is not the bottleneck
  • 17. MySQL Index Preloading
      • MySQL index preloading
          – key_buffer_size 4096m
          – load index into cache us_ta_1 (INDEX_NODEX_INDEX);
      • Nearly no effect; the DB query time is nearly the same.
  • 18. Data file is bottleneck
      • The key index does not seem to help; the bottleneck appears to be data file reading (an assumption)
      • Verify
          – 1) Reorder the 23 million records using Hilbert order, so that neighboring POI are also adjacent on disk, reducing disk seeks
          – 2) Build a new table where each row is <node, POI in the node>, reducing the number of I/O operations needed to read one node's POI
  • 19. Data file is bottleneck
      • Re-order POI in the DB
          insert into us_ta_2 (select * from us_ta_1 order by node_index)
      • Test Result
                  Process Time             DB Time   Dedup Time    Dedup candidate POI size   Matched POI Percent
                  0:01:46, 10ms            4ms       6ms           51                         0.63
      First run   0:22:58.844 (db only)    137ms     / (removed)   51                         /
      First run   0:03:10.985 (db only)    19ms      /             51                         /
                  0:00:46.360 (db only)    4ms       / (removed)   51                         /
  • 20. Multiple-Thread
      • DB
                  Process Time (db only)   DB Time   Dedup Time   Dedup candidate POI size   Matched POI Percent
      1 Thread    0:03:10.985              19ms      /            51                         /
      4 Thread    0:01:05.406              24ms      /            51                         /
      8 Thread    0:00:38.328              29ms      /            51                         /
      • DB & Dedup
                                     Process Time   DB Time   Dedup Time   Dedup candidate POI size   Matched POI Percent
      1 Thread                       0:04:07.125    18ms      5ms          51                         0.6387
      4 Thread db, 2 thread dedup    0:01:11.328    25ms      9ms          51                         0.6387
      4 Thread db, 1 thread dedup    0:01:22.953    28ms      7ms          51                         0.6387
  • 21. Another assumption
      Assumption: building a local cache and processing POI in Hilbert curve order would help a great deal
      Cache: <node, POI in the node>
      DB Query: get the POI in the given nodes
      Query:
          - Pick up the nodes that are in the local cache
          - DB query: only the nodes that are not in the local cache
  • 22. Hilbert Curve: gives a mapping between 1D and 2D space that preserves locality fairly well.
  • 23. Hilbert Curve, 5k POI [figure: DB ordering vs. Hilbert curve ordering]
  • 24. The truth (OS file cache is not cleaned)
      #           Distance     Total Time   DB Parameters   DB Time   Dedup candidate POI size   Cache hit ratio
      first run   100          47s          4               4.7ms     80
                  100, cache   41s          4               4.1ms     80                         5% (1679/40986)
      first run   100, cache   48s          4               4.8ms     80                         5%
                  100          41s          4               4.1ms     80
                  500, cache                37              11ms      474                        11%
                  500                       37              18ms      474
      Assumption: building a local cache and processing POI in Hilbert curve order would be of some help when the data is not so sparse
  • 25. Summary
      • The SQL itself is very simple; is there nothing left to tune?
          select * from us_ta_1 where node_index in ( ?, ? , ?...)
      • Multiple threads are necessary to increase throughput
          – Separate dedup from the DB query (dedup is also time-consuming when the candidate set is big)
  • 26. Jump out of the box
      • A new <node, POI> table
      • NoSQL storage with spatial support <node, POI>
      • CoSE to search candidates
      • Hadoop (Map-Reduce)
  • 27. Performance Tuning Tips
      • Test to verify each assumption
      • Make the environment as close to the real one as possible
          – Do not mock
          – Do not talk to the US DB from CN
      • Repeat the test until the result is consistent (the result can be reproduced)
      • Do not miss any exceptional case (the first run is slower than later runs)
      • Consider both the (MySQL) client side and the server side
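
Slides 18 and 22 rely on Hilbert ordering so that POI in neighboring grid nodes also end up close together in the processing order (and on disk). The deck does not show how the Hilbert index is computed, so the following is only a minimal sketch using the widely published iterative xy-to-distance conversion. The class name HilbertOrder, the method names, and the representation of a grid node as an int[]{x, y} cell are assumptions made for illustration, not the author's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: not part of the original deck.
final class HilbertOrder {

    /**
     * Distance of grid cell (x, y) along a Hilbert curve that covers
     * an n x n grid. n must be a power of two and 0 <= x, y < n.
     */
    static long xy2d(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            // Rotate/flip the quadrant so the next level sees a canonical curve.
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                int t = x;
                x = y;
                y = t;
            }
        }
        return d;
    }

    /** Returns a copy of the cells ({x, y} pairs) sorted by Hilbert distance. */
    static List<int[]> sortByHilbert(int n, List<int[]> cells) {
        List<int[]> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator.comparingLong((int[] c) -> xy2d(n, c[0], c[1])));
        return sorted;
    }
}
```

On a 2 x 2 grid this visits the cells in the order (0,0), (0,1), (1,1), (1,0), so consecutive cells are always grid neighbors. How a node_index value maps to an (x, y) cell is not stated in the deck and would have to be supplied by the caller.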
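
The local cache proposed on slide 21 (serve nodes from a <node, POI in the node> cache and query the DB only for the nodes that are missing) could be sketched roughly as follows. Everything named here is hypothetical: NodePoiCache, the dbLoader callback (which stands in for the real "where node_index in (...)" query), and the use of Long as the node key. The sketch is also unbounded, whereas a real cache over 23 million POI would need an eviction policy.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of the <node, POI in the node> cache; P is the POI type.
final class NodePoiCache<P> {
    private final Map<Long, List<P>> cache = new HashMap<>();
    // Assumed callback: loads the POI for the given nodes from the DB in one query.
    private final Function<Set<Long>, Map<Long, List<P>>> dbLoader;
    private long lookups;
    private long hits;

    NodePoiCache(Function<Set<Long>, Map<Long, List<P>>> dbLoader) {
        this.dbLoader = dbLoader;
    }

    /** Returns the POI of all requested nodes, querying the DB only for cache misses. */
    List<P> getPois(Collection<Long> nodes) {
        List<P> result = new ArrayList<>();
        Set<Long> missing = new LinkedHashSet<>();
        for (Long node : nodes) {
            lookups++;
            List<P> cached = cache.get(node);
            if (cached != null) {
                hits++;
                result.addAll(cached);
            } else {
                missing.add(node);
            }
        }
        if (!missing.isEmpty()) {
            Map<Long, List<P>> loaded = dbLoader.apply(missing);
            for (Long node : missing) {
                // Empty nodes are cached too, so they are not queried again.
                List<P> pois = loaded.getOrDefault(node, Collections.emptyList());
                cache.put(node, pois);
                result.addAll(pois);
            }
        }
        return result;
    }

    /** Fraction of node lookups served from the cache so far. */
    double hitRatio() {
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }
}
```

The hit ratio of such a cache depends on how many nodes consecutive targets share, which is why slide 21 pairs it with Hilbert-order processing; slide 24 reports hit ratios of only 5% and 11% in the author's tests.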
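
Slides 20 and 25 recommend separating the DB query from the dedup computation and giving each its own threads (for example 4 DB threads and 2 dedup threads). One possible way to wire that up, assuming plain java.util.concurrent rather than whatever the author actually used, is two fixed thread pools chained with CompletableFuture. The class DedupPipeline and its parameters are hypothetical; fetchCandidates stands in for a call such as DedupWorkPoiDao.getAdjacentDedupPois, and dedup for IDeDuper.getDuplications.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical sketch: DB fetches and dedup work run on separate thread pools.
final class DedupPipeline {

    /**
     * For every target, fetches its candidates on the DB pool, then runs the
     * dedup step on the dedup pool. Results are returned in input order.
     */
    static <T, C, R> List<R> run(List<T> targets,
                                 Function<T, C> fetchCandidates,
                                 BiFunction<T, C, R> dedup,
                                 int dbThreads,
                                 int dedupThreads) {
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);
        ExecutorService dedupPool = Executors.newFixedThreadPool(dedupThreads);
        try {
            List<CompletableFuture<R>> futures = new ArrayList<>();
            for (T target : targets) {
                futures.add(CompletableFuture
                        .supplyAsync(() -> fetchCandidates.apply(target), dbPool)
                        .thenApplyAsync(c -> dedup.apply(target, c), dedupPool));
            }
            List<R> results = new ArrayList<>();
            for (CompletableFuture<R> f : futures) {
                results.add(f.join()); // rethrows any failure as CompletionException
            }
            return results;
        } finally {
            dbPool.shutdown();
            dedupPool.shutdown();
        }
    }
}
```

This sketch submits every target up front, which is fine for a 10k POI benchmark but would queue a million pending tasks on a full daily run; feeding the targets in batches would be the obvious adjustment there.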