MySQL Story in POI Dedup
Outline
• Problem
• Proposal
• Test & Verify
Problem

[Diagram: Daily Incremental (1 million POI) -> Deduping -> Update / Add -> MasterDB (23 million POI)]
Problem
• Process, for each incoming POI (target):
      1) Get candidates {POI: distance < 100 meters} from the Master DB
           a. Use the grid index
      2) Compare the target with the candidates
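The candidate lookup in step 1 can be sketched as follows. This is only an illustration of the idea of a grid index: the cell size, the node numbering and every name in it are assumptions, not the scheme used by the Master DB, and it ignores wrap-around at the antimeridian and the poles.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: cell size and node numbering are assumptions, not the deck's scheme.
final class GridIndexSketch {
    static final int CELLS_PER_DEG = 500;                 // assumed: 0.002 degree cells
    static final long COLS = 360L * CELLS_PER_DEG + 1;    // columns per grid row
    static final double METERS_PER_DEG_LAT = 111_320.0;   // rough spherical approximation

    static long row(double lat) { return (long) Math.floor((lat + 90.0) * CELLS_PER_DEG); }
    static long col(double lon) { return (long) Math.floor((lon + 180.0) * CELLS_PER_DEG); }
    static long nodeIndex(long row, long col) { return row * COLS + col; }

    /** Node indexes of every cell touched by the bounding box of the given radius around the target. */
    static List<Long> coveringNodes(double lat, double lon, double radiusMeters) {
        double dLat = radiusMeters / METERS_PER_DEG_LAT;
        double dLon = radiusMeters / (METERS_PER_DEG_LAT * Math.cos(Math.toRadians(lat)));
        List<Long> nodes = new ArrayList<>();
        for (long r = row(lat - dLat); r <= row(lat + dLat); r++) {
            for (long c = col(lon - dLon); c <= col(lon + dLon); c++) {
                nodes.add(nodeIndex(r, c));
            }
        }
        return nodes;
    }
}
```

The returned node list is what would be bound to a node_index in (...) query; the exact distance check against each candidate still has to happen afterwards, because the bounding box is larger than the search circle.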
Problem
• DB queries are time-consuming, according to the Content
  Team's experience


At 10ms/POI, 1 million POI need about 2.8 hours (DB query only)
At 100ms/POI, 1 million POI need about 28 hours (DB query only), and this is a daily update!
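The arithmetic behind these two estimates, as a small sketch. The class and method names are hypothetical; the only inputs are the per-POI query time and the daily POI count quoted above, and the queries are assumed to run strictly one after another.

```java
// Hypothetical helper, not part of the original project.
final class DailyQueryBudget {
    /** Hours needed to query poiCount POI sequentially at msPerPoi milliseconds each. */
    static double hours(long poiCount, double msPerPoi) {
        return poiCount * msPerPoi / 1000.0 / 3600.0;
    }

    public static void main(String[] args) {
        long dailyPois = 1_000_000L;
        for (double ms : new double[] {10, 100}) {
            System.out.printf("%.0f ms/POI -> %.1f hours%n", ms, hours(dailyPois, ms));
        }
    }
}
```

At 100ms per POI a single sequential worker cannot finish one day's increment within a day, which is why the proposals that follow target both the per-query time and the degree of parallelism.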
Proposal
• Build Local Cache
• Multi-threading (multiple boxes, Map-Reduce)
• Separate the DB query from the dedup computation
• Single SQL tuning
Single SQL Running: DAL VS JDBC
// DAL: load all POIs through the data access layer
CpPoiWorkDao dao = CpPoiWorkDao.getInstance();
List<IPoi> poi = dao.getAllPois(PoiDataSet.CS, Configs.isActive);

// JDBC: run the equivalent query directly (connect is an open java.sql.Connection)
Statement statement = connect.createStatement();
ResultSet rs = statement.executeQuery("select * from cs_1");

// Running result
com.telenav.content.impl.JdbcPoiLoader   0:00:04.062 42985
com.telenav.content.impl.PoiLoader       0:00:10.969 42985
First Declaration

DAL is slower than JDBC; there is a performance loss in DAL.
The truth
   • DAL needs a 'warm up': on first use it issues one more query
select
    id as id,
    table_set_name as table_set_name,
    current_work_suffix as current_work_suffix,
    current_live_suffix as current_live_suffix,
    table_set_size as table_set_size,
    update_time as update_time,
    create_time as create_time
from active_table
where table_set_name=?

   Elapsed time per run (first run as h:mm:ss.SSS, later runs in ms)

   Run          JDBC           DAL
   First run    0:00:04.125    0:00:09.360
   2            3187           4797
   3            3297           4672
   4            3265           4828
   5            3297           4828
   6            3344           4891
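A measurement like the one above can be taken with a small harness that times every run and reports the first run separately from the steady state. This is a minimal sketch with hypothetical names; the Runnable stands in for either the DAL call or the JDBC call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical timing harness; the Runnable stands in for the DAL or JDBC load.
final class WarmupBenchmark {
    /** Runs the task the given number of times and returns each elapsed time in ms, in order. */
    static List<Long> timeRuns(Runnable task, int runs) {
        List<Long> elapsedMs = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsedMs.add((System.nanoTime() - start) / 1_000_000);
        }
        return elapsedMs;
    }

    /** Mean of all runs except the first, which is treated as warm-up. */
    static double steadyStateMeanMs(List<Long> elapsedMs) {
        if (elapsedMs.size() < 2) {
            throw new IllegalArgumentException("need at least two runs");
        }
        return elapsedMs.subList(1, elapsedMs.size()).stream()
                .mapToLong(Long::longValue)
                .average()
                .getAsDouble();
    }
}
```

Reporting the first run and the steady-state mean side by side keeps one-off costs, such as the extra active_table lookup, from being mistaken for per-query overhead.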
Second SQL Running
select POI_RECORD_ID, POI_ID, LATITUDE, …, locality, locale
from us_ta_1
where node_index in ( ?, ?, … ? )

   Elapsed time per run (ms)

   Run          JDBC       DAL
   First run    375        1156
   2            406        313
   3            375        281
   4            391        375
   5            375        266
   6            406        297

First Declaration revisited: once the DAL is warmed up, it is not slower than JDBC for this query.
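The node_index in ( ?, ?, … ? ) query needs one placeholder per node, so the statement text depends on how many nodes cover the search area. A minimal JDBC sketch of building and binding it follows; the class name is hypothetical, the select list is shortened to select *, and treating node_index as a long is an assumption.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; assumes node_index values fit in a long.
final class NodeQuerySketch {
    /** Builds the query text with one "?" per node. */
    static String buildSql(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("at least one node is required");
        }
        String placeholders = String.join(", ", Collections.nCopies(nodeCount, "?"));
        return "select * from us_ta_1 where node_index in (" + placeholders + ")";
    }

    /** Prepares the statement and binds every node index. */
    static PreparedStatement prepare(Connection connection, List<Long> nodes) throws SQLException {
        PreparedStatement ps = connection.prepareStatement(buildSql(nodes.size()));
        for (int i = 0; i < nodes.size(); i++) {
            ps.setLong(i + 1, nodes.get(i)); // JDBC parameter indexes are 1-based
        }
        return ps;
    }
}
```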
Benchmark Data
• It's slow, but how slow?
   – A single SQL run is only a smoke test; we want real data
Benchmark Data
• Test Case
   – Run 10k POI; for each POI:
       • DedupWorkPoiDao.getAdjacentDedupPois to get the candidate POI for matching
       • IDeDuper.getDuplications(target, candidate) to find matches among the candidates
   – Search distance: 100 meters
   – Repeat the test 3 times

• Test Result

      Process Time (total, per POI)   DB Time   Dedup Time   Dedup candidate POI size   Matched POI ratio
      0:01:46, 10ms                   4ms       6ms          51                         0.63

      6387 POI were matched
Second Declaration

Dedup is the most important factor in the process; the DB is not the bottleneck.
The truth
    • The DB is fast because of caching

#    Distance (m)   Process Time (total, per POI)   DB Parameters   DB Time   Dedup Time   Dedup candidate POI size   Matched POI ratio
     100            2min30s, 14ms                   4               4ms       9ms          80                         0.68
1    500            30m, 180ms                      37              128ms     51ms         474                        0.79
2    500            11m38s, 69ms                                    18ms      51ms

At 500 meters there are 37 nodes in a single query, and each POI needs to be compared with 474 candidates.
The second (later) run is much faster than the first run.
The truth
• Clear the MySQL caches & restart MySQL
   – key_buffer_size: 500m -> 8 bytes
   – query_cache_size: 64m -> 0

• No effect: the DB query is still fast.
   – The first-run time cannot be reproduced for the
     same data set.
The truth
• Clear the OS (Linux) file cache
  – echo 3 > /proc/sys/vm/drop_caches

• Test Result

    Process Time            DB Time   Dedup Time    Dedup candidate POI size   Matched POI ratio
    0:01:46, 10ms           4ms       6ms           51                         0.63
    0:22:58.844 (db only)   137ms     / (removed)   51                         /

DB access is about 30 times slower once the OS file cache is cleared (137ms vs 4ms per POI),
so the Second Declaration (dedup is the most important factor in the process; the DB is not
the bottleneck) holds only while the data is cached.
Mysql Index Preloading
• MySQL index preloading
  – key_buffer_size 4096m
  – load index into cache us_ta_1 (INDEX_NODEX_INDEX);

• Nearly no effect: the DB query time is nearly the same.
Data file is bottleneck
• The key index does not seem to help, so the
  assumption is that the bottleneck is in reading
  the data file.
• Verify
  – 1) Reorder the 23 million records along a Hilbert curve, so that
    neighboring POI are also adjacent on disk, reducing disk
    seeks
  – 2) Build a new table where each row is <node, POI in the
    node>, reducing the number of I/O operations needed to read one node's POI
Data file is bottleneck
• Reorder the POI in the DB
   insert into us_ta_2 (select * from us_ta_1 order by node_index)

• Test Result

                  Process Time            DB Time   Dedup Time    Dedup candidate POI size   Matched POI ratio
                  0:01:46, 10ms           4ms       6ms           51                         0.63
     First run    0:22:58.844 (db only)   137ms     / (removed)   51                         /
     First run    0:03:10.985 (db only)   19ms      /             51                         /
                  0:00:46.360 (db only)   4ms       / (removed)   51                         /
Multiple-Thread
• DB

                     Process Time (db only)   DB Time   Dedup Time   Dedup candidate POI size   Matched POI ratio
   1 thread          0:03:10.985              19ms      /            51                         /
   4 threads         0:01:05.406              24ms      /            51                         /
   8 threads         0:00:38.328              29ms      /            51                         /

• DB & Dedup

                                    Process Time   DB Time   Dedup Time   Dedup candidate POI size   Matched POI ratio
   1 thread                         0:04:07.125    18ms      5ms          51                         0.6387
   4 threads db, 2 threads dedup    0:01:11.328    25ms      9ms          51                         0.6387
   4 threads db, 1 thread dedup     0:01:22.953    28ms      7ms          51                         0.6387
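One way to run the DB query and the dedup computation on separate thread pools, as in the "4 threads db, 2 threads dedup" rows, is sketched below with the standard CompletableFuture API. The class name and the two function parameters are hypothetical stand-ins for the real candidate lookup and matcher; this is not the code that produced the measurements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical pipeline: fetchCandidates is the DB stage, dedup is the CPU stage.
final class DedupPipelineSketch {
    static <T, C, R> List<R> run(List<T> targets,
                                 Function<T, C> fetchCandidates,
                                 BiFunction<T, C, R> dedup,
                                 int dbThreads,
                                 int dedupThreads) {
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);
        ExecutorService dedupPool = Executors.newFixedThreadPool(dedupThreads);
        try {
            List<CompletableFuture<R>> futures = new ArrayList<>();
            for (T target : targets) {
                futures.add(CompletableFuture
                        // Stage 1: the I/O-bound candidate query runs on the DB pool.
                        .supplyAsync(() -> fetchCandidates.apply(target), dbPool)
                        // Stage 2: the CPU-bound comparison runs on the dedup pool.
                        .thenApplyAsync(candidates -> dedup.apply(target, candidates), dedupPool));
            }
            List<R> results = new ArrayList<>();
            for (CompletableFuture<R> future : futures) {
                results.add(future.join()); // results keep the order of the input targets
            }
            return results;
        } finally {
            dbPool.shutdown();
            dedupPool.shutdown();
        }
    }
}
```

Sizing the two pools independently lets the DB stage use more threads than there are cores, since it mostly waits on I/O, while the dedup pool stays close to the core count.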
Another assumption

Assumption:
Building a local cache and processing POI in Hilbert curve order would help greatly.

Cache:
<node, POI in the node>

DB Query:
Get the POI in the given nodes

Query:
- Pick up the nodes that are in the local cache
- DB query: only the nodes that are not in the local cache
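The lookup described above can be sketched as a map from node to its POI list, with the DB consulted only for the nodes that are missing. All names are hypothetical, the loader function stands in for the real node query, eviction is omitted, and the node list passed to one call is assumed to contain no duplicates.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical <node, POI in the node> cache; unbounded, no eviction.
final class NodeCacheSketch<P> {
    private final Map<Long, List<P>> cache = new HashMap<>();
    private final Function<Set<Long>, Map<Long, List<P>>> dbLoader;
    private long hits;
    private long lookups;

    NodeCacheSketch(Function<Set<Long>, Map<Long, List<P>>> dbLoader) {
        this.dbLoader = dbLoader;
    }

    /** Returns the POI of all given nodes, querying the DB only for uncached nodes. */
    List<P> candidates(Collection<Long> nodes) {
        Set<Long> missing = new LinkedHashSet<>();
        for (Long node : nodes) {
            lookups++;
            if (cache.containsKey(node)) {
                hits++;
            } else {
                missing.add(node);
            }
        }
        if (!missing.isEmpty()) {
            Map<Long, List<P>> loaded = dbLoader.apply(missing);
            for (Long node : missing) {
                // Empty nodes are cached too, so they are not queried again.
                cache.put(node, loaded.getOrDefault(node, Collections.<P>emptyList()));
            }
        }
        List<P> result = new ArrayList<>();
        for (Long node : nodes) {
            result.addAll(cache.get(node));
        }
        return result;
    }

    /** Fraction of node lookups served from the cache. */
    double hitRatio() {
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }
}
```

Processing targets in Hilbert curve order is what makes such a cache worthwhile: consecutive targets then tend to share nodes, so a node loaded for one target is likely to be reused by the next.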
Hilbert Curve

A Hilbert curve gives a mapping between 1D and 2D space that preserves locality fairly well.
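For reference, the widely used iterative conversion from a grid cell (x, y) to its distance along the Hilbert curve looks like this. The class name is hypothetical, and the deck does not say which implementation was used, so treat this as a sketch of the general technique.

```java
// Standard iterative Hilbert mapping; n is the grid side length and must be a power of two.
final class HilbertCurve {
    /** Distance along the Hilbert curve of cell (x, y) in an n x n grid, with 0 <= x, y < n. */
    static long xy2d(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            // Rotate or flip the quadrant so the sub-curve has the standard orientation.
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                int t = x;
                x = y;
                y = t;
            }
        }
        return d;
    }
}
```

Sorting POI by xy2d of their grid cell yields a processing order in which consecutive POI are usually close together on the map.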
Hilbert Curve
[Figure: 5k POI plotted in DB ordering (left) and in Hilbert curve ordering (right)]
The truth
The OS file cache is not cleared for these runs.

#           Distance (m)   Total Time   DB Parameters   DB Time   Dedup candidate POI size   Cache hit ratio
first run   100            47s          4               4.7ms     80
            100, cache     41s          4               4.1ms     80                         5% (1679/40986)
first run   100, cache     48s          4               4.8ms     80                         5%
            100            41s          4               4.1ms     80
            500, cache                  37              11ms      474                        11%
            500                         37              18ms      474

Assumption (revised):
Building a local cache and processing POI in Hilbert curve order gives some help, rather than
great help, when the data is not so sparse.
Summary
• The SQL itself is very simple; is there any tuning point?
              select * from us_ta_1 where node_index in ( ?, ? , ?...)

• Multi-threading is necessary to increase
  throughput
  – Separate dedup and DB query (dedup is also
    time-consuming when the candidate set is large)
Jump out of the box
•   A new <node, POI> table
•   NoSQL storage with spatial support <node, POI>
•   CoSE to search candidates
•   Hadoop (Map-Reduce)
Performance Tuning Tips
• Test to verify each assumption
• Make the environment as close to the real one as
  possible
   – Do not mock
   – Do not talk to the US DB from CN
• Repeat each test until the result is consistent (the result can be
  reproduced)
• Do not ignore any exceptional case (e.g. the first run being
  slower than later runs)
• Consider both the (MySQL) client side and the server side
