Data migration into EAV model
by Oleg Kulik, Gorilla Group
The Problem
• Our task is to import ~1M entities into an EAV model.
• Standard imports add significant overhead to the processing of each line item:
o the app validates the entity
o the app creates the import directive
o MySQL parses the query
o MySQL validates the row (constraints)
• This works well for a small number of import items (~10k). It works poorly for a large number of items (>100k).
What do we want?
• remove as much validation as possible without harming database integrity
• minimize app involvement, to avoid possible memory leaks and cut the time required to assemble the import directive
• still have the app decide how to process our file, without the need for manual pre-processing
How to achieve our goals
• Move away from the Resource save scheme
• Use bulk data loading
• Trust our sources
• Create a mechanism for connecting data after bulk loads
EAV Resource save
• Data validation, assembling insert queries
• Insert query parsing, constraint validation
• Data loaded at row level
Loading data with “LOAD DATA INFILE”
• No validating or assembling layer
• Bulk data loads
• Less query parsing; constraints are left in place for data integrity
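As a rough illustration of the bulk-load step, the sketch below only assembles a LOAD DATA LOCAL INFILE statement as text; it does not connect to MySQL. The helper name build_load_statement, the file path, and the column list are assumptions made for this example and are not taken from the test app.

```python
def build_load_statement(path, table, columns, field_sep=r"\t", ignore_lines=0):
    """Return a LOAD DATA LOCAL INFILE statement as a string.

    field_sep is written the way MySQL expects it inside quotes
    (for example r"\t" for a tab or "," for a comma).
    Nothing is executed here; the caller decides how to run the statement.
    """
    # Escape backslashes and single quotes so the path is a valid SQL literal.
    quoted_path = path.replace("\\", "\\\\").replace("'", "\\'")
    parts = [
        f"LOAD DATA LOCAL INFILE '{quoted_path}'",
        f"INTO TABLE `{table}`",
        f"FIELDS TERMINATED BY '{field_sep}'",
        r"LINES TERMINATED BY '\n'",
    ]
    if ignore_lines:
        # Skips a header line (or several) at the top of the file.
        parts.append(f"IGNORE {ignore_lines} LINES")
    parts.append("(" + ", ".join(f"`{c}`" for c in columns) + ")")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical file path and column, used only for illustration.
    print(build_load_statement("/tmp/entity.txt", "actor_entity", ["uin"]))
```

Building the statement in the app and handing the whole file to MySQL keeps the app's role to deciding what to load, which is the split the slides argue for.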
Pros
• we use a tool that was designed for bulk import from files
• it is tuned to work fast with large amounts of data
• we keep some control over data integrity at the MySQL level
Cons
• no control over the quality of the incoming data (a check can be added as a pre-processing step)
• high risk of duplicating data or losing integrity (a check can be added as a post-processing step, but it adds a lot of time)
• this makes the method questionable when the data source is unpredictable
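The pre-processing step mentioned in the cons could look something like the sketch below: a single pass that rejects lines with the wrong number of fields, an empty key, or a key already seen. The function name filter_rows, the tab delimiter, and the choice of the first field as the unique key are assumptions for illustration only.

```python
def filter_rows(lines, expected_columns, delimiter="\t"):
    """Split lines into (kept, rejected) before a bulk load.

    A line is rejected when its field count is wrong, its first field
    (assumed to be the unique key) is empty, or the key was already seen.
    """
    seen = set()
    kept, rejected = [], []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        key = fields[0]
        if len(fields) != expected_columns or not key or key in seen:
            rejected.append(line)
            continue
        seen.add(key)
        kept.append(line)
    return kept, rejected


if __name__ == "__main__":
    sample = ["1\tTom\n", "1\tTim\n", "2\tAnn\n", "3\n"]
    kept, rejected = filter_rows(sample, expected_columns=2)
    print(len(kept), "kept,", len(rejected), "rejected")
```

The set of seen keys lives in memory, so this pass trades some of the memory savings of the bulk approach for cleaner input.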
Getting your hands dirty
• test app @ https://github.com/SlayerBirden/migration.git
• 2 tables: actor_entity, actor_data; unique field “uin”
• foreign key from actor_data to actor_entity
• file columns: uin, name, lastname, age, movie
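One way to picture how a flat source line maps onto the two tables is sketched below: each line yields one entity row keyed by uin and one attribute/value row per remaining column. This layout is an assumption about a typical EAV shape, not a description of how the test app stores its data; the function name split_rows and the tab delimiter are likewise hypothetical.

```python
import csv

# Column order as listed on the slide.
COLUMNS = ["uin", "name", "lastname", "age", "movie"]


def split_rows(lines, delimiter="\t"):
    """Turn flat source lines into entity rows and attribute/value rows."""
    entity_rows = []
    data_rows = []
    for record in csv.reader(lines, delimiter=delimiter):
        row = dict(zip(COLUMNS, record))
        uin = row.pop("uin")
        entity_rows.append((uin,))
        # Every remaining column becomes one (entity key, attribute, value) row.
        for attribute, value in row.items():
            data_rows.append((uin, attribute, value))
    return entity_rows, data_rows


if __name__ == "__main__":
    entities, data = split_rows(["1001\tTom\tHanks\t57\tBig"])
    print(entities)
    print(data)
```

Keeping uin on every data row is what would let the data rows be connected to their entity after both bulk loads have finished.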
Some test results
System info:
memory: 2 banks of DIMM DDR3 Synchronous 1333 MHz (0.8 ns), 4GB
cpu: Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
MySQL version: 5.5.35-0ubuntu0.12.04.2
100k rows
oleg@oleg-Aspire-XC600:/var/www/migration$ php importer.php -h xxxx -u xxxx -p xxxx -db test -f test.txt
100000 Entity rows imported.
IMPORT ENTITY TIME: 6.5943 seconds
100000 Data rows imported.
IMPORT DATA TIME: 10.9832 seconds
PROCESS TIME: 24.3128 seconds
PHP MEMORY USED: 1.13 kB
PHP MEMORY PEAK: 294.98 kB
oleg@oleg-Aspire-XC600:/var/www/migration$
1M rows
oleg@oleg-Aspire-XC600:/var/www/migration$ php importer.php -h 172.20.3.227 -u oleg -p test123 -db test -f test.txt
1000000 Entity rows imported.
IMPORT ENTITY TIME: 141.5386 seconds
1000000 Data rows imported.
IMPORT DATA TIME: 168.1476 seconds
PROCESS TIME: 363.1716 seconds
oleg@oleg-Aspire-XC600:/var/www/migration$
The 5M-row run was a fail :) mysqld started swapping.
Some more test results for a stronger machine
System info:
memory: 3 banks of DIMM DDR3 1600 MHz, 8GB (2) and 4GB (1)
cpu: Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
MySQL version: 5.6.13-log
SSD: OCZ-VECTOR
100k rows
c:\apache\htdocs\migration>php importer.php -h localhost -u root -db test -f test.txt
100000 Entity rows imported.
IMPORT ENTITY TIME: 1.1041 seconds
100000 Data rows imported.
IMPORT DATA TIME: 1.1321 seconds
PROCESS TIME: 5.7513 seconds
1M rows
c:\apache\htdocs\migration>php importer.php -h localhost -u root -db test -f test.txt
1000000 Entity rows imported.
IMPORT ENTITY TIME: 14.2068 seconds
1000000 Data rows imported.
IMPORT DATA TIME: 10.5776 seconds
PROCESS TIME: 60.2454 seconds
5M rows
c:\apache\htdocs\migration>php importer.php -h localhost -u root -db test -f test.txt
5000000 Entity rows imported.
IMPORT ENTITY TIME: 89.3361 seconds
5000000 Data rows imported.
IMPORT DATA TIME: 62.1726 seconds
PROCESS TIME: 325.9186 seconds
Playing with innodb_io_capacity
500k rows
innodb_io_capacity=200, innodb_io_capacity_max=2000
500000 Entity rows imported.
IMPORT ENTITY TIME: 18.9711 seconds
500000 Data rows imported.
IMPORT DATA TIME: 11.8517 seconds
PROCESS TIME: 48.3198 seconds
innodb_io_capacity=2000, innodb_io_capacity_max=20000
500000 Entity rows imported.
IMPORT ENTITY TIME: 7.6654 seconds
500000 Data rows imported.
IMPORT DATA TIME: 4.3602 seconds
PROCESS TIME: 29.8597 seconds
innodb_io_capacity=20000, innodb_io_capacity_max=30000
500000 Entity rows imported.
IMPORT ENTITY TIME: 7.6674 seconds
500000 Data rows imported.
IMPORT DATA TIME: 4.3112 seconds
PROCESS TIME: 29.6327 seconds
Tests for Resource-type save (for comparison)
System info:
memory: 3 banks of DIMM DDR3 1600 MHz, 8GB (2) and 4GB (1)
cpu: Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
MySQL version: 5.6.13-log
SSD: OCZ-VECTOR
50k rows
c:\apache\htdocs\migration>php resource.php -h localhost -u root -db test -f test.txt
All rows imported
PROCESS TIME: 196.1622 seconds
MEMORY USED: 0.80 kB
MEMORY PEAK: 186.94 kB
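To put the runs side by side, the short sketch below converts each reported PROCESS TIME into rows per second. The row counts and timings are copied from the results above; the labels and the helper name rows_per_second are added here for readability.

```python
# (label, rows imported, PROCESS TIME in seconds), all taken from the slides.
RUNS = [
    ("bulk load, i5 / MySQL 5.5, 100k rows", 100_000, 24.3128),
    ("bulk load, i5 / MySQL 5.5, 1M rows", 1_000_000, 363.1716),
    ("bulk load, i7 / MySQL 5.6, 100k rows", 100_000, 5.7513),
    ("bulk load, i7 / MySQL 5.6, 1M rows", 1_000_000, 60.2454),
    ("bulk load, i7 / MySQL 5.6, 5M rows", 5_000_000, 325.9186),
    ("Resource-type save, i7 / MySQL 5.6, 50k rows", 50_000, 196.1622),
]


def rows_per_second(rows, seconds):
    """Average throughput over the whole process time."""
    return rows / seconds


if __name__ == "__main__":
    for label, rows, seconds in RUNS:
        print(f"{label:<46} {rows_per_second(rows, seconds):>8.0f} rows/s")
```

On these figures the bulk loads on the stronger machine run at roughly 15,000 to 17,000 rows per second, against about 255 rows per second for the Resource-type save. The runs use different row counts, so the comparison is only indicative.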
Conclusion
Use this method if
• the data volume is huge (> 100k rows)
• performance is a key point
• the data source is predictable
• data integrity is not an absolute requirement (for EAV)
Thank you!
