2. The Problem
• Our task is to import ~1M entities into an EAV model.
• Standard imports add high overhead while processing each line item:
o the app validates the entity
o the app creates the import directive
o MySQL parses the query
o MySQL validates the row (constraints)
• This works well for a small number of import items (~10k). It works poorly for a large number of items (>100k).
3. What do we want?
• remove as much validation as possible without harming database integrity
• minimize the work done in the app, to avoid possible memory leaks and cut the time needed to assemble the import directive
• still let the app decide how to process our file, with no need for manual pre-processing
4. How to achieve our goals
• Move away from the Resource save scheme
• Use bulk data loading
• Trust our sources
• Create a mechanism for connecting data after bulk loads
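One way to picture the last bullet is to load both tables keyed only by the natural key, then resolve the foreign key afterwards in a single set-based statement. The sketch below uses Python's built-in sqlite3 as a stand-in for MySQL. The column names `entity_id`, `attribute` and `value` are assumptions made for illustration; they are not taken from the test app.

```python
import sqlite3


def load_and_connect(rows):
    """Bulk-insert entities and attribute rows, then link them by uin.

    rows: iterable of (uin, attribute, value) tuples.
    Returns the open connection so the caller can inspect the result.
    """
    rows = list(rows)
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: only the table names and "uin" come from the slides.
    conn.executescript(
        """
        CREATE TABLE actor_entity (
            entity_id INTEGER PRIMARY KEY,
            uin TEXT NOT NULL UNIQUE
        );
        CREATE TABLE actor_data (
            entity_id INTEGER REFERENCES actor_entity(entity_id),
            uin TEXT NOT NULL,
            attribute TEXT NOT NULL,
            value TEXT
        );
        """
    )
    # Step 1: bulk-load entities; the UNIQUE constraint skips repeated uins.
    conn.executemany(
        "INSERT OR IGNORE INTO actor_entity (uin) VALUES (?)",
        [(uin,) for uin, _, _ in rows],
    )
    # Step 2: bulk-load data rows without knowing entity_id yet.
    conn.executemany(
        "INSERT INTO actor_data (uin, attribute, value) VALUES (?, ?, ?)",
        rows,
    )
    # Step 3: connect the two loads in one statement instead of row by row.
    conn.execute(
        """
        UPDATE actor_data
        SET entity_id = (
            SELECT e.entity_id FROM actor_entity e WHERE e.uin = actor_data.uin
        )
        """
    )
    conn.commit()
    return conn
```

The point of the design is that the application never looks up an entity id per row; the database does the matching once, after both loads have finished.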
5. EAV Resource save
• Data validation, assembling insert queries
• Insert query parsing, constraint validation
• Data load on row level
6. Loading data with “LOAD DATA INFILE”
• No validating or assembling layer
• Bulk data loads
• Less query parsing; constraints are left in place for data integrity
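To make the statement concrete, here is a minimal sketch of how an application might assemble a single `LOAD DATA INFILE` query for a whole file. It assumes a tab-delimited file with one record per line; the helper name `build_load_data_sql` and the example path are hypothetical and not part of the test app.

```python
def build_load_data_sql(path, table, columns, local=True):
    """Return a MySQL LOAD DATA INFILE statement for a tab-delimited file.

    Identifiers are restricted to letters, digits and underscores because
    they are interpolated into the statement rather than bound as parameters.
    """
    for name in [table, *columns]:
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {name!r}")
    # Escape backslashes and single quotes inside the quoted file name.
    escaped_path = path.replace("\\", "\\\\").replace("'", "\\'")
    local_kw = "LOCAL " if local else ""
    column_list = ", ".join(f"`{c}`" for c in columns)
    return (
        f"LOAD DATA {local_kw}INFILE '{escaped_path}' "
        f"INTO TABLE `{table}` "
        "FIELDS TERMINATED BY '\\t' "
        "LINES TERMINATED BY '\\n' "
        f"({column_list})"
    )


if __name__ == "__main__":
    # Hypothetical path and column list, for illustration only.
    print(build_load_data_sql("/tmp/entity.txt", "actor_entity", ["uin"]))
```

The application builds one statement per file instead of one insert per row, so the per-row assembling cost in the app disappears.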
7. Pros
• we use a tool that was designed for bulk import from files
• it is tuned to work fast with large amounts of data
• we keep some control over data integrity at the MySQL level
8. Cons
• no control over incoming data quality (this can be added as a pre-processing step)
• high risk of duplicating data or losing integrity (checks can be added as a post-processing step, but that adds a lot of time)
• this makes the method questionable when the data source is unpredictable
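The pre-processing step mentioned in the first bullet could look roughly like the sketch below. It assumes a tab-delimited file with the five columns used by the test app (uin, name, lastname, age, movie). The rejection rules, including treating a repeated uin as a duplicate, are assumptions chosen for illustration.

```python
import csv

EXPECTED_COLUMNS = 5  # uin, name, lastname, age, movie


def preprocess(lines):
    """Split tab-delimited lines into (clean_rows, rejected_rows).

    rejected_rows holds (row, reason) pairs so that bad input can be reported
    instead of silently dropped.
    """
    seen_uins = set()
    clean, rejected = [], []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != EXPECTED_COLUMNS:
            rejected.append((row, "wrong column count"))
        elif not row[0]:
            rejected.append((row, "empty uin"))
        elif not row[3].isdigit():
            rejected.append((row, "age is not a number"))
        elif row[0] in seen_uins:
            # Assumption: uin must be unique across the whole file.
            rejected.append((row, "duplicate uin"))
        else:
            seen_uins.add(row[0])
            clean.append(row)
    return clean, rejected
```

Keeping every seen uin in memory is itself a cost at millions of rows, which is part of why the slides treat such checks as optional.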
9. Getting your hands dirty
• test app @ https://github.com/SlayerBirden/migration.git
• 2 tables: actor_entity, actor_data; unique field “uin”
• foreign key from actor_data to actor_entity
• file columns: uin, name, lastname, age, movie
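For readers who want to reproduce the setup without the original input file, the following sketch generates a file with the same five columns. The tab delimiter, the uin format and the sample values are all assumptions; the real test file may look different.

```python
import random

# Hypothetical sample values, used only to fill the columns.
FIRST_NAMES = ["Tom", "Ann", "Joe", "Eva"]
LAST_NAMES = ["Smith", "Lee", "Fox", "Ray"]
MOVIES = ["Movie A", "Movie B", "Movie C"]


def generate_rows(count, seed=0):
    """Yield `count` tab-delimited lines: uin, name, lastname, age, movie."""
    rng = random.Random(seed)  # fixed seed keeps test files reproducible
    for i in range(count):
        uin = f"A{i:07d}"  # sequential, so uniqueness is guaranteed
        yield "\t".join(
            [
                uin,
                rng.choice(FIRST_NAMES),
                rng.choice(LAST_NAMES),
                str(rng.randint(18, 90)),
                rng.choice(MOVIES),
            ]
        )


def write_test_file(path, count, seed=0):
    """Write the generated lines to `path`, one record per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for line in generate_rows(count, seed):
            handle.write(line + "\n")
```

Calling `write_test_file("test.txt", 100000)` would produce an input of the same size as the first test run.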
10. Some test results
System info:
memory: 2 banks of DIMM DDR3 Synchronous 1333 MHz (0.8 ns), 4GB
cpu: Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
MySQL version: 5.5.35-0ubuntu0.12.04.2
100k rows
oleg@oleg-Aspire-XC600:/var/www/migration$ php importer.php -h xxxx -u xxxx -p xxxx -db test -f test.txt
100000 Entity rows imported.
IMPORT ENTITY TIME: 6.5943 seconds
100000 Data rows imported.
IMPORT DATA TIME: 10.9832 seconds
PROCESS TIME: 24.3128 seconds
PHP MEMORY USED: 1.13 kB
PHP MEMORY PEAK: 294.98 kB
oleg@oleg-Aspire-XC600:/var/www/migration$
11. 1M rows
oleg@oleg-Aspire-XC600:/var/www/migration$ php importer.php -h 172.20.3.227 -u oleg -p test123 -db test -f test.txt
1000000 Entity rows imported.
IMPORT ENTITY TIME: 141.5386 seconds
1000000 Data rows imported.
IMPORT DATA TIME: 168.1476 seconds
PROCESS TIME: 363.1716 seconds
oleg@oleg-Aspire-XC600:/var/www/migration$
5M rows failed: mysqld started swapping.
12. Some more test results for a stronger machine
System info:
memory: 3 banks of DIMM DDR3 1600 MHz, 8GB (2) and 4GB (1)
cpu: Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
MySQL version: 5.6.13-log
SSD: OCZ-VECTOR
100k rows
c:\apache\htdocs\migration>php importer.php -h localhost -u root -db test -f test.txt
100000 Entity rows imported.
IMPORT ENTITY TIME: 1.1041 seconds
100000 Data rows imported.
IMPORT DATA TIME: 1.1321 seconds
PROCESS TIME: 5.7513 seconds
13. 1M rows
c:\apache\htdocs\migration>php importer.php -h localhost -u root -db test -f test.txt
1000000 Entity rows imported.
IMPORT ENTITY TIME: 14.2068 seconds
1000000 Data rows imported.
IMPORT DATA TIME: 10.5776 seconds
PROCESS TIME: 60.2454 seconds
5M rows
c:\apache\htdocs\migration>php importer.php -h localhost -u root -db test -f test.txt
5000000 Entity rows imported.
IMPORT ENTITY TIME: 89.3361 seconds
5000000 Data rows imported.
IMPORT DATA TIME: 62.1726 seconds
PROCESS TIME: 325.9186 seconds
Playing with innodb_io_capacity
500k rows
innodb_io_capacity=200, innodb_io_capacity_max=2000
500000 Entity rows imported.
IMPORT ENTITY TIME: 18.9711 seconds
500000 Data rows imported.
IMPORT DATA TIME: 11.8517 seconds
PROCESS TIME: 48.3198 seconds
innodb_io_capacity=2000, innodb_io_capacity_max=20000
500000 Entity rows imported.
IMPORT ENTITY TIME: 7.6654 seconds
500000 Data rows imported.
IMPORT DATA TIME: 4.3602 seconds
PROCESS TIME: 29.8597 seconds
innodb_io_capacity=20000, innodb_io_capacity_max=30000
500000 Entity rows imported.
IMPORT ENTITY TIME: 7.6674 seconds
500000 Data rows imported.
IMPORT DATA TIME: 4.3112 seconds
PROCESS TIME: 29.6327 seconds
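To read the three runs side by side, the short calculation below expresses each total process time as a change relative to the first run. It uses only the figures reported above; the helper name `relative_change` is an illustrative choice.

```python
# PROCESS TIME in seconds for 500k rows, keyed by
# (innodb_io_capacity, innodb_io_capacity_max), as reported above.
PROCESS_TIME = {
    (200, 2000): 48.3198,
    (2000, 20000): 29.8597,
    (20000, 30000): 29.6327,
}


def relative_change(baseline, value):
    """Return the change from baseline to value, in percent."""
    return (value - baseline) / baseline * 100.0


baseline = PROCESS_TIME[(200, 2000)]
for (capacity, capacity_max), seconds in PROCESS_TIME.items():
    change = relative_change(baseline, seconds)
    print(
        f"io_capacity={capacity:>5} max={capacity_max:>5}: "
        f"{seconds:8.4f} s ({change:+.1f}% vs. first run)"
    )
```

The tenfold increase from 200 to 2000 cuts the process time by well over a third, while the next tenfold increase changes it by less than one percent, which suggests that on this machine the setting stops being the bottleneck somewhere below 20000.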
14. Tests for Resource-type save (for comparison)
System info:
memory: 3 banks of DIMM DDR3 1600 MHz, 8GB (2) and 4GB (1)
cpu: Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
MySQL version: 5.6.13-log
SSD: OCZ-VECTOR
50k rows
c:\apache\htdocs\migration>php resource.php -h localhost -u root -db test -f test.txt
All rows imported
PROCESS TIME: 196.1622 seconds
MEMORY USED: 0.80 kB
MEMORY PEAK: 186.94 kB
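Because the Resource-type run used 50k rows and the closest bulk-load run on the same machine used 100k, the two are easier to compare as throughput. The sketch below does that arithmetic from the reported process times. Treat the ratio as a rough indication only, since the row counts differ.

```python
def rows_per_second(rows, seconds):
    """Return average throughput for a run."""
    return rows / seconds


# Figures reported for the stronger machine.
resource_save = rows_per_second(50_000, 196.1622)  # Resource-type save
bulk_load = rows_per_second(100_000, 5.7513)       # bulk load, 100k rows

print(f"Resource-type save: {resource_save:10.1f} rows/s")
print(f"Bulk load:          {bulk_load:10.1f} rows/s")
print(f"Ratio:              {bulk_load / resource_save:10.1f}x")
```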
15. Conclusion
Use this method if
• the data volume is huge (> 100k rows)
• performance is a key requirement
• the data source is predictable
• data integrity is not an absolute requirement (for EAV)