GNW01: In-Memory Processing for Databases

Gluent New World #01: In-Memory Processing for Databases with Tanel Poder


  1. In-Memory Execution for Databases. Tanel Poder, a long-time computer performance geek.
  2. Intro: About me • Tanel Põder • Oracle Database performance geek (18+ years) • Exadata performance geek • Linux performance geek • Hadoop performance geek (instant promotion) • CEO & co-founder: Gluent • Co-author: Expert Oracle Exadata book (2nd edition is out now!)
  3. Gluent as a data virtualization layer: it connects Oracle, Teradata, MSSQL, NoSQL and other big data sources with App X, App Y and App Z, using open data formats!
  4. Gluent Advisor: 1. Analyzes DB storage use and access patterns for safe offloading 2. 500+ databases analyzed 3. 10+ PB analyzed, 81% offloadable 4. 2-24x query speedup. Interested in analyzing your database? http://gluent.com/whitepapers
  5. "Tape is dead, disk is tape, flash is disk, RAM locality is king" (Jim Gray, 2006) http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt
  6. Seagate Cheetah 15k RPM disk specs: up to 200 MB/sec sequential throughput!
  7. Spinning disk IO throughput • B-tree index-walking, disk-based RDBMS: 15,000 RPM spinning disks, ~200 random IOPS per disk, ~8 kB read per random IO, so 8 kB * 200 IOPS = 1.6 MB/sec per disk (however, index scans can read only a subset of the data) • Full-scanning workloads: potentially much more data to access & filter; partition pruning, zone maps and storage indexes help to skip data (http://www.dbms2.com/2013/05/27/data-skipping/); scan only the required columns (formats with large chunk sizes); sequential IO rate up to 200 MB/sec per disk (see the sketch below)
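To make the random-vs-sequential gap concrete, here is a minimal back-of-the-envelope sketch in Python; the figures are simply the nominal per-disk numbers quoted on the slide, not measurements:

```python
# Back-of-the-envelope throughput of one 15k RPM disk (nominal figures from the slide).
random_iops = 200          # ~200 random IOs per second
io_size_kb = 8             # ~8 kB read per random IO (one block per index probe)
seq_mb_per_sec = 200       # sequential scan rate in MB/s

random_mb_per_sec = random_iops * io_size_kb / 1024   # ~1.6 MB/s
print(f"random IO scan rate : {random_mb_per_sec:.1f} MB/s")
print(f"sequential scan rate: {seq_mb_per_sec} MB/s")
print(f"sequential scanning is ~{seq_mb_per_sec / random_mb_per_sec:.0f}x faster per disk")
```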
  8. Scanning a bunch of spinning disks can keep your CPUs really busy! (Not even talking about flash or RAM here.)
  9. A simple query bottlenecked by CPU: 9 GB scanned and processed in 7 seconds, ~1300 MB/s across the parallel execution (PX) slaves, ~80 MB/s per slave
  10. A complex query bottlenecked by CPU: much more CPU spent on aggregations and joins; 9 GB processed in 1.5 minutes, 9 GB / 90 seconds ≈ 100 MB/s across the PX slaves, ~6 MB/s per slave
  11. If disks and storage subsystems are getting so fast, why all the buzz around in-memory database systems? Can’t we just cache the old database files in RAM?
  12. A simple data retrieval test! • Retrieve 1% of rows out of an 8 GB table: SELECT COUNT(*), SUM(order_total) FROM orders WHERE warehouse_id BETWEEN 500 AND 510 • The warehouse IDs range between 1 and 999 • Test data generated by the SwingBench tool
  13. Data retrieval: test results • Remember, this is a very simple scanning + filtering query:

      TESTNAME                    PLAN_HASH    ELA_MS   CPU_MS      LIOS  BLK_READ
      -------------------------  ----------  --------  -------  --------  --------
      test1: index range scan      16715356    265203    37438    782858    511231
      test2: full buffered        630573765    132075    48944   1013913    849316
      test3: full direct path     630573765     15567    11808   1013873   1013850
      test4: full smart scan      630573765      2102      729   1013873   1013850
      test5: full inmemory scan   630573765       155      155        14         0
      test6: full buffer cache    630573765      7850     7831   1014741         0

      Tests 5 & 6 run entirely from memory. But why the ~50x difference in CPU usage? Source: http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action
  14. "Tape is dead, disk is tape, flash is disk, RAM locality is king" (Jim Gray, 2006) http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt
  15. Latency Numbers Every Programmer Should Know

      L1 cache reference                              0.5 ns
      Branch mispredict                                 5 ns
      L2 cache reference                                7 ns                        14x L1 cache
      Mutex lock/unlock                                25 ns
      Main memory reference                           100 ns                        20x L2 cache, 200x L1 cache
      Compress 1K bytes with Zippy                  3,000 ns        3 us
      Send 1K bytes over 1 Gbps network            10,000 ns       10 us
      Read 4K randomly from SSD*                  150,000 ns      150 us            ~1 GB/sec SSD
      Read 1 MB sequentially from memory          250,000 ns      250 us
      Round trip within same datacenter           500,000 ns      500 us
      Read 1 MB sequentially from SSD*          1,000,000 ns    1,000 us    1 ms    ~1 GB/sec SSD, 4x memory
      Disk seek                                10,000,000 ns   10,000 us   10 ms    20x datacenter roundtrip
      Read 1 MB sequentially from disk         20,000,000 ns   20,000 us   20 ms    80x memory, 20x SSD
      Send packet CA->Netherlands->CA         150,000,000 ns  150,000 us  150 ms

      Source: https://gist.github.com/jboner/2841832
  16. CPU = fast, RAM = slow, with the CPU L2/L3 caches in between
  17. RAM access is the bottleneck of modern computers. Waits for RAM access show up as CPU usage in monitoring tools. Want to wait less? Do it less!
  18. CPU & cache friendly data structures are key! [Diagram: Oracle-style data block with header and ITL entries, a row directory of row offsets, and row pieces storing a header byte, lock byte, column-count byte and length-prefixed column values] • OLTP: block -> row -> column format, 8 kB blocks • Great for writes and changes • Field-length encoding: reading column #100 requires walking through all preceding columns (see the sketch below) • Columns (with similar values) are not densely packed together • Not CPU-cache friendly for analytics!
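A minimal sketch of why a late column is expensive to reach in a field-length-encoded row format. The layout below (one length byte per column, then the column bytes) is a hypothetical stand-in, not Oracle's actual block format, but it shows the same walking behaviour:

```python
import struct

def encode_row(values):
    """Hypothetical row format: each column is stored as <1-byte length><bytes>."""
    out = bytearray()
    for v in values:
        b = str(v).encode()
        out += struct.pack("B", len(b)) + b
    return bytes(out)

def read_column(row, col_no):
    """Reading column N means skipping over columns 0..N-1 one by one."""
    pos = 0
    for _ in range(col_no):
        length = row[pos]          # read the length byte of the current column...
        pos += 1 + length          # ...and jump over that column's data
    length = row[pos]
    return row[pos + 1 : pos + 1 + length].decode()

row = encode_row(["C123", "2016-04-20", 99.50, "SHIPPED"])
print(read_column(row, 3))   # must touch columns 0..2 before reaching column 3
```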
  19. Scanning columnar data structures [Diagram: scanning one column in a row-oriented data block touches every row and its other columns, while scanning a column in a column-oriented compression unit reads only that column's contiguous values] • Read the filter column(s) first; access only the projected columns if matches are found • Reduced memory traffic • More sequential RAM access, SIMD on adjacent data (a sketch follows below)
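A minimal sketch of the same filter run over a columnar layout versus a row-at-a-time loop; it uses NumPy arrays as the "columns" rather than any particular database engine, but the point is the same: only the filter and projected columns are touched, and they are scanned as contiguous memory:

```python
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)

# Column-oriented layout: each column is one contiguous array.
warehouse_id = rng.integers(1, 1000, n)     # IDs between 1 and 999, as in the test
order_total  = rng.uniform(1, 500, n)

# Read the filter column first, then fetch only matching rows of the projected column.
mask = (warehouse_id >= 500) & (warehouse_id <= 510)   # sequential scan of one column
print(int(mask.sum()), round(float(order_total[mask].sum()), 2))

# Row-oriented equivalent: every "row" (tuple) is visited and unpacked,
# even though only two columns are actually needed.
rows = list(zip(warehouse_id.tolist(), order_total.tolist()))
slow_sum = sum(t for w, t in rows if 500 <= w <= 510)
print(round(slow_sum, 2))
```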
  20. How to measure this stuff?
  21. CPU Performance Counters on Linux: measure what’s going on inside a CPU!

      # perf stat -d -p PID sleep 30

      Performance counter stats for process id '34783':

         27373.819908      task-clock                #   0.912 CPUs utilized
       86,428,653,040      cycles                    #   3.157 GHz
       32,115,412,877      instructions              #   0.37  insns per cycle
                                                     #   2.39  stalled cycles per insn
        7,386,220,210      branches                  # 269.828 M/sec
           22,056,397      branch-misses             #   0.30% of all branches
       76,697,049,420      stalled-cycles-frontend   #  88.74% frontend cycles idle
       58,627,393,395      stalled-cycles-backend    #  67.83% backend cycles idle
          256,440,384      cache-references          #   9.368 M/sec
          222,036,981      cache-misses              #  86.584% of all cache refs
          234,361,189      LLC-loads                 #   8.562 M/sec
          218,570,294      LLC-load-misses           #  93.26% of all LL-cache hits
           18,493,582      LLC-stores                #   0.676 M/sec
            3,233,231      LLC-store-misses          #   0.118 M/sec
        7,324,946,042      L1-dcache-loads           # 267.589 M/sec
          305,276,341      L1-dcache-load-misses     #   4.17% of all L1-dcache hits
           36,890,302      L1-dcache-prefetches      #   1.348 M/sec

         30.000601214 seconds time elapsed

      Metrics explained in my blog entry: http://bit.ly/1PBIlde
  22. Testing data access path differences on Oracle 12c: SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0 • Run the same query on the same dataset stored in different formats/layouts • Full details: http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles/ • Test result data: http://bit.ly/1RitNMr
  23. CPU instructions used for scanning/counting 69M rows [chart]
  24. Average CPU instructions per row processed • Knowing that the table has about 69M rows, I can calculate the average number of instructions issued per row processed (a worked example follows below)
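The arithmetic itself is simple; the real measured counts are on the chart, so the instruction totals below are made-up placeholders just to show the calculation:

```python
rows = 69_000_000
# Hypothetical instruction counts for two access paths (placeholders, not the measured values):
instructions = {"buffered full scan": 22_000_000_000, "in-memory scan": 900_000_000}
for test, insns in instructions.items():
    print(f"{test}: ~{insns / rows:.0f} instructions per row")
```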
  25. CPU cycles consumed (full scans only) [chart]
  26. CPU efficiency (instructions per cycle) • Yes, modern superscalar CPUs can execute multiple instructions per cycle
  27. Reducing memory writes within SQL execution • Old approach: 1. Read a compressed data chunk 2. Decompress the data (write it to a temporary memory location) 3. Filter out non-matching rows 4. Return data • New approach: 1. Read and filter the compressed columns 2. Decompress only the required columns of the matching rows 3. Return data (a sketch follows below)
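A minimal sketch of the "new approach" idea, with dictionary encoding standing in for whatever compression a real engine applies: the predicate is translated into dictionary space and evaluated directly on the compressed codes, so large decompressed temporaries are never written; only the matching rows of the projected column are materialized:

```python
import numpy as np

# Dictionary-compressed WAREHOUSE column: a small dictionary plus per-row integer codes.
dictionary  = np.array(["LONDON", "TALLINN", "DALLAS", "SINGAPORE"])
codes       = np.random.default_rng(1).integers(0, 4, 1_000_000).astype(np.uint8)
order_total = np.random.default_rng(2).uniform(1, 500, 1_000_000)

# Old approach (conceptually): decode every row, then filter.
# decoded = dictionary[codes]; mask = decoded == "TALLINN"   # writes a large temporary array

# New approach: map the predicate into dictionary space and filter the compressed codes.
wanted_code = int(np.where(dictionary == "TALLINN")[0][0])
mask = codes == wanted_code               # scan compressed data, few memory writes
print(round(float(order_total[mask].sum()), 2))   # materialize only matching rows
```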
  28. Memory reads & writes during internal processing [chart, unit = MB]: read only the requested columns; rows counted from chunk headers; scanning compressed data causes few memory writes
  29. Past & Future
  30. Some commercial column store history • Disk-optimized column stores: Expressway 103 / Sybase IQ (early ‘90s), MonetDB (early ‘90s), Oracle Hybrid Columnar Compression (disk/OLTP optimized), … • Memory-optimized column stores: …, SAP HANA (December 2010), IBM DB2 with BLU Acceleration (June 2013), Oracle Database 12c with the In-Memory Option (July 2014), … (Not addressing memory-optimized OLTP / row stores here)
  31. Future-proof open data formats! • Disk-optimized columnar data structures: Apache Parquet (https://parquet.apache.org/), Apache ORC (https://orc.apache.org/) • Memory / CPU-cache optimized data structures: Apache Arrow (https://arrow.apache.org/), not only a storage format but also a cross-system/cross-platform IPC communication framework (a small sketch follows below)
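As a quick illustration of these formats from a client's point of view, here is a small sketch using the pyarrow library (assumed to be installed; this is not something shown in the talk itself): build an in-memory Arrow table, write it out as Parquet, and read back a single column:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Arrow table (columnar, cache-friendly layout).
table = pa.table({
    "warehouse_id": pa.array([500, 501, 999], type=pa.int32()),
    "order_total":  pa.array([10.5, 99.0, 3.25]),
})

# Persist it as Parquet (disk-optimized columnar format) and read back only one column.
pq.write_table(table, "orders.parquet")
back = pq.read_table("orders.parquet", columns=["order_total"])
print(back.column("order_total").to_pylist())
```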
  32. Future: 1. RAM gets cheaper and bigger, not necessarily faster 2. CPU caches get larger 3. RAM blends with storage and becomes non-volatile 4. IO subsystems (flash) get even closer to the CPUs 5. IO latencies shrink 6. The latency difference between non-volatile storage and volatile RAM shrinks: new database layouts! 7. CPU cache is king: new data structures needed!
  33. References • Slides & video of this presentation: http://www.slideshare.net/tanelp, https://vimeo.com/gluent • Index range scans vs full scans: http://blog.tanelpoder.com/2014/09/17/about-index-range-scans-disk-re-reads-and-how-your-new-car-can-go-600-miles-per-hour/ • RAM is the new disk series: http://blog.tanelpoder.com/2015/08/09/ram-is-the-new-disk-and-how-to-measure-its-performance-part-1/, https://docs.google.com/spreadsheets/d/1ss0rBG8mePAVYP4hlpvjqAAlHnZqmuVmSFbHMLDsjaU/
  34. Thanks! http://gluent.com/whitepapers • We are hiring developers & data engineers! • http://blog.tanelpoder.com • tanel@tanelpoder.com • @tanelpoder