Low Level CPU Performance Profiling Examples

gluent.com 1
Low-Level CPU Performance Profiling Examples
Tanel Poder
A long-time computer performance geek
@tanelpoder
blog.tanelpoder.com

Intro: About me
• Tanel Põder
• RDBMS Performance geek 20+ years (Oracle)
• Unix/Linux Performance geek
• Hadoop Performance geek
• Spark Performance geek?
• http://blog.tanelpoder.com
• @tanelpoder
Expert Oracle Exadata (book)

Gluent
[Diagram: Gluent as a data virtualization layer connecting data sources (Oracle, Teradata, NoSQL, Big Data Sources, MSSQL) with applications (App X, App Y, App Z)]
A data sharing platform for enterprise applications

Some microscopic-level stuff to talk about…
1. Some things worth knowing about modern CPUs
2. Measuring internal CPU efficiency (C++)
3. A columnar database scanning example (Oracle)
4. Low-level analysis of Spark performance
• RDD vs DataFrame
• DataFrame with bad code
This is gonna be a (hopefully fun) hacking session!

“100%” busy?
A CPU close to 100% busy?
What if I told you your CPU is not that busy?

CPU Performance Counters on Linux
# perf stat -d -p PID sleep 30
Performance counter stats for process id '34783':
      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn
     7,386,220,210 branches                  #  269.828 M/sec
        22,056,397 branch-misses             #    0.30% of all branches
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle
    58,627,393,395 stalled-cycles-backend    #   67.83% backend cycles idle
       256,440,384 cache-references          #    9.368 M/sec
       222,036,981 cache-misses              #   86.584 % of all cache refs
       234,361,189 LLC-loads                 #    8.562 M/sec
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits
        18,493,582 LLC-stores                #    0.676 M/sec
         3,233,231 LLC-store-misses          #    0.118 M/sec
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec
      30.000601214 seconds time elapsed
Measure what’s going on inside a CPU!
Metrics explained in my blog entry: http://bit.ly/1PBIlde
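The derived ratios in the perf output can be recomputed from the raw counters. The following is a minimal sketch in Python (not part of the original slides); the variable names are illustrative and the values are copied from the output on this slide.

```python
# Raw counter values copied from the perf stat output on this slide.
cycles = 86_428_653_040
instructions = 32_115_412_877
stalled_frontend = 76_697_049_420
task_clock_ms = 27_373.819908

# Instructions retired per CPU cycle (IPC).
ipc = instructions / cycles

# Dividing the frontend stall counter by instructions reproduces
# the "stalled cycles per insn" figure shown in the output.
stalled_per_insn = stalled_frontend / instructions

# Share of all cycles in which the frontend was idle.
frontend_idle = stalled_frontend / cycles

# Cycles per nanosecond of CPU time, i.e. the effective clock in GHz.
ghz = cycles / (task_clock_ms * 1e6)

print(f"IPC:                     {ipc:.2f}")
print(f"stalled cycles per insn: {stalled_per_insn:.2f}")
print(f"frontend cycles idle:    {frontend_idle:.2%}")
print(f"effective clock:         {ghz:.3f} GHz")
```

An IPC well below 1 on a core that can retire several instructions per cycle is the reason a "busy" CPU may be doing little useful work.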

Modern CPUs can run multiple operations concurrently
http://software.intel.com
Multiple ports/execution units for computation & memory ops
If waiting for RAM – CPU pipeline stall!

Latency Numbers Every Programmer Should Know
Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
Source: https://gist.github.com/jboner/2841832
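To put the table in CPU terms, a latency can be converted into the number of clock cycles that pass while the core waits. This is a small sketch, not from the original slides; the 3 GHz clock is an assumption chosen for round numbers.

```python
# Latencies in nanoseconds, taken from the table above.
L1_NS = 0.5
L2_NS = 7
RAM_NS = 100

# Assumed clock speed, for illustration only.
CLOCK_GHZ = 3.0


def cycles_lost(latency_ns, clock_ghz=CLOCK_GHZ):
    """Clock cycles that elapse during latency_ns at the given clock speed."""
    return latency_ns * clock_ghz


ram_vs_l1 = RAM_NS / L1_NS
print(f"RAM is {ram_vs_l1:.0f}x slower than L1")
print(f"One main memory reference costs ~{cycles_lost(RAM_NS):.0f} cycles")
```

Under that assumption a single cache miss that goes all the way to RAM costs hundreds of cycles in which the pipeline may have nothing to execute.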

CPU = fast
CPU L2 / L3 cache in between
RAM = slow

Tape is dead, disk is tape, flash is disk, RAM locality is king
Jim Gray, 2006
http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt

Just caching all your data in RAM does not give you a modern “in-memory” system!
* Columnar data structures to the rescue!

Row-Major Data Structures
SELECT
SUM(column)
FROM array

Variable field offsets
Memory line (cache line) size = 64 bytes

Columnar Data Structure (conceptual)
Store values of a column next to each other (data locality)
Much less data to scan (or filter) if accessing a subset of columns
Better compression due to adjacent repeating (or slightly differing) values
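The data-locality argument can be made concrete by counting how many 64-byte cache lines a SUM over one column has to touch in each layout. The sketch below is a simplified model, not from the original slides: it assumes fixed-width 4-byte values and a hypothetical row of 40 such columns (160 bytes), whereas real row formats have variable field offsets.

```python
CACHE_LINE = 64  # bytes, the cache line size quoted on the earlier slide


def lines_touched_row_major(n_rows, row_bytes, col_offset, col_bytes=4):
    """Distinct cache lines read when summing one fixed-width column
    from rows stored back to back (row-major layout)."""
    lines = set()
    for r in range(n_rows):
        start = r * row_bytes + col_offset
        end = start + col_bytes - 1
        lines.update(range(start // CACHE_LINE, end // CACHE_LINE + 1))
    return len(lines)


def lines_touched_columnar(n_rows, col_bytes=4):
    """Distinct cache lines read when the column's values are contiguous."""
    return -(-n_rows * col_bytes // CACHE_LINE)  # ceiling division


# Hypothetical table: 10,000 rows of 40 four-byte columns.
row_lines = lines_touched_row_major(n_rows=10_000, row_bytes=160, col_offset=0)
col_lines = lines_touched_columnar(n_rows=10_000)
print(row_lines, col_lines, row_lines / col_lines)
```

In this model every row-major value lands in its own cache line, so the scan pulls 16 times more lines through the memory hierarchy than the columnar scan does.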

Single-Instruction-Multiple-Data (SIMD) processing
• Run an operation (like ADD) on multiple registers/memory locations in a single instruction:
Do the same work with fewer (but more complex) instructions
More concurrency inside CPU
If the underlying data structures “feed” data fast enough …
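Python does not issue SIMD instructions itself, so the following is only a conceptual model of the instruction-count argument (not from the original slides). Each loop iteration stands for one ADD instruction; the 8-lane width is an assumption that corresponds to 32-bit values in a 256-bit register.

```python
def add_scalar(a, b):
    """One simulated ADD instruction per pair of elements."""
    out, ops = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        ops += 1
    return out, ops


def add_vectorized(a, b, lanes=8):
    """One simulated ADD instruction per group of `lanes` adjacent elements."""
    out, ops = [], 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        ops += 1
    return out, ops


a = list(range(20))
b = list(range(20))
scalar_result, scalar_ops = add_scalar(a, b)
vector_result, vector_ops = add_vectorized(a, b)
print(scalar_ops, vector_ops)
```

The results are identical while the simulated instruction count drops, which only pays off on real hardware if the values sit next to each other in memory.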

A database example (Oracle)

A simple Data Retrieval test!
• Retrieve 1% of the rows from an 8 GB table:
SELECT
COUNT(*)
, SUM(order_total)
FROM
orders
WHERE
warehouse_id BETWEEN 500 AND 510
The Warehouse IDs range between 1 and 999
Test data generated by SwingBench tool

Data Retrieval: Test Results
• Remember, this is a very simple scanning + filtering query:
TESTNAME                   PLAN_HASH   ELA_MS   CPU_MS      LIOS  BLK_READ
------------------------- ---------- -------- -------- --------- ---------
test1: index range scan *   16715356   265203    37438    782858    511231
test2: full buffered */ C  630573765   132075    48944   1013913    849316
test3: full direct path *  630573765    15567    11808   1013873   1013850
test4: full smart scan */  630573765     2102      729   1013873   1013850
test5: full inmemory scan  630573765      155      155        14         0
test6: full buffer cache   630573765     7850     7831   1014741         0
Test 5 & Test 6 run entirely from memory
Source: http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action
But why 50x difference in CPU usage?
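The 50x figure comes straight from the CPU_MS column of the two memory-only tests. A short check in Python (not from the original slides; the dictionary keys are shortened test names):

```python
# CPU_MS values copied from the test results table.
cpu_ms = {
    "test5: full inmemory scan": 155,
    "test6: full buffer cache": 7831,
}

# Both tests read zero blocks from disk, so the gap is pure CPU work.
cpu_ratio = cpu_ms["test6: full buffer cache"] / cpu_ms["test5: full inmemory scan"]
print(f"buffer cache scan used {cpu_ratio:.1f}x more CPU time")
```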

CPU & cache friendly data structures are key!
[Figure: layout of a row-format data block: Headers, ITL entries; Row Directory (#0 offset, #1 offset, #2 offset, #3 offset, …); rows #0 to #8, …, each stored as hdr + row; each row = Hdr byte, Lock byte, CC byte, followed by repeated (Col. len, Column data) pairs]
• OLTP: Block->Row->Column format
• 8kB blocks
• Great for writes, changes
• Field-length encoding
• Reading column #100 requires walking through all preceding columns
• Columns (with similar values) not densely packed together
• Not CPU cache friendly for analytics!
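The cost of field-length encoding can be sketched with a toy row format in which every column is stored as a one-byte length followed by its data. This is a simplified illustration, not Oracle's actual on-disk format; `encode_row` and `read_column` are hypothetical helper names.

```python
def encode_row(values):
    """Encode a row as [1-byte length][data] per column (each value < 256 bytes)."""
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        out.append(len(data))
        out += data
    return bytes(out)


def read_column(row, n):
    """Return (value of column n, number of length bytes inspected).

    Column positions are not known up front, so the reader has to hop
    over every preceding column using its length byte.
    """
    pos = 0
    for step in range(n + 1):
        length = row[pos]
        if step == n:
            return row[pos + 1:pos + 1 + length].decode("utf-8"), step + 1
        pos += 1 + length


row = encode_row([f"v{i}" for i in range(100)])
print(read_column(row, 0))
print(read_column(row, 99))
```

Reading the first column inspects one length byte, while reading the last of 100 columns inspects all 100, and that walk is repeated for every row scanned.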

Scanning columnar data structures
[Figure, left panel: Scanning a column in a row-oriented data block (col 1 to col 6 values interleaved row by row)]
[Figure, right panel: Scanning a column in a column-oriented compression unit (values of each column stored together)]
Read filter column(s) first. Access only projected columns if matches found.
Reduced memory traffic. More sequential RAM access, SIMD on adjacent data.

Testing data access path differences on Oracle 12c
SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0
Run the same query on the same dataset stored in different formats/layouts.
Full details: http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles/
Test result data: http://bit.ly/1RitNMr

CPU instructions used for scanning/counting 69M rows

Average CPU instructions per row processed
• Knowing that the table has about 69M rows, I can calculate the average number of instructions issued per row processed

CPU cycles consumed (full scans only)

CPU efficiency (Instructions-per-Cycle)
Yes, modern superscalar CPUs can execute multiple instructions per cycle

Reducing memory writes within SQL execution
• Old approach:
1. Read compressed data chunk
2. Decompress data (write data to temporary memory location)
3. Filter out non-matching rows
4. Return data
• New approach:
1. Read and filter compressed columns
2. Decompress only required columns of matching rows
3. Return data
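The difference between the two approaches can be sketched on a run-length-encoded column. Run-length encoding is an assumption made for this illustration (the slide does not name the compression scheme), and the function names are hypothetical. Each function returns the number of matching rows and the number of values written to temporary memory.

```python
def rle_encode(values):
    """Run-length encode a list into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def count_matches_old(runs, lo, hi):
    """Old approach: decompress everything into a temporary list, then filter."""
    temp = [v for v, n in runs for _ in range(n)]  # temporary memory writes
    matches = sum(1 for v in temp if lo <= v <= hi)
    return matches, len(temp)


def count_matches_new(runs, lo, hi):
    """New approach: evaluate the filter once per run, on the compressed data."""
    matches = sum(n for v, n in runs if lo <= v <= hi)
    return matches, 0  # nothing was decompressed to a temporary location


warehouse_ids = [500] * 3 + [505] * 4 + [900] * 5
runs = rle_encode(warehouse_ids)
print(count_matches_old(runs, 500, 510))
print(count_matches_new(runs, 500, 510))
```

Both approaches return the same count, but the second never materializes the decompressed values, which is where the saved memory writes come from.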

Memory reads & writes during internal processing
Unit = MB
Read only requested columns
Rows counted from chunk headers
Scan compressed data: few memory writes

Spark Examples
• Will use:
• Spark built-in tools
• perf
• Honest Profiler
• FlameGraphs

Apache Spark Tungsten Data Structures
Databricks presentation: http://www.slideshare.net/SparkSummit/deep-dive-into-project-tungsten-josh-rosen
Much denser data structure
Using sun.misc.Unsafe API to bypass JVM object allocator

Apache Spark Tungsten Data Structures
Much denser data structure
“Good memory locality”

Spark test setup (RDD)
[Diagram: CSV -> RDD (partitioned) -> RDD (single partition) -> “For each” sum column X]
import scala.util.Try
// yearIndex (position of the Year column) and log() are assumed to be defined elsewhere in the demo
val lines = sc.textFile("/tmp/simple_data.csv").repartition(1)
val stringFields = lines.map(line => line.split(","))
// Keep only records that have as many fields as the first line
val fullFieldLength = stringFields.first.length
val completeFields = stringFields.filter(fields => fields.length == fullFieldLength)
// Replace the Year string with an Int (0 if it does not parse)
val data = completeFields.map(fields =>
  fields.patch(yearIndex, Array(Try(fields(yearIndex).toInt).getOrElse(0)), 1))
log("cache entire RDD in memory")
data.cache()
log("run map(length).max to populate cache")
println(data.map(r => r.length).reduce((l1, l2) => Math.max(l1, l2)))
.cache().repartition(1)
I wanted to simplify this test as much as possible

“SELECT” sum (Year) from RDD
// SUM all values of “year” column
println(data.map(d => d(yearIndex).asInstanceOf[Int]).reduce((y1, y2) => y1 + y2))
Cached RDD ~1M records, ~40 columns
1-column sum: 0.349 seconds!
17/01/19 18:43:36 INFO DAGScheduler: ResultStage 123 (reduce at demo.scala:89) finished in 0.349 s
17/01/19 18:43:36 INFO DAGScheduler: Job 61 finished: reduce at demo.scala:89, took 0.353754 s

Spark test setup (DataFrame)
[Diagram: CSV -> RDD (partitioned) -> RDD (single partition) -> “For each” sum column X]
import scala.util.Try
// ss (the SparkSession), schema, Row, yearIndex and log() are assumed to be defined elsewhere in the demo
val lines = sc.textFile("/tmp/simple_data.csv").repartition(1)
val stringFields = lines.map(line => line.split(","))
val fullFieldLength = stringFields.first.length
val completeFields = stringFields.filter(fields => fields.length == fullFieldLength)
// Replace the Year string with an Int (0 if it does not parse)
val data = completeFields.map(fields =>
  fields.patch(yearIndex, Array(Try(fields(yearIndex).toInt).getOrElse(0)), 1))
...
// Wrap each record in a Row and attach the schema to get a DataFrame
val dataFrame = ss.createDataFrame(data.map(d => Row(d: _*)), schema)
log("cache entire data-frame in memory")
dataFrame.cache()
log("run map(length).max to populate cache")
println(dataFrame.map(r => r.length).reduce((l1, l2) => Math.max(l1, l2)))
.cache().repartition(1)
DataFrame

“SELECT” sum (Year) from DataFrame (silly example!)
println(dataFrame.map(r => r(yearIndex).asInstanceOf[Int]).reduce((y1, y2) => y1 + y2))
17/01/19 19:39:25 INFO DAGScheduler: ResultStage 29 (reduce at demo.scala:71) finished in 4.664 s
17/01/19 19:39:25 INFO DAGScheduler: Job 14 finished: reduce at demo.scala:71, took 4.673204 s
Cached DataFrame: ~1M records, ~40 columns
1-column SUM: 4.67 seconds! (13x more than RDD?)
This does not make sense!

“SELECT” sum (Year) from DataFrame (proper)
println(dataFrame.agg(sum("Year")).first.get(0))
17/01/19 19:32:02 INFO DAGScheduler: ResultStage 118 (first at demo.scala:70) finished in 0.004 s
17/01/19 19:32:02 INFO DAGScheduler: Job 40 finished: first at demo.scala:70, took 0.041698 s
Cached DataFrame ~1M records, ~40 columns
1-column sum with aggregation pushdown: 0.041 seconds!
(Over 100x faster than the previous silly DataFrame example and 8.5x faster than the first RDD example)
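The speedup figures quoted on the last three slides follow from the job times in the DAGScheduler log lines. A quick check in Python (not from the original slides; variable names are illustrative):

```python
# "Job ... finished ... took N s" values copied from the Spark logs above.
rdd_reduce_s = 0.353754      # RDD map + reduce
df_map_reduce_s = 4.673204   # DataFrame map + reduce (the silly example)
df_agg_s = 0.041698          # DataFrame agg(sum("Year"))

silly_vs_rdd = df_map_reduce_s / rdd_reduce_s
agg_vs_silly = df_map_reduce_s / df_agg_s
agg_vs_rdd = rdd_reduce_s / df_agg_s

print(f"silly DataFrame vs RDD:       {silly_vs_rdd:.1f}x slower")
print(f"agg() vs silly DataFrame:     {agg_vs_silly:.0f}x faster")
print(f"agg() vs RDD:                 {agg_vs_rdd:.1f}x faster")
```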

Summary
• New data structures are required for CPU efficiency!
• Columnar …
• On efficient data structures, efficient code becomes possible
• Bad code still performs badly …
• It is possible to measure the CPU efficiency of your code
• That should come after the usual profiling and DAG / execution plan validation
• All secondary metrics (like efficiency ratios) should be used in context of how much work got done

Future-proof Open Data Formats!
• Disk-optimized columnar data structures
• Apache Parquet
• https://parquet.apache.org/
• Apache ORC
• https://orc.apache.org/
• Memory / CPU-cache optimized data structures
• Apache Arrow
• Not only a storage format
• … also a cross-system/cross-platform IPC framework
• https://arrow.apache.org/

Future
1. RAM gets cheaper + bigger, not necessarily faster
2. CPU caches get larger
3. RAM blends with storage and becomes non-volatile
4. IO subsystems (flash) get even closer to CPUs
5. IO latencies shrink
6. The latency difference between non-volatile storage and volatile RAM shrinks - new database layouts!
7. CPU cache is king – new data structures needed!

The tools used here:
• Honest Profiler by Richard Warburton (@RichardWarburto)
• https://github.com/RichardWarburton/honest-profiler
• Flame Graphs by Brendan Gregg (@brendangregg)
• http://www.brendangregg.com/flamegraphs.html
• Linux perf tool
• https://perf.wiki.kernel.org/index.php/Main_Page
• Spark-Prof demos:
• https://github.com/gluent/spark-prof

References
• Slides & Video of a similar presentation (about Oracle):
• http://www.slideshare.net/tanelp
• https://vimeo.com/gluent
• RAM is the new disk series:
• http://blog.tanelpoder.com/2015/08/09/ram-is-the-new-disk-and-how-to-measure-its-performance-part-1/
• https://docs.google.com/spreadsheets/d/1ss0rBG8mePAVYP4hlpvjqAAlHnZqmuVmSFbHMLDsjaU/

Thanks!
http://gluent.com/
We are hiring developers & data engineers!!!
http://blog.tanelpoder.com
@tanelpoder
