gluent.com 1
In-Memory Execution for Databases
Tanel Poder
a long-time computer performance geek
Intro: About me
• Tanel Põder
• Oracle Database performance geek (18+ years)
• Exadata performance geek
• Linux performance geek
• Hadoop performance geek
• CEO & co-founder of Gluent (instant promotion!)
• Author of the Expert Oracle Exadata book (2nd edition is out now!)
Gluent
[Diagram: Gluent as a data virtualization layer between databases (Oracle, Teradata, MSSQL, NoSQL, big data sources) and applications X, Y, Z – using open data formats!]
Gluent Advisor
1. Analyzes DB storage use and access patterns for safe offloading
2. 500+ databases analyzed
3. 10+ PB analyzed – 81% offloadable
4. 2–24x query speedup
Interested in analyzing your database? http://gluent.com/whitepapers
"Tape is dead, disk is tape, flash is disk, RAM locality is king"
– Jim Gray, 2006
http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt
Seagate Cheetah 15k RPM disk specs
[Spec sheet screenshot: sustained sequential transfer rate up to 200 MB/sec!]
Spinning disk IO throughput
• B-Tree index-walking, disk-based RDBMS
  • 15000 RPM spinning disks
  • ~200 random IOPS per disk
  • ~8 kB read per random IO
  • 8 kB * 200 IOPS = 1.6 MB/sec per disk
  • However, index scans can read only a subset of data
• Full-scanning-based workloads
  • Potentially much more data to access & filter
  • Partition pruning, zonemaps, storage indexes help to skip data [1]
  • Scan only required columns (formats with large chunk sizes)
  • Sequential IO rate up to 200 MB/sec per disk
[1] http://www.dbms2.com/2013/05/27/data-skipping/
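The arithmetic above can be sketched in a few lines of Python (numbers taken from this slide; illustration only):

```python
# Back-of-the-envelope throughput for one 15k RPM disk,
# using the figures from the slide above.
random_iops = 200                                # ~200 random IOPS per disk
io_size_kb = 8                                   # ~8 kB read per random IO
random_mb_s = random_iops * io_size_kb / 1024    # 8 kB * 200 IOPS -> ~1.6 MB/s
sequential_mb_s = 200                            # up to ~200 MB/s sequential per disk

print(f"random IO:  {random_mb_s:.1f} MB/s per disk")
print(f"sequential: {sequential_mb_s} MB/s per disk")
print(f"sequential scan is ~{sequential_mb_s / random_mb_s:.0f}x faster per disk")
```

The ~128x per-disk gap is why full scans (which read everything sequentially) can beat index walks even though they touch far more data.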
Scanning a bunch of spinning disks can keep your CPUs really busy!
* Not even talking about flash or RAM here!
A simple query bottlenecked by CPU
9 GB scanned and processed in 7 seconds:
• ~1300 MB/s across all PX slaves
• ~80 MB/s per slave
A complex query bottlenecked by CPU
Complex query: much more CPU spent on aggregations and joins. 9 GB processed in 1.5 minutes:
• 9 GB / 90 seconds = ~100 MB/s across all PX slaves
• ~6 MB/s per slave
If disks and storage subsystems are getting so fast, why all the buzz around in-memory database systems?
* Can't we just cache the old database files in RAM?
A simple data retrieval test!
• Retrieve 1% of rows out of an 8 GB table:

SELECT
    COUNT(*)
  , SUM(order_total)
FROM
    orders
WHERE
    warehouse_id BETWEEN 500 AND 510

The warehouse IDs range between 1 and 999. Test data generated by the SwingBench tool.
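As a sanity check on the "1%" claim: with warehouse IDs spread over 1–999, the predicate matches 11 distinct values, which works out to roughly 1% of the rows (assuming a uniform distribution):

```python
# Selectivity of the test predicate: warehouse_id BETWEEN 500 AND 510,
# with warehouse ids ranging over 1..999 (per the slide above).
matching_ids = 510 - 500 + 1          # BETWEEN is inclusive: 11 ids match
total_ids = 999
selectivity = matching_ids / total_ids
print(f"{selectivity:.1%} of rows")   # ~1.1%, i.e. the "1% of rows" in the test
```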
Data Retrieval: Test Results
• Remember, this is a very simple scanning + filtering query:

TESTNAME                   PLAN_HASH   ELA_MS   CPU_MS      LIOS  BLK_READ
------------------------- ---------- -------- -------- --------- ---------
test1: index range scan *   16715356   265203    37438    782858    511231
test2: full buffered */ C  630573765   132075    48944   1013913    849316
test3: full direct path *  630573765    15567    11808   1013873   1013850
test4: full smart scan */  630573765     2102      729   1013873   1013850
test5: full inmemory scan  630573765      155      155        14         0
test6: full buffer cache   630573765     7850     7831   1014741         0

Tests 5 & 6 both run entirely from memory – but why the ~50x difference in CPU usage between them?
Source: http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action
"Tape is dead, disk is tape, flash is disk, RAM locality is king"
– Jim Gray, 2006
http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt
Latency	Numbers	Every	Programmer	Should	Know
Latency Comparison Numbers
--------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache,
200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD,
4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter
roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory,
20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
Source:	
https://gist.github.com/jboner/2841832
CPU = fast. RAM = slow. The CPU L2/L3 caches sit in between.
RAM access is the bottleneck of modern computers
Waits for RAM access show up as CPU usage in monitoring tools.
Want to wait less? Do it less!
CPU & cache friendly data structures are key!
[Diagram: classic data block layout – headers and ITL entries, a row directory of row offsets, and each row stored as a header byte, lock byte, column-count byte, then a column-length byte followed by column data for every column]
• OLTP: Block -> Row -> Column format
  • 8 kB blocks
  • Great for writes and changes
• Field-length encoding
  • Reading column #100 requires walking through all preceding columns
• Columns (with similar values) not densely packed together
• Not CPU cache friendly for analytics!
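The column-walking cost can be illustrated with a toy length-prefixed row format in Python. This is a deliberately simplified model (one length byte per column), not Oracle's actual block layout:

```python
def read_column(row: bytes, col_no: int) -> bytes:
    """Read column `col_no` (0-based) from a length-prefixed row.

    Simplified model of a row-major format: each column is stored as a
    1-byte length followed by the column data, so locating column N
    means stepping over columns 0..N-1 first -- O(N) pointer chasing
    through non-adjacent bytes, which is unfriendly to CPU caches.
    """
    offset = 0
    for _ in range(col_no):
        col_len = row[offset]          # length byte of the current column
        offset += 1 + col_len          # skip length byte + column data
    col_len = row[offset]
    return row[offset + 1 : offset + 1 + col_len]

# A row with three columns: b"AB", b"CDE", b"F"
row = bytes([2]) + b"AB" + bytes([3]) + b"CDE" + bytes([1]) + b"F"
print(read_column(row, 2))   # b'F' -- found only after skipping columns 0 and 1
```

In a columnar layout, by contrast, column N lives in its own contiguous array and is addressed directly.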
Scanning columnar data structures
[Diagram: scanning a column in a row-oriented data block touches every row's columns; scanning the same column in a column-oriented compression unit touches only that column's contiguous values]
• Read filter column(s) first. Access only projected columns if matches are found.
• Reduced memory traffic. More sequential RAM access, SIMD on adjacent data.
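A minimal Python sketch of the filter-column-first scan described above. The data is hypothetical, and plain lists stand in for the contiguous column vectors a real engine would scan with SIMD:

```python
# Columnar scan sketch: scan the filter column first, then touch the
# projected column only at the matching positions.
warehouse_id = [500, 123, 505, 900, 510]     # filter column (contiguous)
order_total  = [10.0, 99.0, 25.0, 7.0, 3.5]  # projected column (contiguous)

# Pass 1: sequential scan of the filter column only, collecting row positions.
matches = [i for i, w in enumerate(warehouse_id) if 500 <= w <= 510]

# Pass 2: fetch the projected column only for the matching rows.
total = sum(order_total[i] for i in matches)
print(matches, total)   # [0, 2, 4] 38.5
```

The other columns of non-matching rows are never read at all, which is where the reduced memory traffic comes from.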
How	to	measure this	stuff?
CPU	Performance	Counters	on	Linux
# perf stat -d -p PID sleep 30
Performance counter stats for process id '34783':
27373.819908 task-clock # 0.912 CPUs utilized
86,428,653,040 cycles # 3.157 GHz
32,115,412,877 instructions # 0.37 insns per cycle
# 2.39 stalled cycles per insn
7,386,220,210 branches # 269.828 M/sec
22,056,397 branch-misses # 0.30% of all branches
76,697,049,420 stalled-cycles-frontend # 88.74% frontend cycles idle
58,627,393,395 stalled-cycles-backend # 67.83% backend cycles idle
256,440,384 cache-references # 9.368 M/sec
222,036,981 cache-misses # 86.584 % of all cache refs
234,361,189 LLC-loads # 8.562 M/sec
218,570,294 LLC-load-misses # 93.26% of all LL-cache hits
18,493,582 LLC-stores # 0.676 M/sec
3,233,231 LLC-store-misses # 0.118 M/sec
7,324,946,042 L1-dcache-loads # 267.589 M/sec
305,276,341 L1-dcache-load-misses # 4.17% of all L1-dcache hits
36,890,302 L1-dcache-prefetches # 1.348 M/sec
30.000601214 seconds time elapsed
Measure what's going on inside a CPU! Metrics explained in my blog entry: http://bit.ly/1PBIlde
Testing data access path differences on Oracle 12c

SELECT COUNT(cust_valid)
FROM   customers_nopart c
WHERE  cust_id > 0

Run the same query on the same dataset stored in different formats/layouts.
Full details: http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles/
Test result data: http://bit.ly/1RitNMr
CPU instructions used for scanning/counting 69M rows
Average CPU instructions per row processed
• Knowing that the table has about 69M rows, I can calculate the average number of instructions issued per row processed.
CPU cycles consumed (full scans only)
CPU efficiency (instructions per cycle)
Yes, modern superscalar CPUs can execute multiple instructions per cycle.
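Using the `perf stat` numbers shown earlier in the deck, the instructions-per-cycle figure falls out directly:

```python
# Instructions-per-cycle from the `perf stat` output earlier in the deck.
instructions = 32_115_412_877   # "instructions" counter
cycles       = 86_428_653_040   # "cycles" counter

ipc = instructions / cycles
print(f"IPC = {ipc:.2f}")   # 0.37 -- matches perf's "0.37 insns per cycle"
```

An IPC well below 1 on a core that can retire several instructions per cycle is the signature of memory-stall-bound code: the ~89% frontend-idle figure in the same output tells the same story.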
Reducing memory writes within SQL execution
• Old approach:
  1. Read compressed data chunk
  2. Decompress data (write data to a temporary memory location)
  3. Filter out non-matching rows
  4. Return data
• New approach:
  1. Read and filter compressed columns
  2. Decompress only required columns of matching rows
  3. Return data
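The new approach can be sketched with a dictionary-encoded column in Python. The data is hypothetical, and dictionary encoding stands in for whatever compression scheme the engine actually uses:

```python
# Filter a dictionary-encoded column without decompressing it.
dictionary = ["shipped", "pending", "cancelled"]   # distinct column values
codes = [0, 1, 0, 2, 1, 0]                         # compressed column: one small code per row

# Filter on the compressed form: translate the predicate value to a code
# once, then compare small integers -- no decompression, no temporary
# buffer of decompressed rows written to memory.
target = dictionary.index("pending")
matches = [i for i, c in enumerate(codes) if c == target]

# Decompress only the matching rows, and only now.
print([dictionary[codes[i]] for i in matches])   # ['pending', 'pending']
```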
Memory reads & writes during internal processing
[Chart, unit = MB: reading only requested columns, counting rows from chunk headers, and scanning compressed data all show few memory writes]
Past & Future
Some commercial column store history
• Disk-optimized column stores
  • Expressway 103 / Sybase IQ (early '90s)
  • MonetDB (early '90s)
  • Oracle Hybrid Columnar Compression (disk/OLTP optimized)
  • …
• Memory-optimized column stores
  • …
  • SAP HANA (December 2010)
  • IBM DB2 with BLU Acceleration (June 2013)
  • Oracle Database 12c with In-Memory Option (July 2014)
  • …
* Not addressing memory-optimized OLTP / row stores here
Future-proof open data formats!
• Disk-optimized columnar data structures
  • Apache Parquet: https://parquet.apache.org/
  • Apache ORC: https://orc.apache.org/
• Memory / CPU-cache optimized data structures
  • Apache Arrow: https://arrow.apache.org/
    • Not only a storage format…
    • … also a cross-system/cross-platform IPC communication framework
Future
1. RAM gets cheaper and bigger, not necessarily faster
2. CPU caches get larger
3. RAM blends with storage and becomes non-volatile
4. IO subsystems (flash) get even closer to CPUs
5. IO latencies shrink
6. The latency difference between non-volatile storage and volatile RAM shrinks – new database layouts!
7. CPU cache is king – new data structures needed!
References
• Slides & video of this presentation:
  • http://www.slideshare.net/tanelp
  • https://vimeo.com/gluent
• Index range scans vs full scans:
  • http://blog.tanelpoder.com/2014/09/17/about-index-range-scans-disk-re-reads-and-how-your-new-car-can-go-600-miles-per-hour/
• RAM is the new disk series:
  • http://blog.tanelpoder.com/2015/08/09/ram-is-the-new-disk-and-how-to-measure-its-performance-part-1/
  • https://docs.google.com/spreadsheets/d/1ss0rBG8mePAVYP4hlpvjqAAlHnZqmuVmSFbHMLDsjaU/
Thanks!
http://gluent.com/whitepapers
We are hiring developers & data engineers!!!
http://blog.tanelpoder.com
tanel@tanelpoder.com
@tanelpoder

GNW01: In-Memory Processing for Databases