Column Stores and Google BigQuery
Cloud
Presented By:
Csaba Toth
Csaba Technology Services LLC
GDG Fresno meeting
October 1, 2015, Fresno, California
Disclaimer
Disclaimer – cont.
Goal / wish
• Being able to issue queries
• Preferably SQL style
• Over Big data
• With as small a response time as possible
• Plus: through a web interface (no need to
install anything)
• Plus: capability to visualize
Agenda
• Big Data
• Brief look at Hadoop, HIVE and Spark
• OLAP and OLTP
• Row based data store vs. Column data store
• Google BigQuery
• Demo
Big Data
Wikipedia: “collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or
traditional data processing applications”
Examples: (Wikibon - A Comprehensive List of Big Data Statistics)
• 100 Terabytes of data is uploaded to Facebook every day
• Facebook Stores, Processes, and Analyzes more than 30 Petabytes of user
generated data
• Twitter generates 12 Terabytes of data every day
• LinkedIn processes and mines Petabytes of user data to power the "People You May
Know" feature
• YouTube users upload 48 hours of new video content every minute of the day
• Decoding of the human genome used to take 10 years. Now it can be done in 7 days
Big Data
Three Vs: Volume, Velocity, Variety
Sources:
• Science, Sensors, Social networks, Log files
• Public Data Stores, Data warehouse appliances
• Network and in-stream monitoring technologies
• Legacy documents
Main problems:
• Storage Problem
• Money Problem
• Consuming and processing the data
Hadoop
• Hadoop is an open-source software
framework that supports data-intensive
distributed applications
• A Hadoop cluster is composed of a single
master node and multiple worker nodes
Little Hadoop history
“The Google File System” - October 2003
• http://labs.google.com/papers/gfs.html – describes a scalable,
distributed, fault-tolerant file system tailored for data-intensive
applications running on inexpensive commodity hardware, delivering
high aggregate performance
“MapReduce: Simplified Data Processing on Large
Clusters” - April 2004
• http://queue.acm.org/detail.cfm?id=988408 – describes a
programming model and an implementation for processing large
data sets.
Hadoop
Has two main services:
1. Storing large amounts of data: HDFS – Hadoop
Distributed File System
2. Processing large amounts of data:
implementing the MapReduce programming
model
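The MapReduce programming model can be sketched as a toy, in-process word count (illustrative only; real Hadoop distributes the map and reduce phases across the cluster and handles shuffling, fault tolerance, and disk IO):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = reduce_phase(map_phase(docs))
print(word_counts["the"])  # 3
```

The same two functions, run over partitions of the input on many worker nodes, are essentially what a Hadoop job submits to the cluster.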
HDFS
[Diagram] A single name node holds the metadata store; data nodes on Node 1, Node 2, and Node 3 each store replicated blocks (Block A, Block B).
Job / task management
[Diagram] The jobtracker on the name node communicates with tasktrackers on the data nodes via heartbeat signals; each tasktracker runs map and reduce tasks (Map 1–3, Reduce 1–3).
Hadoop / RDBMS / Docum.
Hadoop / MapReduce vs. RDBMS vs. Document stores:
• Size of data: Petabytes / Gigabytes / Gigabytes+
• Integrity of data: Low / High (referential, typed) / Low to intermediate
• Data schema: Dynamic / Static / Dynamic
• Access method: Batch / Interactive and batch / Interactive and batch
• Scaling: Linear / Nonlinear (worse than linear) / Better than RDBMS
• Data structure: Unstructured / Structured / Unstructured or semi-structured
• Normalization of data: Not required / Required / Not or somewhat required
• Query response time: Has latency (due to batch processing) / Can be near immediate / Can be near immediate
Apache Hive
[Architecture diagram] Storage Layer (HDFS) → Computing Layer (MapReduce) → Advanced Query Engine (Hive, Pig). Log data and RDBMS sources enter through a Data Integration Layer (Flume, Sqoop); alongside sit Data Mining (Pegasus, Mahout), Index/Search (Lucene), DB drivers (Hive driver), and a Web Browser (JS) front end.
Hadoop architecture
http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
Apache Hive UI
Apache Hive UI
Hadoop distributions
Beyond Apache Hive
Goals: decrease latency
Technologies which help:
• YARN: next generation Hadoop
• Hadoop distribution specific: e.g. Cloudera
Impala
• Apache Spark
Beyond Apache Hive
• YARN: improves Hadoop performance in
many respects (resource management and
allocation, …)
• Impala: Cloudera’s MPP SQL Query engine,
based on Hadoop
• Spark: cluster computing framework with
multi-stage in-memory primitives
Apache Spark
• Open Source
• In contrast to Hadoop’s two-stage disk-
based MapReduce paradigm, multi-stage in-
memory primitives can provide up to 100x
performance increase
• It can work over HDFS
Spark and Hadoop
http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
Spark and Hadoop
OLAP vs OLTP
Each aspect compares OLTP - Online Transaction Processing (Operational System) vs. OLAP - Online Analytical Processing (Data Warehouse):
• Source of data: operational data, the original source / consolidated data from various sources
• Purpose of data: to control and run fundamental business tasks / to help with planning, problem solving, and decision support
• Goal of operations: retrieve or modify individual records (mostly a few records) / derive new information from existing data (aggregates, transformations, calculations)
• Queries: often triggered by end-user actions and should complete instantly / often run over many records or the complete data set
• Read/Write: mixed read/write workload / mainly read or even read-only workload
• RAM: working set should fit in RAM / data set may easily exceed the size of RAM
OLAP vs OLTP
OLTP (Operational System) vs. OLAP (Data Warehouse), continued:
• ACID properties: may be important / often not important, since data can often be regenerated
• Interactivity: queries often triggered by end-user actions and should complete instantly / queries often run interactively
• Indexing: indexes are used to quickly find relevant records / the interesting aspects are commonly not known in advance, so pre-indexing "relevant" columns is difficult
• DB design: often highly normalized with many tables / typically denormalized with fewer tables; use of star and/or snowflake schemas
Storing data: row stores
• Traditional RDBMS and often the document
stores are row oriented too
• The engine always stores and retrieves
entire rows from disk (unless indexes help)
• Row is a collection of column values
together
• Rows are materialized on disk
Row stores
All columns of a row are stored together on disk:
id | scientist | death_by    | movie_name
 1 | Reinhardt | Maximillian | The Black Hole
 2 | Tyrell    | Roy Batty   | Blade Runner
 3 | Hammond   | Dinosaur    | Jurassic Park
 4 | Soong     | Lore        | Star Trek: TNG
 5 | Morbius   | His mind    | Forbidden Planet
 6 | Dyson     | Skynet      | Terminator 2: Judgment Day
Row stores
Performs best when a small number of rows is accessed:
select * from the_table where id = 6
id | scientist | death_by    | movie_name
 1 | Reinhardt | Maximillian | The Black Hole
 2 | Tyrell    | Roy Batty   | Blade Runner
 3 | Hammond   | Dinosaur    | Jurassic Park
 4 | Soong     | Lore        | Star Trek: TNG
 5 | Morbius   | His mind    | Forbidden Planet
 6 | Dyson     | Skynet      | Terminator 2: Judgment Day
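The lookup above can be pictured as scanning a sequence of complete records: in a row store, each record's column values live together (a toy sketch, not any particular engine's storage format):

```python
# Row storage: all of a record's column values sit together,
# mimicking the (id, scientist, death_by, movie_name) table
rows = [
    (1, "Reinhardt", "Maximillian", "The Black Hole"),
    (2, "Tyrell", "Roy Batty", "Blade Runner"),
    (3, "Hammond", "Dinosaur", "Jurassic Park"),
    (4, "Soong", "Lore", "Star Trek: TNG"),
    (5, "Morbius", "His mind", "Forbidden Planet"),
    (6, "Dyson", "Skynet", "Terminator 2: Judgment Day"),
]

# select * from the_table where id = 6 touches a single record
result = next(r for r in rows if r[0] == 6)
print(result[3])  # Terminator 2: Judgment Day
```

With an index on id, a real engine would jump straight to that one record instead of scanning, which is why this access pattern suits row stores so well.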
Row stores
• Not so great for wide rows
• If only a small subset of columns is queried,
reading the entire row wastes IO
Row stores
Bad-case scenario:
• select sum(bigint_column) from table
• A million rows in the table
• Average row length is 1 KiB
• The query needs only one bigint column (8 bytes per row)
• But the entire row must be read
• Reads ~1 GiB of data for ~8 MiB of column data
Column stores
• Data is organized by columns instead of
rows
• Rows are often not materialized during
storage; they exist only in memory
• Each row still has some sort of “row id”
Column stores
• A row is a collection of column values that are
associated with one another
• Associated: every row has some type of “row
id“
• Can still produce row output (though
assembling a row may be complex under the
hood)
Column store
Stores each COLUMN on disk
id
1
2
3
4
5
6
title
Mrs. Doubtfire
The Big Lebowski
The Fly
Steel Magnolias
The Birdcage
Erin Brockovich
actor
Robin Williams
Jeff Bridges
Jeff Goldblum
Dolly Parton
Nathan Lane
Julia Roberts
genre
Comedy
Comedy
Horror
Drama
Comedy
Drama
Row ids (row id = 1 … row id = 6) tie the column values back together. The natural order may be unusual; each column has its own file or segment on disk.
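The column layout above amounts to parallel arrays sharing an implicit row id (an illustrative sketch, not any particular engine's format):

```python
# Columnar storage: one array per column; the position is the implicit row id
columns = {
    "id":    [1, 2, 3, 4, 5, 6],
    "title": ["Mrs. Doubtfire", "The Big Lebowski", "The Fly",
              "Steel Magnolias", "The Birdcage", "Erin Brockovich"],
    "actor": ["Robin Williams", "Jeff Bridges", "Jeff Goldblum",
              "Dolly Parton", "Nathan Lane", "Julia Roberts"],
    "genre": ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"],
}

def materialize_row(row_id):
    # Reassembling a row costs one lookup per column file
    return {name: values[row_id - 1] for name, values in columns.items()}

print(materialize_row(1)["title"])  # Mrs. Doubtfire

# A scan over one column touches only that column's data
comedies = sum(g == "Comedy" for g in columns["genre"])
print(comedies)  # 3
```

This is why producing full-row output is possible but comparatively expensive, while single-column aggregates are cheap.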
Column stores
• Column compression can be way more efficient
than row compression or compression available
for row stores (sometimes 10:1 to 30:1 ratio)
• Compression: RLE, Integer packing,
dictionaries and lookup, other…
• Reduces both storage and IO (thus response
time)
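Run-length encoding, the first scheme listed, works especially well on sorted or low-cardinality columns such as a genre column. A minimal sketch:

```python
from itertools import groupby

def rle_encode(values):
    # Store each run as (value, run_length) instead of repeating the value
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

genre = ["Comedy"] * 4 + ["Drama"] * 3 + ["Horror"]
encoded = rle_encode(genre)
print(encoded)  # [('Comedy', 4), ('Drama', 3), ('Horror', 1)]
assert rle_decode(encoded) == genre
```

On a sorted million-row genre column with a handful of distinct values, a few (value, count) pairs replace a million entries, which is where ratios like 10:1 to 30:1 come from.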
Column stores
Best-case scenario:
• select sum(bigint_column) from table
• A million rows in the table
• Average row length is 1 KiB
• The query needs only one bigint column (8 bytes per row)
• Only that single column is read from disk
• Reads ~8 MiB of column data, even less with compression
Column stores
Bad-case scenario:
select *
from long_wide_table
where order_line_id = 34653875;
• Accessing all columns saves nothing and can
be even more expensive than a row store
• Not ideal for tables with few columns
Column stores
Updating and deleting rows is expensive
• Some column stores are append only
• Others just strongly discourage writes
• Some split storage into row and column
areas
Row/Column - OLTP/OLAP
Row stores are good fit for OLTP
• Reading small portions of a table, but often
many of the columns
• Frequent changes to data
• Small (<2 TB) amount of data (typically the
working set must fit in RAM)
• "Nested loops" joins are good fit for OLTP
Row/Column - OLTP/OLAP
Column stores are good fit for OLAP
Read large portions of a table in terms of rows,
but often a small number of columns
Batch loading / updates
Big data (50TB-100TB per machine):
• Compression capabilities come in handy
• Machine generated data is well suited
Column / Row stores
• RDBMS provide ACID capabilities
• Row stores mainly use tree style indexes
• B-tree derivative index structure provides very
fast binary search as long as it fits into memory
• Very large datasets end up unmanageably big
indexes
• Column stores: bitmap indexing (very
expensive to update)
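A bitmap index keeps one bit vector per distinct column value, so predicates become bitwise AND/OR over the vectors. A toy sketch using Python ints as bit vectors:

```python
from collections import defaultdict

def build_bitmap_index(column):
    # One bitmap per distinct value; bit r is set if row r holds that value
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return index

genre = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]
idx = build_bitmap_index(genre)

# OR over bitmaps answers "Comedy or Drama" without touching column data
comedy_or_drama = idx["Comedy"] | idx["Drama"]
matching_rows = [r for r in range(len(genre)) if comedy_or_drama >> r & 1]
print(matching_rows)  # [0, 1, 3, 4, 5]
```

The flip side is visible here too: inserting or changing one row means touching the bitmap of every affected value, which is why updates are expensive.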
BigQuery
• A web service that enables interactive
analysis of massive datasets
• Based on Dremel, a scalable, interactive ad
hoc query system for analysis of read-only
nested data
• Works in conjunction with Google Storage
• Has a RESTful web service interface
BigQuery
• You can issue SQL queries over big data
• Interactive web interface
• Can visualize results too
• Response times as small as possible
• Auto scales under the hood
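Since the interface is RESTful, a query is just an HTTP POST to the jobs.query endpoint. The sketch below only constructs the request body — the project id is a placeholder, actually sending it would require OAuth credentials, and the table reference uses the 2015-era legacy SQL bracket syntax:

```python
import json

PROJECT_ID = "my-project"  # placeholder; substitute a real project id

# Body for POST https://www.googleapis.com/bigquery/v2/projects/{PROJECT_ID}/queries
request_body = {
    "query": (
        "SELECT corpus, COUNT(*) AS n "
        "FROM [publicdata:samples.shakespeare] "
        "GROUP BY corpus ORDER BY n DESC LIMIT 5"
    ),
    "timeoutMs": 10000,
}

payload = json.dumps(request_body)
print(json.loads(payload)["timeoutMs"])  # 10000
```

The web UI demoed next issues essentially the same request under the hood and renders the response rows.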
Demo!
Thank you!
Questions?
Resources
• Slides: http://www.slideshare.net/tothc
• Contact: http://www.meetup.com/CCalJUG/
• Csaba Toth: Introduction to Hadoop and MapReduce -
http://www.slideshare.net/tothc/introduction-to-hadoop-and-map-
reduce
• Justin Swanhart: Introduction to column stores -
http://www.slideshare.net/MySQLGeek/intro-to-column-stores
• Jan Steemann: Column-oriented databases -
http://www.slideshare.net/arangodb/introduction-to-column-
oriented-databases
Resources
• https://anonymousbi.wordpress.com/2012/11/02/hadoop-
beginners-tutorial-on-ubuntu/
• https://www.capgemini.com/blog/capping-it-off/2012/01/what-is-
hadoop
• http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
• https://www.cloudera.com/content/cloudera/en/documentation/core/
latest/PDF/cloudera-impala.pdf
• https://www.keithrozario.com/2012/07/google-bigquery-wikipedia-
dataset-malaysia-singapore.html
• https://cloud.google.com/bigquery/web-ui-quickstart
• https://cloud.google.com/bigquery/query-reference