Data Lake
From Bare Metal to the Clouds
Ido Friedman
IdoFriedman.yml
Name: Ido Friedman
Past: [Data platform consultant, Instructor, Team leader]
Present: [Data engineer, Architect]
Technologies: [Elasticsearch, Couchbase, MongoDB, Python, Hadoop, SQL, and more …]
WorkPlace: Perion
WhenNotWorking: @Sea
Data lake
The idea of a data lake is to have a single store of all data in the enterprise, ranging from raw data (an exact copy of the source system data) to transformed data used for various tasks, including reporting, visualization, analytics, and machine learning.
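A minimal sketch of what that single store can look like on cloud object storage, assuming a hypothetical bucket with raw/ and transformed/ prefixes; the convention matters more than the names:

```python
from google.cloud import storage

# Hypothetical layout for a lake on object storage:
#   raw/          - exact copies of source system data
#   transformed/  - cleaned, modeled data for reporting, analytics, ML
client = storage.Client()

for zone in ("raw/", "transformed/"):
    print(f"-- {zone}")
    # Bucket name is illustrative; list everything under each zone prefix.
    for blob in client.list_blobs("my-data-lake", prefix=zone):
        print(blob.name, blob.size)
```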
Goals?
Raw data
What is it?
Who needs it?
How can we access it?
What can I get from it?
How long do you keep it?
Traditional tools of the trade
SQL
What changed?
Hadoop 1.0 was released in 2011, based on a Google white paper from 2004.
Data locality!
What changed?
2013 … 2015
Data locality! What does it mean?
Moving the computation to the node that already stores the data, instead of moving the data over the network to the computation.
What changed?
×3 = $24K
SQL Won!
What changed?
Here to stay
What changed?
The consumers
Cloud storage cost
[Chart: cloud storage price, $/GB, falling over time on a $0.00–$0.16 scale]
Google BigQuery cuts historical data storage cost in half and
accelerates many queries by 10x
What changed?
Conclusion
A large data lake on the cloud is possible
Deployment Options
Bare Metal → Cloud IaaS (Compute Engine) → HadooPaaS (Dataproc) → DB/WHaaS (Bigtable)

Compute Engine
• Full control over the Hadoop distribution and ecosystem
• Will support any weird situation you need
• Not much less work than an on-premises deployment
• Hard to make it pay-per-use
Moving parts counter – Very high
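To make the "not much less work" point concrete, here is a sketch of provisioning a single node with the google-cloud-compute Python client; the project, zone, image, and names are hypothetical, and everything Hadoop-related still happens after this, by hand:

```python
from google.cloud import compute_v1

# Hypothetical project/zone; the point is how much is still on you:
# after this VM boots you install, configure, secure, and operate
# Hadoop yourself, exactly like on-premises, minus the hardware.
project, zone = "my-project", "us-central1-a"

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
        disk_size_gb=500,
    ),
)
instance = compute_v1.Instance(
    name="hadoop-worker-1",
    machine_type=f"zones/{zone}/machineTypes/n1-standard-4",
    disks=[boot_disk],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)
compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
).result()  # one of many VMs; repeat per node, then install Hadoop on each
```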
Dataproc
• Hadoop as a Service
• Some DevOps and administration effort
• Limited choice of Hadoop deployments
• Easy to make it pay-per-use
Moving parts counter – Low
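"Hadoop as a Service" in practice: a sketch using the google-cloud-dataproc client, with hypothetical project, region, and sizing. One call creates a working cluster; deleting it when the job finishes is what makes pay-per-use easy:

```python
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"  # hypothetical
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# One API call replaces racking machines and installing Hadoop/Spark.
cluster = {
    "project_id": project,
    "cluster_name": "taxi-rides",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()  # blocks until the cluster is running

# Pay-per-use: tear the cluster down when the job is done.
client.delete_cluster(
    request={"project_id": project, "region": region, "cluster_name": "taxi-rides"}
).result()
```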
Bigtable
• 0 DevOps and administration
• No Hadoop ecosystem
• Structured data support only
• Pay-per-use by design
Moving parts counter – None *
* None that you care about
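"0 DevOps" in practice: a sketch with the google-cloud-bigtable client, using hypothetical project, instance, table, and column-family names (assumed to already exist). There is no cluster lifecycle in the code path at all, just writes and reads of structured rows:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")        # hypothetical project
table = client.instance("my-instance").table("rides")  # assumed to exist

# Write: nothing to provision or tune first.
row = table.direct_row(b"ride#2016-01-01#0001")
row.set_cell("stats", "fare", b"12.50")  # "stats" family assumed to exist
row.commit()

# Read it back.
fetched = table.read_row(b"ride#2016-01-01#0001")
print(fetched.cells["stats"][b"fare"][0].value)
```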
How do you choose?
Map your tools and your data onto the three options (Compute Engine, Dataproc, Bigtable):
• Tools: Hadoop ecosystem vs. SQL
• Data: structured vs. unstructured
The Big Question
Performance
What affects performance?
• Code: 40%
• CPU: 30%
• IO: 20%
• Network: 10%
• Code: you can change it
• CPU: usually slower per core
• IO: usually better
• Network: usually better
Give me some numbers
A Billion Taxi Rides – on …

                          BigQuery        Dataproc + Presto
Load time                 25 min          3 hours*
Simple aggregation time   2 sec           44 sec
Compute cost              ~$0.07/query    $1.14/hour
Storage cost              $12.60/month    $5.20/month
* convert to ORC

Numbers by http://tech.marksblogg.com/
https://cloud.google.com/blog/big-data/2016/05/bigquery-and-dataproc-shine-in-independent-big-data-platform-comparison
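For scale, a sketch of what the "simple aggregation" side of that comparison looks like on BigQuery; the table reference and query are assumptions standing in for wherever the billion taxi rides are loaded:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# Illustrative version of the benchmark's simple aggregation;
# the table name is hypothetical, adapt it to where the trips live.
query = """
    SELECT cab_type, COUNT(*) AS rides
    FROM `my-project.taxi.trips`
    GROUP BY cab_type
"""
for row in client.query(query).result():
    print(row.cab_type, row.rides)
```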
Summary
No magic solutions – Test your assumptions
Always understand your data and needs
Invest the time in modeling and optimization
What are we doing?
Hadoop on bare metal
Data lake – On-Premises vs. Cloud