Data Lake
From Bare Metal to the Clouds
Ido Friedman
IdoFriedman.yml
Name: Ido Friedman
Past: [Data platform consultant, Instructor, Team leader]
Present: [Data engineer, Architect]
Technologies: [Elasticsearch, Couchbase, MongoDB, Python, Hadoop, SQL, and more …]
WorkPlace: Perion
WhenNotWorking: @Sea
Data lake
The idea of a data lake is to have a single store of all data in the enterprise, ranging from raw data (an exact copy of the source system data) to transformed data used for various tasks, including reporting, visualization, analytics, and machine learning.
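A minimal sketch of what that single store can look like on cloud object storage, assuming a hypothetical bucket with raw/ and transformed/ prefixes; the convention matters more than the names:

```python
from google.cloud import storage

# Hypothetical layout for a lake on object storage:
#   raw/          - exact copies of source system data
#   transformed/  - cleaned, modeled data for reporting, analytics, ML
client = storage.Client()

for zone in ("raw/", "transformed/"):
    print(f"-- {zone}")
    # Bucket name is illustrative; list everything under each zone prefix.
    for blob in client.list_blobs("my-data-lake", prefix=zone):
        print(blob.name, blob.size)
```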
Goals?
Raw data
What is it?
Who needs it?
How can we access it?
What can I get from it?
How long do you keep it?
Traditional tools of the trade
SQL
What changed?
Hadoop 1.0 was released in 2011, based on a Google white paper from 2004.
Data locality!
What changed?
2013 … 2015
Data locality! What does it mean?
Moving the computation to the node that already stores the data, instead of moving the data over the network to the computation.
What changed?
×3 = $24K
SQL Won!
What changed?
Here to stay
What changed?
The consumers
Cloud storage cost
[Chart: cloud storage price, $/GB, falling over time on a $0.00–$0.16 scale]
Google BigQuery cuts historical data storage cost in half and
accelerates many queries by 10x
What changed?
Conclusion
A large data lake on the cloud is possible
Deployment Options
Bare Metal → Cloud IaaS (Compute Engine) → HadooPaaS (Dataproc) → DB/WHaaS (Bigtable)

Compute Engine
• Full control over the Hadoop distribution and ecosystem
• Will support any weird situation you need
• Not much less work than an on-premises deployment
• Hard to make it pay-per-use
Moving parts counter – Very high
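To make the "not much less work" point concrete, here is a sketch of provisioning a single node with the google-cloud-compute Python client; the project, zone, image, and names are hypothetical, and everything Hadoop-related still happens after this, by hand:

```python
from google.cloud import compute_v1

# Hypothetical project/zone; the point is how much is still on you:
# after this VM boots you install, configure, secure, and operate
# Hadoop yourself, exactly like on-premises, minus the hardware.
project, zone = "my-project", "us-central1-a"

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
        disk_size_gb=500,
    ),
)
instance = compute_v1.Instance(
    name="hadoop-worker-1",
    machine_type=f"zones/{zone}/machineTypes/n1-standard-4",
    disks=[boot_disk],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)
compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
).result()  # one of many VMs; repeat per node, then install Hadoop on each
```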
Dataproc
• Hadoop as a Service
• Some DevOps and administration effort
• Limited choice of Hadoop deployments
• Easy to make it pay-per-use
Moving parts counter – Low
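"Hadoop as a Service" in practice: a sketch using the google-cloud-dataproc client, with hypothetical project, region, and sizing. One call creates a working cluster; deleting it when the job finishes is what makes pay-per-use easy:

```python
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"  # hypothetical
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# One API call replaces racking machines and installing Hadoop/Spark.
cluster = {
    "project_id": project,
    "cluster_name": "taxi-rides",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()  # blocks until the cluster is running

# Pay-per-use: tear the cluster down when the job is done.
client.delete_cluster(
    request={"project_id": project, "region": region, "cluster_name": "taxi-rides"}
).result()
```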
Bigtable
• 0 DevOps and administration
• No Hadoop ecosystem
• Structured data support only
• Pay-per-use by design
Moving parts counter – None *
* None that you care about
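"0 DevOps" in practice: a sketch with the google-cloud-bigtable client, using hypothetical project, instance, table, and column-family names (assumed to already exist). There is no cluster lifecycle in the code path at all, just writes and reads of structured rows:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")        # hypothetical project
table = client.instance("my-instance").table("rides")  # assumed to exist

# Write: nothing to provision or tune first.
row = table.direct_row(b"ride#2016-01-01#0001")
row.set_cell("stats", "fare", b"12.50")  # "stats" family assumed to exist
row.commit()

# Read it back.
fetched = table.read_row(b"ride#2016-01-01#0001")
print(fetched.cells["stats"][b"fare"][0].value)
```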
How do you choose?
Map your tools and your data onto the three options (Compute Engine, Dataproc, Bigtable):
• Tools: Hadoop ecosystem vs. SQL
• Data: structured vs. unstructured
The Big Question
Performance
What affects performance?
• Code: 40%
• CPU: 30%
• IO: 20%
• Network: 10%
• Code: you can change it
• CPU: usually slower per core
• IO: usually better
• Network: usually better
Give me some numbers
A Billion Taxi Rides – on …

                          BigQuery        Dataproc + Presto
Load time                 25 min          3 hours*
Simple aggregation time   2 sec           44 sec
Compute cost              ~$0.07/query    $1.14/hour
Storage cost              $12.60/month    $5.20/month
* convert to ORC

Numbers by http://tech.marksblogg.com/
https://cloud.google.com/blog/big-data/2016/05/bigquery-and-dataproc-shine-in-independent-big-data-platform-comparison
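For scale, a sketch of what the "simple aggregation" side of that comparison looks like on BigQuery; the table reference and query are assumptions standing in for wherever the billion taxi rides are loaded:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# Illustrative version of the benchmark's simple aggregation;
# the table name is hypothetical, adapt it to where the trips live.
query = """
    SELECT cab_type, COUNT(*) AS rides
    FROM `my-project.taxi.trips`
    GROUP BY cab_type
"""
for row in client.query(query).result():
    print(row.cab_type, row.rides)
```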
Summary
No magic solutions – Test your assumptions
Always understand your data and needs
Invest the time in modeling and optimization
What are we doing?
Hadoop on bare metal
Data lake – On-Premises vs. Cloud