Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowadays”

Evolution of Data Processing Platforms from Hadoop to nowadays
Yaroslav Ravlinko
Solution Architect
yravlinko@griddynamics.com
Personal experience of building Data Processing platforms from 2013 till now

• I'm a Solution Architect at Grid Dynamics
• I have worked in the IT industry for more than 10 years and
delivered more than 50 projects in different domains.
• Working with “big” data in production since 2013
• Tech agnostic
• I really know why DevOps is not a job title and why
unicorpses live
• https://medium.com/devoops-and-universe/it-trends-guide-in-2016-lot-of-marketing-mimicking-and-even-more-unicorpses-3b68548c72da
Yaroslav Ravlinko,
Grid Dynamics,
Lviv, Ukraine
About me

Path
Hadoop 2013: “Let’s bring computation to data”
Spark 2015: “Let’s think about computation”
Spanner 2017: “Give me SQL”

Stories and legends
2013: Story about the pregnant girl and diapers
2015: Streaming is everything
2017: Serverless

You disagree with me … it’s fine

Things that are killing them
• Resource management and financial/cost management
Every minute that your system is not working costs you money
• Cost of development and gaps in the ecosystem
Somebody should do it all
• Production (SLA)
DevOps (delivery), SRE (availability and SLA)

Intro
The following year in 2004, Google shared another paper on
MapReduce, further cementing the genealogy of big data.
MapReduce was a new technique to move computation to data, and
it allowed large web companies including Google to operate with
enormous amounts of information, such as the entire internet. Soon
after that, Doug Cutting, Hadoop’s initial creator, began to
implement MapReduce and the Hadoop Distributed File System
while at Yahoo, and in 2006 Hadoop 0.1.0 was released.
Six years later in 2012, Hadoop 1.0 became available…
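
To make the MapReduce idea concrete, here is a minimal single-process sketch of the programming model as a word count in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample `partitions` are hypothetical and only illustrate the model. In a real cluster each partition would be mapped on the node that stores it, which is the "move computation to data" part.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(partition: Iterable[str]) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one partition of the input."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Run map on every partition independently, then shuffle and reduce."""
    mapped: List[Tuple[str, int]] = []
    for partition in partitions:  # each partition stands in for one data node
        mapped.extend(map_phase(partition))
    return reduce_phase(shuffle(mapped))


# Hypothetical input, split into two partitions (one per "node").
partitions = [
    ["hadoop moves computation to data"],
    ["spark thinks about computation", "give me SQL"],
]
counts = word_count(partitions)
```

Only the small (word, count) pairs travel between the map and reduce steps, not the raw input lines.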

First “architecture”
[Diagram: Hadoop cluster managed with Cloudera Manager (CM).
Utility Servers: Cloudera Manager (ajlcd-bdhcm01), CM DB (ajlcd-bdhcmdb01), HUE, Login / EDGE Node 1 (ajlcd-bdhen01).
Hadoop Cluster: Name Node 1 (ajlcd-bdhnn01), Name Node 2, Secondary (ajlcd-bdhnn02), Data Nodes 1-7 (ajlcd-bdhdn01 to ajlcd-bdhdn07).]

[Diagram: the same Hadoop cluster (Utility Servers, Name Nodes, Data Nodes 1-7) shown together with an AWS deployment.
Corporate Data Center: RDBMS, connected over VPN.
External sources: Omniture, Google Analytics.
AWS Cloud, Virtual Private Cloud, VPC Subnet: EDGE node (Sqoop, Python), EMR, S3 (RAW), Meta Store, AWS Redshift, RDBMS.
ETL / Data Ingestion: AWS Data Pipeline (Hive, Pig, Shell).
Other components: QlikView, Route 53, Amazon CloudWatch.]

When you have mastered the hammer, everything around you becomes a nail

First “architecture”
[Diagram labels: Corporate Datacenter, AWS Cloud, AKKA, Data Flow Pipeline]

The problem with a pipeline is bottlenecks … and often the bottleneck is not your data

Story about one paper
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/acac3b090a577348a7106d09c051c493298ccb1d.pdf

Spanner: Becoming a SQL System
A prime motivation for this evolution towards a more “database-like”
system was driven by the experiences of Google developers trying to build
on previous “key-value” storage systems. The prototypical example of such
a key-value system is Bigtable [4], which continues to see massive usage at
Google for a variety of applications. However, developers of many OLTP
applications found it difficult to build these applications without a strong
schema system, cross-row transactions, consistent replication and a
powerful query language. The initial response to these difficulties was to
build transaction processing systems on top of Bigtable; an example is
Megastore [2].

But there are two things that are important
Google’s Spanner started out as a key-value store offering multi-row
transactions, external consistency, and transparent failover across
datacenters. Over the past 7 years it has evolved into a relational
database system. In that time we have added a strongly-typed schema
system and a SQL query processor, among other features.
We describe replacing our Bigtable-like SSTable stack with a
blockwise-columnar store called Ressi which is better optimized for
hybrid OLTP/OLAP query workloads.
2010
analytics-focused

You don’t need an aircraft
carrier to deliver a sofa, but
don’t rely on a bicycle either
Evaluate the problem

Follow the money  
You don’t need a “big” data
system if you aren’t ready to
pay for it out of your own pocket
You are not Google

Engineers are hired to create
business value, not to
program things …
Production: Value = Benefits - Cost
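
The formula can be read with numbers. The sketch below is only an illustration: the class `PlatformEstimate`, its fields, and every figure are hypothetical assumptions, not data from the talk. It treats downtime as part of the cost, in line with the earlier point that every minute a system is not working costs money.

```python
from dataclasses import dataclass


@dataclass
class PlatformEstimate:
    """Hypothetical monthly estimate for one way of running a data platform."""
    monthly_benefit: float            # business benefit attributed to the platform
    monthly_infra_cost: float         # hardware or cloud bill
    monthly_engineering_cost: float   # people who build and operate it
    downtime_minutes: float           # expected downtime per month
    cost_per_downtime_minute: float   # money lost per minute of downtime

    def cost(self) -> float:
        """Total cost: infrastructure + engineering + cost of downtime."""
        downtime_cost = self.downtime_minutes * self.cost_per_downtime_minute
        return self.monthly_infra_cost + self.monthly_engineering_cost + downtime_cost

    def value(self) -> float:
        """Value = Benefits - Cost."""
        return self.monthly_benefit - self.cost()


# All numbers below are made up for illustration only.
in_house = PlatformEstimate(100_000, 30_000, 60_000, 120, 200)
managed = PlatformEstimate(100_000, 45_000, 20_000, 30, 200)

print(f"in-house value: {in_house.value()}")
print(f"managed value:  {managed.value()}")
```

With these assumed figures the option with the higher infrastructure bill still yields more value, because engineering and downtime costs dominate.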

Decision “tree”
[Flowchart with YES/NO branches; questions and labels:
• Do you mostly need to run and deploy apps? (ETL < Apps)
• Are your services relying on HDFS as persistent storage?
• Are your tasks mostly ETL-like? (ETL > Apps)]

What is next
1. The “war” in the “big” data world is almost over. Winners are AWS/Azure/GCP. Losers: everyone who is
building in-house big data clusters
2. Hadoop isn’t bad; it is a solid basis for future commodities and utilities: the HDFS API as the de facto API for
distributed storage, Spark/Flink/Beam as SDKs for people who hate SQL
3. Learn SQL if you don’t know it yet. It will be the main interface of all “big” data solutions.

Founded in 2006, Grid Dynamics is an engineering services company built on the
premise that cloud computing is disruptive within the enterprise technology
landscape. Since that time, we’ve had the privilege to help companies like
Microsoft, eBay, PayPal, Cisco, Macy’s, Yahoo, ING, Bank of America, Kohl's,
among others, to re-architect their core mission-critical systems, develop new cloud
services, accelerate innovation cycles, increase software quality, and automate
application management.
Grid Dynamics has multiple locations in the USA and Europe, and employs over
1000 expert engineers worldwide.
About Grid Dynamics

www.griddynamics.com
Thank you!
