Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowadays”
1. Evolution of Data Processing Platforms from
Hadoop to nowadays
Yaroslav Ravlinko
Solution Architect
yravlinko@griddynamics.com
Personal experience of building Data Processing platforms from 2013 till now
2. • I'm a Solution Architect at Grid Dynamics
• I have worked in the IT industry for more than 10 years and
delivered more than 50 projects in different domains.
• Working with “big” data in production since 2013
• Tech agnostic
• Really know why DevOps is not a job title and why
unicorpses live
• https://medium.com/devoops-and-universe/it-trends-guide-in-2016-lot-of-marketing-mimicking-and-even-more-unicorpses-3b68548c72da
Yaroslav Ravlinko,
Grid Dynamics,
Lviv, Ukraine
About me
6. Things that are killing them
• Resource Management and Financial/Cost Management
Every minute that your system is not working costs you money
• Cost of development and gaps in the ecosystem
Somebody should do it all
• Production (SLA)
DevOps (delivery), SRE (availability and SLA)
7. Intro
The following year in 2004, Google shared another paper on
MapReduce, further cementing the genealogy of big data.
MapReduce was a new technique to move computation to data, and
it allowed large web companies including Google to operate with
enormous amounts of information, such as the entire internet. Soon
after that, Doug Cutting, Hadoop’s initial creator, began to
implement MapReduce and the Hadoop Distributed File System
while at Yahoo, and in 2006 Hadoop 0.1.0 was released.
Six years later in 2012, Hadoop 1.0 became available…
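To make the MapReduce idea above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using a word count. The function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, not Hadoop APIs. A real cluster would run the map tasks on the nodes that already hold the input blocks, which this toy version does not attempt to show.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all values by key, as a framework does between map and reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Because map_phase looks at one line at a time and reduce_phase looks at one key at a time, both steps can be spread over many machines without changing the logic.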
9. First “architecture”
[Diagram: Cloudera-managed Hadoop cluster; roles matched to hosts by hostname]
• Utility Servers: Cloudera Manager (CM) on ajlcd-bdhcm01, CM DB on ajlcd-bdhcmdb01
• Hadoop Cluster:
Name Node 1 (ajlcd-bdhnn01), Name Node 2, Secondary (ajlcd-bdhnn02)
EDGE Node 1 (ajlcd-bdhen01); diagram labels: HUE, Login
Data Nodes 1 to 7 (ajlcd-bdhdn01 to ajlcd-bdhdn07)
18. Spanner: Becoming a SQL System
A prime motivation for this evolution towards a more “database-like”
system was driven by the experiences of Google developers trying to build
on previous “key-value” storage systems. The prototypical example of such
a key-value system is Bigtable [4], which continues to see massive usage at
Google for a variety of applications. However, developers of many OLTP
applications found it difficult to build these applications without a strong
schema system, cross-row transactions, consistent replication and a
powerful query language. The initial response to these difficulties was to
build transaction processing systems on top of Bigtable; an example is
Megastore [2].
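As an illustration of what “strong schema system” and “cross-row transactions” give an OLTP developer, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. It is not Spanner, Bigtable, or Megastore. The accounts table and the make_db, balances, and transfer helpers are hypothetical names chosen for this example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a typed, constrained schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " id TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn


def balances(conn: sqlite3.Connection) -> dict:
    """Return the current balances as {id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Update two rows as one unit: both changes apply, or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        # If this debit violates the CHECK constraint, the credit above is undone.
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )


if __name__ == "__main__":
    db = make_db()
    transfer(db, "a", "b", 30)
    print(balances(db))
```

With a plain key-value store, the application itself would have to detect the failed second write and undo the first one, which is the kind of work the quoted passage says developers found difficult.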
19. But there are two things that are important
Google’s Spanner started out as a key-value store offering multi-row
transactions, external consistency, and transparent failover across
datacenters. Over the past 7 years it has evolved into a relational
database system. In that time we have added a strongly-typed schema
system and a SQL query processor, among other features.
We describe replacing our Bigtable-like SSTable stack with a blockwise-
columnar store called Ressi which is better optimized for hybrid OLTP/
OLAP query workloads.
2010
analytics-focused
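The general idea behind a blockwise-columnar layout for hybrid OLTP/OLAP workloads can be sketched as follows. This is only an illustration of the concept: the slide does not describe Ressi's actual on-disk format, and to_blocks, column_sum, and get_row are hypothetical helper names.

```python
from typing import Any, Dict, List

Block = Dict[str, List[Any]]


def to_blocks(rows: List[Dict[str, Any]], block_size: int) -> List[Block]:
    """Split rows into fixed-size blocks and store each block column by column."""
    blocks: List[Block] = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        # Inside one block, all values of a column sit next to each other.
        blocks.append({name: [row[name] for row in chunk] for name in chunk[0]})
    return blocks


def column_sum(blocks: List[Block], name: str) -> Any:
    """OLAP-style scan: reads only one column's values in every block."""
    return sum(sum(block[name]) for block in blocks)


def get_row(blocks: List[Block], block_size: int, index: int) -> Dict[str, Any]:
    """OLTP-style point read: locate one block, then pick one slot per column."""
    block = blocks[index // block_size]
    offset = index % block_size
    return {name: values[offset] for name, values in block.items()}


if __name__ == "__main__":
    data = [{"id": i, "amount": i * 10} for i in range(5)]
    stored = to_blocks(data, block_size=2)
    print(column_sum(stored, "amount"))
    print(get_row(stored, 2, 3))
```

An analytic scan only walks the column it needs, while a point lookup still touches a single block, which is why such a layout suits mixed workloads.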
21. You don’t need an aircraft carrier to deliver a sofa,
but don’t rely on a bicycle either
Evaluate the problem
22. Follow the money
You don’t need a “big” data system if you aren’t ready to
pay for it out of your own pocket
You are not Google
23. Engineers are hired to create business value,
not to program things …
Production: Value = Benefits - Cost
24. Decision “tree”
[Flowchart; nodes and labels listed in extraction order]
• Do you mostly need to run and deploy apps?
• ETL < Apps
• Are your services relying on HDFS as persistent storage?
• Are your tasks mostly ETL-like?
• ETL > Apps
• Branch labels: YES, YES, YES, NO, NO
25. What is next
1. The “war” in the “big” data world is almost over. Winners are AWS/Azure/GCP. Losers: everyone who is
building in-house big data clusters
2. Hadoop isn’t bad; it is a solid basis for future commodities and utilities: the HDFS API as the de facto API for
distributed storage, Spark/Flink/Beam as SDKs for people who hate SQL
3. Learn SQL if you don’t know it yet. It will be the main interface of all “big” data solutions.
27. Founded in 2006, Grid Dynamics is an engineering services company built on the
premise that cloud computing is disruptive within the enterprise technology
landscape. Since that time, we’ve had the privilege to help companies like
Microsoft, eBay, PayPal, Cisco, Macy’s, Yahoo, ING, Bank of America, Kohl's,
among others, to re-architect their core mission-critical systems, develop new cloud
services, accelerate innovation cycles, increase software quality, and automate
application management.
Grid Dynamics has multiple locations in the USA and Europe, and employs over
1000 expert engineers worldwide.
About Grid Dynamics