Roman Nikitchenko, 10.05.2015
BIG DATA: FROM MAMMOTH TO ELEPHANT
MAMMOTH
The only real truth we know
about them is their remains. Do
you feel your enterprise data
infrastructure is going this way?
Come and see in the nearest
data center...
2
TWO YEARS AGO
● Our exciting, highly scalable realtime
BIG DATA solution with a broad
technology stack, in production.
3
This is our
PRESENT DAY
.. yet is powered by
4
OUR INITIAL STATE: TOP VIEW
[Diagram]
● Inbound: healthcare providers' data (labs, cares ...) goes to storage and to SQL DBs with processed inbound data.
● Applications: four CLIENT APPLICATIONS, backed by a SQL DB with application data.
● Outbound: a SQL DB with outbound information and storage, serving mostly insurance companies.
5
OUR INITIAL STATE: TOP VIEW
[Same diagram as on the previous slide, with annotations]
● Inbound data archives (pretty short cycle).
● One SQL DB per application.
● Huge amount of data. Serious amount of duplicates.
● How about retention and data issues investigation?
6
OUR INITIAL STATE: TOP VIEW
YELLOW ALARMS
[Same diagram as on the previous slides, with annotations]
● Outbound flow is slow because of RDBMS processing.
● The inbound data retention cycle is short, so investigating data over a prolonged period is hard.
● Overall a huge number of SQL databases, high operational complexity.
● One application DB per service client makes inter-application analytics and monitoring extremely hard.
7
8
BIG DATA
Better ways to store huge data
volumes: cheaper, safer and easier.
WHAT TO RUN FOR?
MORE STORAGE
9
BIG DATA
WHAT TO RUN FOR?
Scalable effective distributed
processing models to open new
opportunities like machine
learning.
MORE POWER
10
BIG DATA
WHAT TO RUN FOR?
More flexible data
structures closer
to subject area
and real world.
11
RDBMS LIMITS
● Good for anything
● Not so good for
anything in
particular
OUR MAIN ENEMY WAS ...
12
WHY SQL IS EVIL
MASSIVE ANALYSIS
Is about massive access to your data objects
[Diagram] Your database goes through a transformation from database structure into object structure, producing four sets of subject area objects data. Each set has its own processing step (distributed parallel processing). The distributed processing results are then to be joined (effective results collection).
13
RDBMS LIMITS
SUBJECT AREA OBJECT COLLECTION
When you go to massive processing, collecting objects gets too complex. Think about scanning the data of 100,000,000 people.

Address ID  City      Street
1           New York  1020, Blue lake
2           Atlanta   203, Bricks av.
3           Seattle   120, Green drv.

FirstName  LastName  Address  Payer
John       Smith     1        2
Kate       Davis     2        1
Samuel     Brown     3        2

Payer ID  Name       State
1         SaferLife  GA
2         YourGuard  CA

Collected object:
Kate Davis,
Atlanta 203, Bricks av.
SaferLife, GA
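To make the cost of object collection concrete, here is a minimal sketch that builds the three tables from this slide in an in-memory SQLite database and collects one subject area object with two joins. The table names `Patient`, `Address` and `Payer`, the key column names and the helper `collect_patient` are assumptions made for illustration; the slide does not show the actual schema or queries.

```python
import sqlite3

# In-memory database holding the three tables from the slide.
# Table and key column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Address (AddressID INTEGER PRIMARY KEY, City TEXT, Street TEXT);
CREATE TABLE Payer   (PayerID INTEGER PRIMARY KEY, Name TEXT, State TEXT);
CREATE TABLE Patient (FirstName TEXT, LastName TEXT, Address INTEGER, Payer INTEGER);
INSERT INTO Address VALUES (1, 'New York', '1020, Blue lake');
INSERT INTO Address VALUES (2, 'Atlanta', '203, Bricks av.');
INSERT INTO Address VALUES (3, 'Seattle', '120, Green drv.');
INSERT INTO Payer VALUES (1, 'SaferLife', 'GA');
INSERT INTO Payer VALUES (2, 'YourGuard', 'CA');
INSERT INTO Patient VALUES ('John', 'Smith', 1, 2);
INSERT INTO Patient VALUES ('Kate', 'Davis', 2, 1);
INSERT INTO Patient VALUES ('Samuel', 'Brown', 3, 2);
""")


def collect_patient(first_name, last_name):
    """Assemble one subject area object from three tables (two joins)."""
    row = conn.execute(
        """
        SELECT p.FirstName, p.LastName, a.City, a.Street, y.Name, y.State
        FROM Patient p
        JOIN Address a ON a.AddressID = p.Address
        JOIN Payer y ON y.PayerID = p.Payer
        WHERE p.FirstName = ? AND p.LastName = ?
        """,
        (first_name, last_name),
    ).fetchone()
    if row is None:
        return None
    return {
        "name": f"{row[0]} {row[1]}",
        "address": f"{row[2]} {row[3]}",
        "payer": f"{row[4]}, {row[5]}",
    }


print(collect_patient("Kate", "Davis"))
```

Every collected object pays for both joins, which is what becomes painful when the scan covers a hundred million people.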
14
RDBMS LIMITS
TABLE STRUCTURE MODIFICATION
Let it be the Patients table ...

FirstName  LastName  Address  Payer
John       Smith     1        2
Kate       Davis     2        1
Samuel     Brown     3        2

And now let us add a new «Birthday» column, so that the columns are FirstName, LastName, Address, Payer, Birthday.
Easy as pie!
ALTER TABLE Patient ADD Birthday ...
Let's do this with a 2,000,000,000-row MySQL table in
production. What do you do if your table grows further?
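For contrast, here is a rough sketch of how a wide-column store in the style of HBase treats the same change: column qualifiers are part of each cell, so a new «Birthday» value can be written to one row without a table-wide schema change. The dictionary below is only a hypothetical in-memory stand-in, not the HBase client API; the row keys, the `info` column family and the birthday value are made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for a wide-column table:
# row key -> {"family:qualifier": value}. Not a real HBase client.
table = defaultdict(dict)


def put(row_key, column, value):
    """Write a single cell; the column does not have to exist anywhere else."""
    table[row_key][column] = value


# Existing rows, modelled on the Patients table from the slide.
put("smith-john", "info:FirstName", "John")
put("smith-john", "info:LastName", "Smith")
put("davis-kate", "info:FirstName", "Kate")
put("davis-kate", "info:LastName", "Davis")

# "Adding a column" is just writing a new qualifier for the rows that have it.
# The date is invented for this example.
put("davis-kate", "info:Birthday", "1985-03-14")

for row_key, cells in sorted(table.items()):
    print(row_key, cells)
```

Rows that never receive a birthday simply have no such cell, so nothing has to be rewritten for the other two billion rows.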
15
ANY RELATIONAL DATA MODEL
SOONER OR LATER
16
RDBMS LIMITS
HOW TO SCALE?
[Diagram] Your SQL database is split into four shards, each with its own processing step.
● How to partition data? What to do when a new shard is added?
● Need another cluster for processing?
● Distributed processing results have to be joined.
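The question of what happens when a new shard is added can be illustrated with a small sketch. It assumes the simplest possible partitioning scheme, hash of the key modulo the shard count, which the slides do not claim was actually used; `shard_for`, `moved_fraction` and the key format are hypothetical names for this example.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Assign a key to a shard by hashing it and taking the remainder."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys that land on a different shard after resharding."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


keys = [f"patient-{i}" for i in range(10_000)]
# Going from 4 to 5 shards: most keys change their shard.
print(f"moved: {moved_fraction(keys, 4, 5):.0%}")
```

With plain modulo partitioning, a key stays put only when its hash gives the same remainder for both shard counts, so growing from 4 to 5 shards relocates roughly four keys out of five. This is the kind of manual rebalancing work that HBase regions take over.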
17
If you need to store a plain text log,
a collection of objects for a long
time, or current user session
attributes, do you really need
SQL?
18
Cross-application
data storage
SQL DB
Application data
SQL DB
Application data
SQL DB
Application data
Small realtime requests
Batch analytic
and reporting
load
ETL
ETL
ETL
● One-time ETL as the initial step and backup strategy.
● Full migration to Apache HBase.
● Realtime synchronization as a transition-period solution.
OUR INITIAL
BIG PLAN WAS
19
OPEN SOURCE framework for big data.
Both distributed storage and processing.
Provides RELIABILITY
and fault tolerance by
SOFTWARE design (for
example, a file system with
replication factor 3 as
the default).
Horizontal scalability from a
single computer up to
thousands of nodes.
Why Hadoop (initially 1.x)?
20
World's first ever
DATA OS
A 10,000-node computer...
Can start in production from just 4 servers, one of
which is for management and coordination.
A single server is enough for a development
environment.
21
HBase motivation
WHY
LATENCY, SPEED AND ALL
HADOOP PROPERTIES
22
WHY YET ?
[Diagram] Four nodes. On each node, a Region server (database) and a TaskTracker (distributed processing) run on top of a DataNode (file system) and the node hardware.
● Good both for OLTP and batch load.
● Natural scaling and reliability with Hadoop.
● Data processing locality, natural sharding with regions.
● Coordination with ZooKeeper.
23
ZooKeeper
Because coordinating distributed systems is a Zoo.
● Quorum-based service for
fast distributed system
coordination.
● Came into our stack with
Apache HBase, which
needs it for coordination.
It is now part of the core Hadoop
infrastructure.
● We also use it for our own
applications.
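A quick arithmetic sketch of what "quorum based" means for availability: a ZooKeeper ensemble keeps working as long as a strict majority of its servers is alive. The function names below are illustrative only.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(n, "servers: quorum", quorum_size(n),
          "tolerates", tolerated_failures(n), "failure(s)")
```

Three servers tolerate one failure and five tolerate two, while a fourth server adds nothing over three, which is why odd ensemble sizes are the usual choice.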
24
Finally, we went into
initial production with HADOOP 2.0
RESOURCE MANAGEMENT
DISTRIBUTED PROCESSING
FILE SYSTEM
COORDINATION
HADOOP
2.x CORE
25
Real initial approach
[Diagram] Four nodes. On each node, a Region server (database) and a NodeManager (resource management) run on top of a DataNode (file system) and the node hardware, with distributed processing & coordination as a further layer.
● ZooKeeper instances are distributed across the cluster.
● MapReduce is not a service in Hadoop 2.x, just a YARN application.
26
FIRST REAL RESULT
Cross-application
data storage
SQL DB
Application data
SQL DB
Application data
SQL DB
Application data
Small realtime requests
Batch analytic
and reporting
ETL
ETL
ETL
CLOSE BUT NOT EXACT PLAN
Daily ETL. It satisfied our daily reporting needs with a major offload of the SQL
infrastructure. Direct profit: massive processing is much
faster and can handle inter-application data.
DO NOT WEAR ROSE-COLORED GLASSES
27
APPROACH WE HAVE FIXED MUCH LATER
[Diagram] Each SQL server JOINs Table1, Table2, Table3 and Table4 into an ETL stream. The ETL streams are bulk loaded into the BIG DATA shards.
28
Hadoop: don't do it yourself
DON'T DO IT YOURSELF
Because of a number of factors, starting
with our distributed team's support
needs, we have selected
29
HADOOP as INFRASTRUCTURE
30
WHERE TO GO FROM HERE?
31
The admission of
temporary residents into
Canada is a privilege, not
a right.
http://www.cic.gc.ca/
SEARCH /
SECONDARY
INDICES
32
NO SEARCH OUT OF
THE BOX OTHER THAN
LINEAR SCAN OVER
THE TABLE AND
FILTERS.
SEARCH /
SECONDARY
INDICES
The same turned out to apply
to secondary indices in HBase.
33
SEARCH / SECONDARY INDICES
HOW WE MADE IT
HBase
handles user
data changes
Indexes are
built on SOLR
NGData Lily indexer
transforms data
changes into SOLR
index updates
34
HBase: Data and search integration
Search and indexing together
[Diagram]
● Client: the user just puts (or deletes) data (data update) and gets search responses.
● Lily HBase NRT indexer: fed by REPLICATION, translates data changes into SOLR index updates.
● SOLR cloud: provides real indexing and serves search requests (HTTP).
● Apache ZooKeeper does all coordination.
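The flow on this slide, where data changes are translated into index updates, can be sketched in a few lines. This is a toy in-memory inverted index, not the Lily indexer or the SOLR API; the class `TinyIndex`, the event dictionaries and the row keys are assumptions made purely for illustration.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index that is kept in sync by applying change events."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> row keys containing it
        self.terms_by_row = {}            # row key -> terms currently indexed

    def apply(self, event):
        """Translate one put/delete event into index updates."""
        row_key = event["row"]
        # Drop whatever was indexed for this row before the change.
        for term in self.terms_by_row.pop(row_key, set()):
            self.postings[term].discard(row_key)
        if event["op"] == "put":
            terms = set(event["text"].lower().split())
            self.terms_by_row[row_key] = terms
            for term in terms:
                self.postings[term].add(row_key)

    def search(self, term):
        """Return the row keys whose latest text contains the term."""
        return sorted(self.postings.get(term.lower(), set()))


index = TinyIndex()
# The writer only puts or deletes data; the index follows the change stream.
index.apply({"op": "put", "row": "row1", "text": "Kate Davis Atlanta"})
index.apply({"op": "put", "row": "row2", "text": "John Smith New York"})
print(index.search("atlanta"))
index.apply({"op": "delete", "row": "row1"})
print(index.search("atlanta"))
```

The point of the design is that the writer never talks to the index directly: it only changes data, and the indexer derives every index update from the stream of changes.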
35
● Kafka is a high-throughput distributed
messaging system.
● Allows true realtime system reaction
through a publish-subscribe approach.
● New services can subscribe to the data
event stream.
GOING REALTIME
Batch load
Realtime load
New
data
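The publish-subscribe idea behind these bullets can be sketched with an append-only log and a read position per subscriber. This is a toy in-memory model and not the Kafka client API; `TinyBroker`, the topic name and the subscriber names are hypothetical, and the sketch assumes a new subscriber starts reading from the beginning of the log.

```python
from collections import defaultdict


class TinyBroker:
    """Toy publish-subscribe broker: one append-only log per topic."""

    def __init__(self):
        self.log = defaultdict(list)  # topic -> list of messages in order
        self.offsets = {}             # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        self.log[topic].append(message)

    def subscribe(self, subscriber, topic):
        # Assumption of this sketch: new subscribers start at the beginning.
        self.offsets.setdefault((subscriber, topic), 0)

    def poll(self, subscriber, topic):
        """Return messages this subscriber has not seen yet and advance its offset."""
        start = self.offsets[(subscriber, topic)]
        messages = self.log[topic][start:]
        self.offsets[(subscriber, topic)] = start + len(messages)
        return messages


broker = TinyBroker()
broker.subscribe("batch-loader", "new-data")
broker.publish("new-data", "record-1")
broker.publish("new-data", "record-2")
print(broker.poll("batch-loader", "new-data"))

# A new service can join later without the publisher changing anything.
broker.subscribe("realtime-alerts", "new-data")
print(broker.poll("realtime-alerts", "new-data"))
```

Because each subscriber tracks its own position, publishers do not need to know who is listening, which is what lets new services attach to the same stream of data events.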
36
● Kafka can be separated
from the Hadoop infrastructure
or have a backup cluster.
● Data publishers can switch
to another cluster.
● Subscribers (including
Spark on Hadoop) keep 2
places of subscription.
● So now you are free to put
the Kafka cluster in
maintenance or backup
subscribers.
GOING REALTIME
GENTLY
MAINTENANCE
37
This is our
PRESENT DAY
.. yet is powered by
38
SO WHERE ARE
WE GOING?
39
REACTIVE MANIFESTO
OVER BIG DATA
MOTIVATION
… users expect millisecond response times and 100%
uptime. Data is measured in Petabytes. Today's demands
are simply not met by yesterday’s software architectures.
40
REACTIVE MANIFESTO
OVER BIG DATA
… we want systems that are
Responsive, Resilient, Elastic
and Message Driven. We call
these Reactive Systems.
http://www.reactivemanifesto.org/
41
REACTIVE MANIFESTO
OVER BIG DATA
Responsiveness is
the cornerstone of
usability and utility,
but more than that,
responsiveness
means that
problems may be
detected quickly and
dealt with effectively.
RESPONSIVE
42
REACTIVE MANIFESTO
OVER BIG DATA
RESILIENT
The system stays
responsive in the
face of failure.
… The client of a
component is not
burdened with
handling its failures.
All services here are located through ZooKeeper,
which is quorum based, so resilience is achieved.
43
REACTIVE MANIFESTO
OVER BIG DATA
ELASTIC
Reactive Systems
can react to changes
in the input rate by
increasing or
decreasing the
resources allocated
to service these
inputs.
● Both HDFS and HBase
allow dynamic node
addition / removal.
● YARN already handles
most resource allocation
work and keeps progressing.
44
REACTIVE MANIFESTO
OVER BIG DATA
Reactive Systems rely
on asynchronous
message-passing to
establish a boundary
between components
that ensures loose
coupling.
MESSAGE
DRIVEN
Asynchronous
messages from
applications
Any application can
subscribe, not only
Hadoop services
45
LESSONS LEARNED
● No transition in one step. You
enter the Big Data world step by step.
● Change your mind first. You should
stop thinking in the old style. Do not
simply try to map your existing
approaches.
● No silver bullet. Don't ruin your
existing infrastructure. Extend it.
NoSQL is not always good, and
some cases really should stay
on SQL. Use the right tool.
● As you progress, you pay more
attention to operations and
reactive system properties.
46
QUESTIONS?
47

More Related Content

What's hot

Treasure Data From MySQL to Redshift
Treasure Data  From MySQL to RedshiftTreasure Data  From MySQL to Redshift
Treasure Data From MySQL to Redshift
Treasure Data, Inc.
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
Michael Spector
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Sumeet Singh
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Spark Summit
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid
Matt Sarrel
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
Charles Allen
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Charles Allen
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Spark Summit
 
Data ingestion
Data ingestionData ingestion
Data ingestion
nitheeshe2
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
Vinoth Chandar
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Databricks
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
confluent
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
NoSQLmatters
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid
Imply
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
 
Scaling HDFS at Xiaomi
Scaling HDFS at XiaomiScaling HDFS at Xiaomi
Scaling HDFS at Xiaomi
DataWorks Summit
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Databricks
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Spark Summit
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 

What's hot (20)

Treasure Data From MySQL to Redshift
Treasure Data  From MySQL to RedshiftTreasure Data  From MySQL to Redshift
Treasure Data From MySQL to Redshift
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
 
Data ingestion
Data ingestionData ingestion
Data ingestion
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Scaling HDFS at Xiaomi
Scaling HDFS at XiaomiScaling HDFS at Xiaomi
Scaling HDFS at Xiaomi
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 

Viewers also liked

Spring Boot. Boot up your development. JEEConf 2015
Spring Boot. Boot up your development. JEEConf 2015Spring Boot. Boot up your development. JEEConf 2015
Spring Boot. Boot up your development. JEEConf 2015
Strannik_2013
 
Web application I have always dreamt of
Web application I have always dreamt ofWeb application I have always dreamt of
Web application I have always dreamt of
Victor_Cr
 
JEE Conf 2015: Less JS!
JEE Conf 2015: Less JS!JEE Conf 2015: Less JS!
JEE Conf 2015: Less JS!
_Dewy_
 
Generics Past, Present and Future
Generics Past, Present and FutureGenerics Past, Present and Future
Generics Past, Present and Future
RichardWarburton
 
Statis code analysis
Statis code analysisStatis code analysis
Statis code analysis
chashnikov
 
Spring data jee conf
Spring data jee confSpring data jee conf
Spring data jee conf
Evgeny Borisov
 
Scala Rock-Painting
Scala Rock-PaintingScala Rock-Painting
Scala Rock-Painting
GlobalLogic Ukraine
 
Spring cloud for microservices architecture
Spring cloud for microservices architectureSpring cloud for microservices architecture
Spring cloud for microservices architecture
Igor Khotin
 
Do we need JMS in 21st century?
Do we need JMS in 21st century?Do we need JMS in 21st century?
Do we need JMS in 21st century?
Mikalai Alimenkou
 

Viewers also liked (10)

Spring Boot. Boot up your development. JEEConf 2015
Spring Boot. Boot up your development. JEEConf 2015Spring Boot. Boot up your development. JEEConf 2015
Spring Boot. Boot up your development. JEEConf 2015
 
X text
X textX text
X text
 
Web application I have always dreamt of
Web application I have always dreamt ofWeb application I have always dreamt of
Web application I have always dreamt of
 
JEE Conf 2015: Less JS!
JEE Conf 2015: Less JS!JEE Conf 2015: Less JS!
JEE Conf 2015: Less JS!
 
Generics Past, Present and Future
Generics Past, Present and FutureGenerics Past, Present and Future
Generics Past, Present and Future
 
Statis code analysis
Statis code analysisStatis code analysis
Statis code analysis
 
Spring data jee conf
Spring data jee confSpring data jee conf
Spring data jee conf
 
Scala Rock-Painting
Scala Rock-PaintingScala Rock-Painting
Scala Rock-Painting
 
Spring cloud for microservices architecture
Spring cloud for microservices architectureSpring cloud for microservices architecture
Spring cloud for microservices architecture
 
Do we need JMS in 21st century?
Do we need JMS in 21st century?Do we need JMS in 21st century?
Do we need JMS in 21st century?
 

Similar to BIG DATA: From mammoth to elephant

Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!
Roman Nikitchenko
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
AboutYouGmbH
 
Big data: current technology scope.
Big data: current technology scope.Big data: current technology scope.
Big data: current technology scope.
Roman Nikitchenko
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
Amazon Web Services
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)
Sid Anand
 
Big Data - Big Pitfalls.
Big Data - Big Pitfalls.Big Data - Big Pitfalls.
Big Data - Big Pitfalls.
Roman Nikitchenko
 
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
GeeksLab Odessa
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
ModusOptimum
 
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
Athens Big Data
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
Eric Kavanagh
 
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTDataHadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Cloudera, Inc.
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Data
exponential-inc
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Data Con LA
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital Transformation
MongoDB
 
Google take on heterogeneous data base replication
Google take on heterogeneous data base replication Google take on heterogeneous data base replication
Google take on heterogeneous data base replication
Svetlin Stanchev
 
Big_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperBig_SQL_3.0_Whitepaper
Big_SQL_3.0_Whitepaper
Scott Gray
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
Amazon Web Services
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
DataWorks Summit
 
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Mydbops
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
 

Similar to BIG DATA: From mammoth to elephant (20)

Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Big data: current technology scope.
Big data: current technology scope.Big data: current technology scope.
Big data: current technology scope.
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)
 
Big Data - Big Pitfalls.
Big Data - Big Pitfalls.Big Data - Big Pitfalls.
Big Data - Big Pitfalls.
 
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
 
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTDataHadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Data
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital Transformation
 
Google take on heterogeneous data base replication
Google take on heterogeneous data base replication Google take on heterogeneous data base replication
Google take on heterogeneous data base replication
 
Big_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperBig_SQL_3.0_Whitepaper
Big_SQL_3.0_Whitepaper
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 

BIG DATA: From mammoth to elephant

  • 1. Roman Nikitchenko, 10.05.2015 BIG DATA: FROM MAMMOTH TO ELEPHANT
  • 2. MAMMOTH The only real truth we know about them is their remains. Do you feel your enterprise data infrastructure is going this way? Come and see in the nearest data center...
  • 3. TWO YEARS AGO ● Our exciting, highly scalable realtime BIG DATA solution with a broad technology stack in production.
  • 4. This is our PRESENT DAY .. yet is powered by
  • 5. OUR INITIAL STATE: TOP VIEW. Diagram labels: Inbound (healthcare providers data: labs, cares ...); storage; SQL DB (processed inbound data); SQL DB (application data); SQL DB (outbound information); Outbound (mostly insurance companies); CLIENT APPLICATIONS.
  • 6. OUR INITIAL STATE: TOP VIEW. Same diagram as slide 5, annotated: Inbound data archives (pretty short cycle). One SQL DB per application. Huge amount of data. Serious amount of duplicates. How about retention and data issues investigation?
  • 7. OUR INITIAL STATE: TOP VIEW. YELLOW ALARMS. Same diagram as slide 5, annotated: Outbound flow is slow because of RDBMS processing. Inbound data retention cycle is short, so investigating data over a prolonged period is hard. Overall huge number of SQL databases, high operational complexity. One application DB per service client makes inter-application analytics and monitoring extremely hard.
  • 8.
  • 9. BIG DATA: WHAT TO RUN FOR? MORE STORAGE. Better ways to store huge data volumes: cheaper, safer and easier.
  • 10. BIG DATA: WHAT TO RUN FOR? MORE POWER. Scalable, effective distributed processing models that open new opportunities like machine learning.
  • 11. BIG DATA: WHAT TO RUN FOR? More flexible data structures, closer to the subject area and the real world.
  • 12. OUR MAIN ENEMY WAS ... RDBMS LIMITS ● Good for anything ● Not so good for anything in particular
  • 13. WHY SQL IS EVIL. MASSIVE ANALYSIS is about massive access to your data objects. Diagram labels: your database; transformation from database structure into object structure; subject area objects data (×4); distributed parallel processing (Processing ×4); distributed processing results to be joined; effective results collection.
  • 14. RDBMS LIMITS: SUBJECT AREA OBJECT COLLECTION. When you go to massive processing, collecting objects gets too complex. Think about scanning the data of 100.000.000 people. Address table (ID, City, Street): (1, New York, "1020, Blue lake"); (2, Atlanta, "203, Bricks av."); (3, Seattle, "120, Green drv."). Table with columns (FirstName, LastName, Address, Payer): (John, Smith, 1, 2); (Kate, Davis, 2, 1); (Samuel, Brown, 3, 2). Payer table (ID, Name, State): (1, SaferLife, GA); (2, YourGuard, CA). Collected object: Kate Davis; Atlanta, 203, Bricks av.; SaferLife, GA.
  • 15. RDBMS LIMITS: TABLE STRUCTURE MODIFICATION. Table with columns (FirstName, LastName, Address, Payer): (John, Smith, 1, 2); (Kate, Davis, 2, 1); (Samuel, Brown, 3, 2). And now let us add a new «Birthday» column, giving (FirstName, LastName, Address, Payer, Birthday). Easy as pie! Let it be the Patients table ... ALTER TABLE Patient ADD Birthday ... Let's do this with a 2.000.000.000-row MySQL table in production. What do you do if your table grows further?
  • 16. ANY RELATIONAL DATA MODEL SOONER OR LATER
  • 17. RDBMS LIMITS: HOW TO SCALE? How to partition data? What to do when a new shard is added? Need another cluster for processing? Diagram labels: your SQL database; Shard (×4); Processing (×4); distributed processing results to be joined.
  • 18. If you need to store a plain text log, a collection of objects for a long time, or current user session attributes, do you really need SQL?
  • 19. OUR INITIAL BIG PLAN WAS ● One-time ETL as initial step and backup strategy. ● Full migration to Apache HBase. ● As a transition period solution: realtime synchronization. Diagram labels: SQL DB (application data) (×3); ETL (×3); cross-application data storage; small realtime requests; batch analytic and reporting load.
  • 20. Why Hadoop (initially 1.x)? OPEN SOURCE framework for big data: both distributed storage and processing. Provides RELIABILITY and fault tolerance by SOFTWARE design (for example, a file system with replication factor 3 by default). Horizontal scalability from a single computer up to thousands of nodes.
  • 21. The world's first DATA OS: a 10.000-node computer... Can start in production with just 4 servers, 1 of them for management and coordination. A single server is enough for a development environment.
  • 22. HBase motivation: WHY LATENCY, SPEED AND ALL HADOOP PROPERTIES
  • 23. WHY YET ? ● Good both for OLTP and batch load. ● Natural scaling and reliability with Hadoop. ● Data processing locality, natural sharding with regions. ● Coordination with ZooKeeper. Diagram labels: four Nodes (hardware), each with a DataNode (file system), a TaskTracker (distributed processing) and a Region server (database).
  • 24. ZooKeeper: Because coordinating distributed systems is a Zoo. ● Quorum-based service for fast distributed system coordination. ● Came into our stack with Apache HBase, where it was needed for coordination. Now it is part of the core Hadoop infrastructure. ● We also use it for our own applications.
  • 25. Finally we went into initial production with HADOOP 2.0. HADOOP 2.x CORE: RESOURCE MANAGEMENT, DISTRIBUTED PROCESSING, FILE SYSTEM, COORDINATION.
  • 26. Real initial approach ● ZooKeeper instances are distributed across the cluster. ● MapReduce is not a service in Hadoop 2.x, just a YARN application. Diagram labels: four Nodes (hardware), each with a DataNode (file system), a NodeManager (resource management) and a Region server (database); distributed processing & coordination.
  • 27. FIRST REAL RESULT: CLOSE BUT NOT EXACT PLAN. Daily ETL. Satisfied our daily reporting needs with major SQL infrastructure offload. Direct profit: massive processing is much faster and can handle inter-application data. DO NOT WEAR PINK GLASSES. Diagram labels: SQL DB (application data) (×3); ETL (×3); cross-application data storage; small realtime requests; batch analytic and reporting load.
  • 28. APPROACH WE HAVE FIXED MUCH LATER. Diagram labels: SQL server with JOIN over Table1, Table2, Table3, Table4 (×2); ETL stream (×6); Bulk load (×2); BIG DATA shard (×6).
  • 29. Hadoop: DON'T DO IT YOURSELF. Because of a number of factors, starting from our distributed team's support needs, we have selected
  • 31. WHERE TO GO FROM HERE?
  • 32. The admission of temporary residents into Canada is a privilege, not a right. http://www.cic.gc.ca/ SEARCH / SECONDARY INDICES
  • 33. SEARCH / SECONDARY INDICES: NO SEARCH OUT OF THE BOX OTHER THAN LINEAR SCAN OVER THE TABLE AND FILTERS. The same turned out to apply to secondary indices in HBase.
  • 34. SEARCH / SECONDARY INDICES: HOW WE MADE IT. HBase handles user data changes. Indexes are built on SOLR. The NGData Lily indexer transforms data changes into SOLR index updates.
  • 35. HBase: Data and search integration. Search and indexing together. Client: the user just puts (or deletes) data (data update). Lily HBase NRT indexer (REPLICATION): translates data changes into SOLR index updates. SOLR cloud: provides real indexing; search requests (HTTP) and search responses. Apache ZooKeeper does all coordination.
  • 36. GOING REALTIME ● Kafka is a high-throughput distributed messaging system. ● Allows true realtime system reaction through the publish-subscribe approach. ● New services can subscribe to the data events stream. Diagram labels: new data; batch load; realtime load.
  • 37. GOING REALTIME GENTLY: MAINTENANCE ● Kafka can be separated from the Hadoop infrastructure or have a backup cluster. ● Data publishers can switch to another cluster. ● Subscribers (including Spark on Hadoop) keep 2 places of subscription. ● So now you are free to put the Kafka cluster in maintenance or backup subscribers.
  • 38. This is our PRESENT DAY .. yet is powered by
  • 39. SO WHERE ARE WE GOING?
  • 40. REACTIVE MANIFESTO OVER BIG DATA: MOTIVATION. … users expect millisecond response times and 100% uptime. Data is measured in Petabytes. Today's demands are simply not met by yesterday’s software architectures.
  • 41. REACTIVE MANIFESTO OVER BIG DATA. … we want systems that are Responsive, Resilient, Elastic and Message Driven. We call these Reactive Systems. http://www.reactivemanifesto.org/
  • 42. REACTIVE MANIFESTO OVER BIG DATA: RESPONSIVE. Responsiveness is the cornerstone of usability and utility, but more than that, responsiveness means that problems may be detected quickly and dealt with effectively.
  • 43. REACTIVE MANIFESTO OVER BIG DATA: RESILIENT. The system stays responsive in the face of failure. … The client of a component is not burdened with handling its failures. All services here are located through ZooKeeper, which is quorum-based, so resilience is achieved.
  • 44. REACTIVE MANIFESTO OVER BIG DATA: ELASTIC. Reactive Systems can react to changes in the input rate by increasing or decreasing the resources allocated to service these inputs. Both HDFS and HBase allow dynamic node addition / removal. YARN already handles most resource allocation work and makes progress.
  • 45. REACTIVE MANIFESTO OVER BIG DATA: MESSAGE DRIVEN. Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling. Asynchronous messages from applications. Any application can subscribe, not only Hadoop services.
  • 46. LESSONS LEARNED ● No transition in one step. You enter the Big Data world step by step. ● Change your mind first. You should stop thinking in the old style. Do not simply try to map your existing approaches. ● No silver bullet. Don't ruin your existing infrastructure. Extend it. NoSQL is not always good, and some cases really should be kept on SQL. Use the right tool. ● As you progress, you pay more attention to operations and reactive system properties.
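
Slide 17 asks what to do when a new shard is added to a sharded SQL database. The sketch below illustrates why that question hurts, under an assumption the talk does not state: that rows are placed by a naive "hash of key modulo shard count" rule. The function names, the `patient-N` keys and the shard counts are all hypothetical and only serve the illustration.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Hypothetical placement rule: stable hash of the key, modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


# Made-up row keys, just to have something to place.
keys = [f"patient-{i}" for i in range(10_000)]

# Going from 4 to 5 shards under modulo placement relocates most keys.
fraction = moved_fraction(keys, old_count=4, new_count=5)
print(f"{fraction:.0%} of keys change shard when going from 4 to 5 shards")
```

With modulo placement a key stays put only when its hash gives the same remainder for 4 and for 5, so roughly four out of five rows have to be physically moved to add a single shard. This is the kind of cost behind the question on slide 17; slide 23 lists "natural sharding with regions" among the reasons for choosing HBase.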