Fall seven times 
STAND UP EIGHT! 
Roman Nikitchenko, 30.08.2014 
JDay 
2014 
BIG.DATA
What is big data? 
Big Data is like 
TEENAGE SEX. 
Everyone is talking 
about it, nobody is doing 
it correctly. 
www.vitech.com.ua 2
Agenda 
WHY WRONG? 
So what is wrong 
with big data? 
BUILD RIGHT! 
Big data, our 
approach. 
MYTHICAL 
CREATURES 
Some magics 
around big data. 
YET NOTES 
Some notes we 
want to share. 
What is BIG DATA? 
BIG DATA IS EVERYWHERE 
● Really BIG DATA things: photo banks, video storage, 
historical measurements. 
● Intensive data transactions and high distribution: stores 
(offline or online), banks, advertising networks. 
● Realtime data: measurements and monitoring, gaming. 
● Intensive processing: science, modelling. 
● High volumes of small things: social networks, 
healthcare 
BIG DATA in just 3 words 
Indeed any real big 
data is just about 
DIGITAL LIFE 
FOOTPRINT 
So... 
BIG DATA is not about the 
data. It is about OUR ABILITY 
TO HANDLE IT. 
How to do it DEFINITELY WRONG 
Greedy CFO: 
ONE server per 
year! Controlled 
costs! 
Stupid architect: Looks clean 
and everybody is happy! 
Lazy admin: 
Natural growth! 
2006 2007 2008 2009 2010 … 2014 … future 
Putting each year's data on a new server looks nice! 
You are screwed! 
Poor CFO: 
How many... 
this year? 
Crazy 
admin: 
No way to 
balance!!! 
FIRED! 
2006 2007 2008 2009 2010 … 2014 … future 
But then we end up in a REALLY BAD STATE. 
… but why? 
Analyst: Why does the yearly 
average take so long 
while the cluster is IDLE? 
Clients: transactions 
are so slow... 
2006 2007 2008 2009 2010 … 2014 … future 
Why? Because of cluster design. 
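A minimal sketch of why the year-per-server design behaves this way. The row counts below are invented for illustration and do not come from the talk; the point is only that one server scans a whole year alone, while an evenly spread layout shares the same scan across all servers.

```python
# Toy model: rows per year are invented for illustration only.
ROWS_PER_YEAR = {2006: 1_000_000, 2007: 2_000_000, 2008: 4_000_000,
                 2009: 8_000_000, 2010: 16_000_000}
SERVERS = len(ROWS_PER_YEAR)  # one server per year, as on the slide


def busiest_server_rows_year_sharded(year):
    """All rows of one year live on one server, so it scans them alone."""
    return ROWS_PER_YEAR[year]


def busiest_server_rows_evenly_spread(year):
    """Rows are spread evenly, so every server scans an equal slice."""
    return -(-ROWS_PER_YEAR[year] // SERVERS)  # ceiling division


def storage_share_year_sharded():
    """Fraction of all data held by each server in the yearly layout."""
    total = sum(ROWS_PER_YEAR.values())
    return {year: rows / total for year, rows in ROWS_PER_YEAR.items()}


print(busiest_server_rows_year_sharded(2010))   # one server does all the work
print(busiest_server_rows_evenly_spread(2010))  # work is shared by all servers
print(storage_share_year_sharded())             # newest server holds the most
```

With these made-up numbers the 2010 server holds more than half of all data and answers every 2010 query by itself, while the older servers sit idle.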
So BIG DATA our way 
● Linear scalability, so 2 times more power costs approximately 2 times more. 
● No natural keys, so the load can be balanced. 
● No 'special' hardware, so staging is closer to production. 
Why Hadoop? 
[Figure: many BIG DATA blocks combined with the symbols =, + and x MAX] 
HBase on top of Hadoop 
GROUP DANCES 
ON THE ELEPHANT 
BACK 
● Low latency. 
● Highly scalable. 
● Reliability is incredible. 
ZooKeeper 
… because coordinating 
distributed systems is a 
Zoo 
Apache 
ZooKeeper 
Important points 
POINTS 
TO NOTE 
Hadoop design points. 
BIG 
DATA 
● Reliable key nodes, 
replaceable unified 
workers. 
● No RAID on workers and 
no 'logical' layers such as 
LVM. 
Not so commodity... 
● A typical node is expected to 
include at least 64 GB of memory. 
● Starting from 4 x 2 TB drives for 
storage. 8-16 x 4 TB drives are not 
so rare. This is for a general 
'workload' node. 
● 12 or more CPU cores. 2 CPUs 
is the normal approach. 
● SSD is starting to be widely used 
not only for OS and caching but 
for the data itself. 
● Main outcome: the per-node cost 
model is changing. 
HARDWARE 
IS GETTING 
CHEAPER 
AND MORE 
POWERFUL 
Hadoop is... 
… ELEPHANT 
● Designed for throughput, 
not for latency. 
● HDFS blocks are expected 
to be large. There is an issue 
with lots of small files. 
● Write once, read many 
times ideology. 
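A rough sketch of the small-files issue. It counts the file and block entries that the NameNode has to track for the same volume of data. The 128 MB block size and the 1 TB volume are assumptions for this illustration, not figures from the talk.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size in bytes (128 MB)


def namespace_objects(file_count, file_size):
    """Count the file and block entries the NameNode must track.

    Every file needs one file entry plus one entry per block, and a file
    smaller than a block still occupies a block entry of its own.
    """
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


ONE_TB = 1024 ** 4

# The same 1 TB stored as 1 MB files versus 1 GB files.
small = namespace_objects(ONE_TB // (1024 ** 2), 1024 ** 2)
large = namespace_objects(ONE_TB // (1024 ** 3), 1024 ** 3)

print("entries with 1 MB files:", small)
print("entries with 1 GB files:", large)
```

Under these assumptions the small-file layout needs a few hundred times more namespace entries for exactly the same amount of data.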
Most important concept about Hadoop 
World's first ever 
DATA OS 
A 10,000-node computer... 
Recent technology changes are focused on 
higher scale: better resource usage and 
control, lower MTTR, higher security, 
redundancy, fault tolerance. 
Virtualization 
NOT 
SO 
REAL 
ELEPHANT 
Virtualization concerns 
CONCERNS 
● Possible for key nodes, but not for 
workers unless you are really big. 
● Several nodes on a single physical 
host: what happens if this host fails? 
● Loaded services on a VM: is it 
meaningful? Double duties? 
Virtualization: practical case 
● Apache ZooKeeper is a 
QUORUM-based service. 
● If a host running 2 ZK instances fails, 
everything fails, which 
breaks tolerance of 1 
failure. 
● Can you guarantee equal 
performance for ZK 
service instances? 
● DON'T PUT QUORUM 
SERVICES IN A VIRTUAL 
ENVIRONMENT! 
[Figure labels: HOST, HOST] 
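The quorum argument can be checked with a few lines of arithmetic. This is a sketch only: the host names and the three-instance ensemble are assumptions for illustration, and the only rule modeled is that a quorum-based ensemble needs a strict majority of its instances alive.

```python
def has_quorum(alive, ensemble_size):
    """A quorum-based ensemble works only while a strict majority is alive."""
    return alive > ensemble_size // 2


def survives_any_single_host_failure(placement):
    """placement maps a physical host name to the number of ZK instances on it."""
    ensemble_size = sum(placement.values())
    return all(
        has_quorum(ensemble_size - lost, ensemble_size)
        for lost in placement.values()
    )


# Three ZK instances, two of them virtualized onto the same physical host.
shared_host = {"host-a": 2, "host-b": 1}
# Three ZK instances, each on its own physical host.
separate_hosts = {"host-a": 1, "host-b": 1, "host-c": 1}

print(survives_any_single_host_failure(shared_host))
print(survives_any_single_host_failure(separate_hosts))
```

Losing the shared host removes two of three instances at once, so the remaining one cannot form a majority even though only one physical machine failed.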
ETL 
LOAD 
YOUR 
DATA 
WITH CARE 
ETL & BD: main stages 
[Diagram: EXTRACT (SQL server with Table1 to Table4), TRANSFORM (JOIN, Transform, Partition), LOAD (three BIG DATA shards)] 
● SQL solutions are usually not as 
distributed as Big Data ones. How to 
partition your data? 
● Big data storages are mostly non-relational. 
You have to map table 
relations into objects. Where to put this 
complexity? 
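To make "map table relations into objects" concrete, here is a small sketch. The two tables, their column names and the rows are hypothetical; the talk does not describe a specific schema.

```python
from collections import defaultdict

# Hypothetical rows as they might come out of two SQL tables.
customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.0},
    {"order_id": 12, "customer_id": 2, "total": 15.0},
]


def to_documents(customers, orders):
    """Fold the one-to-many relation into one nested object per customer."""
    orders_by_customer = defaultdict(list)
    for order in orders:
        orders_by_customer[order["customer_id"]].append(
            {"order_id": order["order_id"], "total": order["total"]}
        )
    return [
        {**customer, "orders": orders_by_customer.get(customer["customer_id"], [])}
        for customer in customers
    ]


documents = to_documents(customers, orders)
print(documents)
```

The same folding step can run on either side of the ETL pipeline; the question of where to run it is exactly the "where to put this complexity" choice.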
ETL & BD: complexity on SQL side 
[Diagram: SQL server (Table1 to Table4, JOIN) feeds a single ETL stream into three BIG DATA shards] 
● It's hard to transform SQL relationships 
into NoSQL objects: complex joins. 
● Simple stream on the big data side, lower 
network traffic. HUGE load on SQL. 
● What if you have several SQL servers 
and you need 2 times faster import? 
SQL dies on this 
ETL & BD: complexity on BD side 
[Diagram: SQL server (Table1 to Table4) sends separate ETL streams; the JOIN sits on the big data side with three BIG DATA shards] 
● Simple streaming from SQL. Things 
like joins happen on the Big Data side. 
● Even if you have 100 SQL servers, 
you only have to scale a single cluster. 
● Network load is more intensive. 
Much more scalable 
ETL: data partitioning 
[Diagram: KEY to BIG DATA shard mapping across three shards; year keys (2008, 2009, 2010) and letter keys (A..., S..., U...) are marked DNC] 
● By what criteria should you partition? Used storage size should be 
equally distributed. Processing load should be shared. 
Partitioning should scale with the cluster. 
● Definitely no natural keys: they give no equal storage distribution 
and no processing load sharing. 
● Select good synthetic keys starting from the input stream 
(before entering big data). UUID is a great example. 
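A small sketch of the difference between a natural key and a synthetic one. The yearly record volumes, the shard count and the modulo placement rule are assumptions made for this illustration; real stores use their own placement schemes.

```python
import uuid
from collections import Counter

SHARDS = 3
# Invented yearly volumes: data grows, so later years are heavier.
RECORDS_PER_YEAR = {2008: 5_000, 2009: 10_000, 2010: 15_000}


def shard_by_year(year):
    """Natural key: each year is pinned to its own shard."""
    return sorted(RECORDS_PER_YEAR).index(year)


def shard_by_uuid(key):
    """Synthetic key: the random UUID value picks the shard."""
    return key.int % SHARDS


def distribution():
    """Place every record twice, once per keying scheme, and count per shard."""
    natural, synthetic = Counter(), Counter()
    for year, count in RECORDS_PER_YEAR.items():
        for _ in range(count):
            natural[shard_by_year(year)] += 1
            # The key is assigned in the input stream, before loading.
            synthetic[shard_by_uuid(uuid.uuid4())] += 1
    return natural, synthetic


natural, synthetic = distribution()
print("by year:", dict(natural))
print("by UUID:", dict(synthetic))
```

Keying by year reproduces the uneven growth of the data on the shards, while the random UUIDs land on each shard in roughly equal numbers regardless of the year a record belongs to.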
HBase: Data and search integration 
● Client: the user just puts (or deletes) data (data update). 
● HBase cluster (HBase regions): replication can be set up down to column family level. 
● HDFS: serves as the low-level file system. 
● Lily HBase NRT indexer: translates data changes (REPLICATION) into SOLR index updates. 
● SOLR cloud: finally provides search. Search requests (HTTP) and search responses. 
● Apache ZooKeeper does all coordination. 
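The flow above can be imitated with a toy, in-memory model. Everything here is hypothetical: `ToyStore` and `ToyIndexer` are stand-ins invented for this sketch and do not use or resemble the HBase, Lily or SOLR APIs. The sketch only shows the idea that the client writes data and never touches the index directly.

```python
class ToyStore:
    """Stand-in for the data store: it only records puts and deletes."""

    def __init__(self):
        self.rows = {}
        self.listeners = []

    def put(self, row_key, text):
        self.rows[row_key] = text
        self._notify("put", row_key, text)

    def delete(self, row_key):
        self.rows.pop(row_key, None)
        self._notify("delete", row_key, None)

    def _notify(self, kind, row_key, text):
        # Every change is forwarded, much like a replication stream.
        for listener in self.listeners:
            listener(kind, row_key, text)


class ToyIndexer:
    """Stand-in for the indexer: turns data changes into index updates."""

    def __init__(self):
        self.index = {}  # word -> set of row keys

    def on_change(self, kind, row_key, text):
        # Drop old postings for this row, then re-add them on put.
        for keys in self.index.values():
            keys.discard(row_key)
        if kind == "put":
            for word in text.lower().split():
                self.index.setdefault(word, set()).add(row_key)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


store, indexer = ToyStore(), ToyIndexer()
store.listeners.append(indexer.on_change)
store.put("row1", "big data")
store.put("row2", "big elephant")
store.delete("row1")
print(indexer.search("big"))
```

Keeping the index update behind the change stream means the write path stays simple, and search becomes available shortly after each write rather than in the same call.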
Changing Big data industry 
BIG DATA INDUSTRY 
face changes 
New concepts 
DATA LAKE 
Take as much data 
about your business 
processes as you can 
take. The more data 
you have the more 
value you could get 
from it. 
New concepts 
ENTERPRISE DATA HUB 
Don't ruin your existing data warehouse. 
Just extend it with new, centralized big 
data storage through data migration 
solution. 
Networking effects 
NETWORK 
MATTERS 
Global data storage is a reality. 
This makes the whole picture 
completely crazy. 
Share your knowledge! 
DO NOT 
HIDE YOUR 
EXPERIENCE 
Hadoop: don't do it yourself 
Major Hadoop distributions 
● Hortonworks are 'barely open source'. Innovative, but 
'running too fast'. Most of their key technologies are not 
so mature yet. On the edge, in my opinion. 
● Cloudera is stable enough but not stale. Hadoop 2.3 with 
YARN, HBase 0.96.x. Balance. 
● MapR focuses on performance per node but they are 
slightly outdated in terms of functionality and their 
distribution costs. For cases where node performance is 
a high priority. 
Questions and discussion 

 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 

Big Data: fall seven times, stand up eight!

  • 1. Fall seven times STAND UP EIGHT! Roman Nikitchenko, 30.08.2014 JDay 2014 BIG.DATA
  • 2. What is big data? Big Data is like TEENAGE SEX. Everyone is talking about it, nobody doing it correctly. www.vitech.com.ua 2
  • 3. Agenda WHY WRONG? So what is wrong with big data? BUILD RIGHT! Big data, our approach. MYTHICAL CREATURES Some magic around big data. YET NOTES Some notes we want to share.
  • 4. What is BIG DATA? BIG DATA IS EVERYWHERE ● Really BIG DATA things: photo banks, video storage, historical measurements. ● Intensive data transactions and high distribution: stores (offline or online), banks, advertising networks. ● Realtime data: measurements and monitoring, gaming. ● Intensive processing: science, modelling. ● High volumes of small things: social networks, healthcare.
  • 5. BIG DATA in just 3 words Indeed any real big data is just about DIGITAL LIFE FOOTPRINT
  • 6. So... BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE THEM.
  • 7. How to do it DEFINITELY WRONG Greedy CFO: ONE server per year! Controlled costs! Stupid architect: Looks clean and everybody is happy! Lazy admin: Natural growth! 2006 2007 2008 2009 2010 … 2014 … future Putting each year's data on a new server looks nice!
  • 8. You are screwed! Poor CFO: How many... this year? Crazy admin: No way to balance!!! FIRED! 2006 2007 2008 2009 2010 … 2014 … future But then we end up in a REALLY BAD STATE.
  • 9. … but why? Analyst: Why does the yearly average take so long while the cluster is IDLE? Clients: transactions are so slow... 2006 2007 2008 2009 2010 … 2014 … future Why? Because of cluster design.
  • 10. So BIG DATA our way ● Linear scalability, so 2 times more power costs approximately 2 times more. ● No natural keys, so load can be balanced. ● No 'special' hardware, so staging is closer to production.
  • 11. Why Hadoop? (Diagram: many BIG DATA blocks combined; the remaining label text is not recoverable.)
  • 12. HBase on top of Hadoop GROUP DANCES ON THE ELEPHANT BACK ● Low latency. ● Highly scalable. ● Reliability is incredible.
  • 13. ZooKeeper … because coordinating distributed systems is a Zoo. Apache ZooKeeper
  • 14. Important points POINTS TO NOTE
  • 15. Hadoop design points. BIG DATA ● Reliable key nodes, replaceable unified workers. ● No RAID on workers and no 'logical' layers like LVM.
  • 16. Not so commodity... ● A typical node is expected to include at least 64 GB of memory. ● Starting from 4 x 2 TB drives for storage. 8-16 x 4 TB drives are not so rare. This is for a general 'workload' node. ● 12 or more CPU cores. 2 CPUs is the normal approach. ● SSD is starting to be widely used not only for OS and caching but for the data itself. ● Main outcome: the per-node cost model is changing. HARDWARE IS GETTING CHEAPER AND MORE POWERFUL
  • 17. Hadoop is... … ELEPHANT ● Designed for throughput, not for latency. ● HDFS blocks are expected to be large. Lots of small files are a problem. ● Write once, read many times ideology.
  • 18. Most important concept about Hadoop: the world's first DATA OS. A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
  • 19. Virtualization NOT SO REAL ELEPHANT
  • 20. Virtualization CONCERNS ● Possible for key nodes. Not for workers unless you are really big. ● Several nodes on a single physical host: what happens if this host fails? ● Loaded services on a VM: is it meaningful? Double duties?
  • 21. Virtualization: practical case ● Apache ZooKeeper is a QUORUM-based service. ● If a host running 2 ZK instances fails, everything fails, which breaks the tolerance to 1 failure. ● Can you guarantee equal performance for all ZK service instances? ● DON'T PUT QUORUM SERVICES IN A VIRTUAL ENVIRONMENT!
  • 22. ETL LOAD YOUR DATA WITH CARE
  • 23. ETL & BD: main stages (Diagram: SQL server Table1 to Table4 → EXTRACT → TRANSFORM (JOIN, Transform, Partition) → LOAD → BIG DATA shards.) ● SQL solutions are usually not as distributed as big data ones. How to partition your data? ● Big data storages are mostly non-relational. You have to map table relations into objects. Where to put this complexity?
  • 24. ETL & BD: complexity on SQL side (Diagram: SQL server Table1 to Table4 → JOIN → ETL stream → BIG DATA shards.) ● It's hard to transform SQL relationships into NoSQL objects: complex joins. ● Simple stream into big data, lowered network traffic. HUGE load on SQL. ● What if you have several SQL servers and you need 2 times faster import? SQL dies on this.
  • 25. ETL & BD: complexity on BD side (Diagram: SQL server Table1 to Table4 → one ETL stream per table → JOIN on the big data side → BIG DATA shards.) ● Simple streaming from SQL. Things like joins happen on the big data side. ● Even if you have 100 SQL servers, you only have to scale a single cluster. ● Network load is more intensive. Much more scalable.
  • 26. ETL: data partitioning (Diagram: keys mapped to BIG DATA shards by year: 2008, 2009, 2010, and by first letter: A..., S..., U...; labels KEY and DNC.) ● By what criteria to partition? Used storage size is to be equally distributed. Processing load is to be shared. Partitioning is to scale with the cluster. ● Definitely no natural keys: with them there is no equal storage distribution and no processing load sharing. ● Select good synthetic keys starting from the input stream (before entering big data). UUID is a great example.
  • 27. HBase: data and search integration ● Client: the user just puts (or deletes) data (data update) and sends search requests (HTTP), receiving search responses. ● HBase cluster (HBase regions): REPLICATION can be set up at column family level. ● Lily HBase NRT indexer: translates data changes into SOLR index updates. ● SOLR cloud: finally provides search. ● HDFS: serves the low level file system. ● Apache ZooKeeper does all coordination.
  • 28. Changing Big data industry BIG DATA INDUSTRY face changes
  • 29. New concepts DATA LAKE Take as much data about your business processes as you can take. The more data you have the more value you could get from it.
  • 30. New concepts ENTERPRISE DATA HUB Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.
  • 31. Networking effects NETWORK MATTERS Global data storage is a reality. This makes the whole picture completely crazy.
  • 32. Share your knowledge! DO NOT HIDE YOUR EXPERIENCE
  • 33. Hadoop: don't do it yourself
  • 34. Major Hadoop distributions ● HortonWorks are 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet. On the edge, in my opinion. ● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.96.x. Balance. ● MapR focuses on performance per node, but they are slightly outdated in terms of functionality and their distribution is not free. For cases where node performance is high priority.
  • 35. Questions and discussion www.vitech.com.ua 35
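
The partitioning advice on slides 7 to 10 and slide 26 (no natural keys, synthetic keys such as UUID) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the shard count, the modulo-based shard choice, and the function names are hypothetical and are not taken from the talk or from any Hadoop or HBase API.

```python
import uuid
from collections import Counter

SHARDS = 4  # hypothetical cluster size, chosen only for illustration


def shard_by_year(year, first_year=2006):
    """Natural key: every record of a given year lands on the same shard."""
    return (year - first_year) % SHARDS


def shard_by_uuid(key):
    """Synthetic key: the random low bits of a UUID4 pick the shard."""
    return key.int % SHARDS


def distribution(shard_ids):
    """Count how many records each shard received."""
    counts = Counter(shard_ids)
    return [counts.get(s, 0) for s in range(SHARDS)]


RECORDS_2014 = 10_000  # all new writes arrive in the current year

natural = distribution(shard_by_year(2014) for _ in range(RECORDS_2014))
synthetic = distribution(shard_by_uuid(uuid.uuid4()) for _ in range(RECORDS_2014))

print("by year:", natural)    # one shard takes every write
print("by UUID:", synthetic)  # writes spread roughly evenly
```

With the year as key, one shard takes all current writes while the others sit idle, which is the "cluster is IDLE" symptom from slide 9. With random UUIDs each shard receives roughly a quarter of the records.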
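
The ZooKeeper warning on slide 21 comes down to majority arithmetic: an ensemble stays available only while a strict majority of its instances is alive. The sketch below checks that rule for two placements. The host names and the `placement` mapping are hypothetical, and the code models only the counting, not ZooKeeper itself.

```python
def has_quorum(alive, ensemble_size):
    """A majority quorum needs strictly more than half of the ensemble."""
    return alive > ensemble_size // 2


def survives_host_failure(placement, failed_host):
    """placement maps a (hypothetical) host name to the number of
    ZooKeeper instances running on it, e.g. as virtual machines."""
    total = sum(placement.values())
    alive = total - placement.get(failed_host, 0)
    return has_quorum(alive, total)


# Three instances on three physical hosts: any single host may fail.
physical = {"host-a": 1, "host-b": 1, "host-c": 1}

# Three instances, two of them as VMs on the same physical host.
virtual = {"host-a": 2, "host-b": 1}

for host in physical:
    print("physical", host, survives_host_failure(physical, host))
for host in virtual:
    print("virtual ", host, survives_host_failure(virtual, host))
```

In the virtual placement, losing `host-a` removes two of three instances, so the remaining one cannot form a majority. That is the broken tolerance to 1 failure the slide describes.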