Roman Nikitchenko, 04.12.2014
Any real big data is 
just about 
DIGITAL LIFE 
FOOTPRINT 
www.vitech.com.ua 2
NOT ALL 
THINGS IN 
OUR LIFE 
ARE NICE 
THE 
SAME IS 
ABOUT 
... 
BIG DATA is not about the 
data. It is about OUR ABILITY 
TO HANDLE THEM. 
YARN 
Basics 
Most dangerous 
things in Big Data 
Beware! 
Don't shoot yourself 
in the foot with a BIG 
GUN! 
Couple of 
specific notes 
Some aspects are 
more special. 
MOST SERIOUS BIG DATA failure IS ... 
NO DATA 
NO DATA 
The biggest mistake in a BIG 
DATA strategy is to limit the 
amount of data you collect. 
NO MONEY 
WHERE 
ARE 
YOU? 
DATA LAKE 
Take as much data 
about your business 
processes as you can 
take. The more data 
you have the more 
value you could get 
from it. 
YOU 
ALWAYS 
HAVE 
OPTION 
● We have developed 
our own online 
storage which lowers 
maintenance and 
stores anything. 
The most serious errors in Big 
Data are about operations 
and infrastructure, not about 
algorithms or code. 
LIVE WITH IT 
YOU 
ALWAYS 
HAVE 
OPTION 
● We have a special 
engineering roadmap for 
big data infrastructure 
development. 
Use robust solutions 
Why Hadoop? 
[Figure: "BIG DATA" repeated many times, with "=", "+" and "x MAX" labels]
What is 
HADOOP? 
● Hadoop is an open source 
framework for big 
data: both distributed 
storage and 
processing. 
● Hadoop is reliable and 
fault tolerant without 
relying on hardware for 
these properties. 
● Hadoop has unique 
horizontal scalability. 
Currently: from a 
single computer up to 
thousands of cluster 
nodes. 
Hadoop: don't do it yourself 
Option? Our experience is: 
● HortonWorks are 'barely open source'. Innovative, but 
'running too fast'. Most of their key technologies are not 
so mature yet. Some people LOVE them. 
● Cloudera is stable enough but not stale. Hadoop 2.5 with 
YARN, HBase 0.98.x, Spark 1.x. A good balance as of late 2014. 
● MapR focuses on performance per node, but they are 
slightly outdated in terms of functionality, and their 
distribution costs money. For cases where node performance is 
a high priority. 
HBase motivation 
Hadoop is... 
● Designed for throughput, 
not for latency. 
● HDFS blocks are expected 
to be large. There is an issue 
with lots of small files. 
● Write once, read many 
times ideology. 
● MapReduce is not very 
flexible, and neither is any 
database built on top of it. 
● How about realtime? 
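The small-files concern above can be made concrete with rough arithmetic. This is a minimal sketch that assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object and a 128 MB block size. Both numbers are assumptions rather than figures from this talk, and the function name is hypothetical.

```python
# Assumption: ~150 bytes of NameNode heap per namespace object (file or block).
# This is a commonly quoted rule of thumb, not an exact figure.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size: 128 MB


def ceil_div(a, b):
    """Integer ceiling division, avoids float rounding on large values."""
    return -(-a // b)


def namenode_heap_bytes(total_bytes, file_size):
    """Estimate NameNode heap needed to track total_bytes stored as equal-sized files."""
    files = ceil_div(total_bytes, file_size)
    blocks_per_file = max(1, ceil_div(file_size, BLOCK_SIZE))
    objects = files * (1 + blocks_per_file)  # one file object plus its block objects
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_tb = 1024 ** 4
    small = namenode_heap_bytes(one_tb, 100 * 1024)  # 1 TB stored as 100 KB files
    large = namenode_heap_bytes(one_tb, 1024 ** 3)   # 1 TB stored as 1 GB files
    print("100 KB files:", small, "bytes of heap")
    print("1 GB files:  ", large, "bytes of heap")
```

The same terabyte costs the NameNode orders of magnitude more memory when it arrives as small files, which is why small files are packed into larger containers before loading.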
Uses commodity 
hardware... 
The meaning of 'commodity' 
keeps growing 
● 64G RAM is considered a pretty small 
amount. 128G is an increasingly common 
configuration. 
● 2xCPU with 6 cores each is considered 
commodity. 
● 4xHDD is a minimum. SSDs are used 
more and more often. 
VIRTUALIZATION 
Virtualization 
NOT 
SO 
REAL 
ELEPHANT 
CONCERNS 
● It is possible for key nodes, but not for 
workers unless you are really big. 
● Several nodes on a single physical 
host: what happens if this host fails? 
● Loaded services on a VM: is it 
meaningful? Double duties? 
REAL EXAMPLE 
Virtualization: practical case 
● Apache ZooKeeper is a 
QUORUM-based service. 
● If the host with 2 ZK instances 
fails, everything fails, which 
breaks tolerance of 1 
failure. 
● Can you guarantee equal 
performance for ZK 
service instances? 
● DON'T PUT QUORUM 
SERVICES IN A VIRTUAL 
ENVIRONMENT! 
HOST 
HOST 
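The quorum concern on this slide can be checked with a few lines. This is a minimal sketch, assuming a majority quorum and an ensemble of three instances. The instance and host names are hypothetical.

```python
def quorum_size(ensemble):
    """Number of live instances needed for a majority quorum."""
    return ensemble // 2 + 1


def survives_host_failure(placement, failed_host):
    """placement maps instance name -> physical host name (hypothetical names)."""
    alive = sum(1 for host in placement.values() if host != failed_host)
    return alive >= quorum_size(len(placement))


def survives_any_single_host_failure(placement):
    """True only if losing any one physical host still leaves a quorum."""
    hosts = set(placement.values())
    return all(survives_host_failure(placement, host) for host in hosts)


if __name__ == "__main__":
    # Two ZK virtual machines share one physical host.
    virtualized = {"zk1": "hostA", "zk2": "hostA", "zk3": "hostB"}
    # One ZK instance per physical host.
    dedicated = {"zk1": "hostA", "zk2": "hostB", "zk3": "hostC"}
    print("virtualized:", survives_any_single_host_failure(virtualized))
    print("dedicated:  ", survives_any_single_host_failure(dedicated))
```

With two of three instances on one host, a single hardware failure takes out the majority, so the ensemble no longer tolerates even one failure.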
YOU 
ALWAYS 
HAVE 
OPTION 
● Indeed there are lots of 
options with 
virtualization. The only 
concern is the ability 
to use your own brains. 
HBase motivation 
Need online storage for 
big data? 
LATENCY, SPEED and all 
Hadoop properties. 
NO 
SECONDARY 
INDEXES OUT OF 
THE BOX. 
YOU 
ALWAYS 
HAVE 
LOT OF 
OPTIONS 
● We have built our 
search indexing 
technology. 
INDEX ALTERNATIVE: SOLR 
INDEX UPDATE 
INDEX QUERY 
Search responses 
An index update request is 
analyzed, tokenized, 
transformed... and the 
same goes for queries. 
● SOLR indexes documents. What is stored in the 
SOLR index is not what you index. SOLR is NOT A 
STORAGE, ONLY AN INDEX. 
● But it can index ANYTHING. The search result is a 
document ID. 
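The two points on this slide, the same analysis for updates and queries, and an index that returns only document IDs, can be illustrated with a toy inverted index. This is a sketch of the idea only. It does not use or imitate the SOLR API, and all names in it are hypothetical.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analysis chain: lowercase, then split on non-alphanumeric characters."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


class ToyIndex:
    """Inverted index that keeps only document IDs, never the documents."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of document IDs

    def update(self, doc_id, text):
        """Index update: the text is analyzed, then only the ID is recorded."""
        for token in analyze(text):
            self._postings[token].add(doc_id)

    def query(self, text):
        """Query: same analysis as update; returns IDs matching every token."""
        tokens = analyze(text)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


if __name__ == "__main__":
    index = ToyIndex()
    index.update("row-1", "Big Data, BIG pitfalls")
    index.update("row-2", "small data")
    print(index.query("BIG data"))
```

Because the query goes through the same `analyze` step, "BIG data" matches text that was indexed as "Big Data,". The caller then takes the returned IDs to the real storage to fetch the documents.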
● HBase handles online requests that change 
user data. 
● The NGData Lily indexer handles the stream of changes 
and transforms them into SOLR index change 
requests. 
● Indexes are built in SOLR, so HBase data is 
searchable. 
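The flow described above (the store records changes, an indexer turns the change stream into index updates) can be sketched with in-memory stand-ins. This is only an illustration of the pattern. It does not use the HBase, Lily, or SOLR APIs, and every class and method name is hypothetical.

```python
from collections import defaultdict


class ToyStore:
    """Stands in for the data store: keeps rows and records every change."""

    def __init__(self):
        self.rows = {}
        self.changes = []  # stream of ("put" | "delete", row_key) events

    def put(self, key, text):
        self.rows[key] = text
        self.changes.append(("put", key))

    def delete(self, key):
        self.rows.pop(key, None)
        self.changes.append(("delete", key))


class ToyIndexer:
    """Stands in for the indexer: turns change events into index updates."""

    def __init__(self, store):
        self.store = store
        self.index = defaultdict(set)  # token -> row keys
        self.position = 0              # how much of the stream was consumed

    def _remove(self, key):
        for keys in self.index.values():
            keys.discard(key)

    def catch_up(self):
        """Consume unread change events and apply them to the index."""
        for op, key in self.store.changes[self.position:]:
            self._remove(key)  # drop stale entries for this row first
            if op == "put" and key in self.store.rows:
                for token in self.store.rows[key].lower().split():
                    self.index[token].add(key)
        self.position = len(self.store.changes)

    def search(self, token):
        return set(self.index.get(token.lower(), set()))


if __name__ == "__main__":
    store = ToyStore()
    indexer = ToyIndexer(store)
    store.put("r1", "hello world")
    indexer.catch_up()
    print(indexer.search("hello"))
```

The client only puts or deletes data. The index lags behind until the indexer catches up, which is the "near real time" part of the design.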
HBase: Data and search integration 
● Client: the user just puts (or deletes) data. 
● Data update goes to the HBase cluster (HBase regions). 
● HDFS: serves the low level file system. 
● REPLICATION: can be set up to column family level. 
● Lily HBase NRT indexer: translates data changes into SOLR index updates. 
● SOLR cloud: finally provides search. Search requests (HTTP), search responses. 
● Apache Zookeeper does all coordination. 
ETL 
LOAD 
YOUR 
DATA 
WITH CARE 
ENTERPRISE DATA HUB 
Don't ruin your existing data warehouse. 
Just extend it with new, centralized big 
data storage through a data migration 
solution. 
ETL & BD: main stages 
[Diagram: SQL server (Table1 to Table4) → EXTRACT → TRANSFORM (JOIN, Transform, Partition) → LOAD → BIG DATA shards]
● SQL solutions are usually not as 
distributed as Big Data ones. How do you 
partition your data? 
● Big data storages are mostly non-relational. 
You have to map table 
relations into objects. Where do you put this 
complexity? 
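Both questions on this slide can be sketched in a few lines: a stable hash decides which shard a record goes to, and a one-to-many table relation is folded into nested objects. This is a minimal sketch. The `customers` and `orders` tables, their fields, and the function names are hypothetical.

```python
import hashlib


def shard_for(key, shards):
    """Stable hash partitioning: the same key always lands on the same shard."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


def to_objects(customers, orders):
    """Fold a one-to-many relation (hypothetical tables) into nested objects."""
    objects = {c["id"]: {"name": c["name"], "orders": []} for c in customers}
    for order in orders:
        objects[order["customer_id"]]["orders"].append(
            {"id": order["id"], "total": order["total"]}
        )
    return objects


if __name__ == "__main__":
    customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
    orders = [
        {"id": 10, "customer_id": 1, "total": 5},
        {"id": 11, "customer_id": 1, "total": 7},
    ]
    for customer_id, obj in to_objects(customers, orders).items():
        print(shard_for("customer-%d" % customer_id, 3), obj)
```

A cryptographic digest is used instead of Python's built-in `hash` because the built-in is randomized per process for strings, so it would not place the same key on the same shard across runs.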
ETL & BD: complexity on SQL 
[Diagram: SQL server (Table1 to Table4, JOIN) → single ETL stream → BIG DATA shards]
● It's hard to transform SQL relationships 
into NoSQL objects: complex joins. 
● Simple stream on the big data side, lower 
network traffic. HUGE load on SQL. 
● What if you have several SQL servers 
and you need a 2 times faster import? 
SQL dies on this. 
ETL & BD: complexity on BD side 
[Diagram: SQL server (Table1 to Table4) → one ETL stream per table → JOIN on the BIG DATA shards]
● Simple streaming from SQL. Things 
like joins happen on the Big Data side. 
● Even if you have 100 SQL servers, 
you only have to scale a single cluster. 
● Network load is more intensive. 
Much more scalable. 
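The approach on this slide can be sketched as follows: each table is streamed as-is, every row is routed to the shard that owns its join key, and the join then runs locally on each shard. This is a minimal in-memory sketch. The table names, field names, and functions are hypothetical.

```python
import zlib
from collections import defaultdict


def shard_of(key, shards):
    """Stable shard choice based on CRC32 of the join key."""
    return zlib.crc32(str(key).encode("utf-8")) % shards


def route(streams, shards):
    """Send every streamed row to the shard that owns its join key.

    streams is a list of (table_name, key_field, rows) tuples.
    """
    buckets = [defaultdict(lambda: defaultdict(list)) for _ in range(shards)]
    for table, key_field, rows in streams:
        for row in rows:
            key = row[key_field]
            buckets[shard_of(key, shards)][key][table].append(row)
    return buckets


def local_join(bucket, left, right):
    """Join two tables inside one shard; no cross-shard traffic is needed."""
    joined = []
    for key, tables in bucket.items():
        for left_row in tables.get(left, []):
            for right_row in tables.get(right, []):
                joined.append((key, left_row, right_row))
    return joined


if __name__ == "__main__":
    streams = [
        ("users", "id", [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]),
        ("orders", "user_id", [{"user_id": 1, "sku": "x"}, {"user_id": 1, "sku": "y"}]),
    ]
    for number, bucket in enumerate(route(streams, 3)):
        print(number, local_join(bucket, "users", "orders"))
```

The SQL side only does plain table scans. Adding more SQL sources adds more streams, while the join work scales with the number of shards.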
YARN: future of Hadoop 
● YARN forms the resource management layer and completes 
a real distributed data OS, so heterogeneous clusters and 
multi-tenancy become real things. 
● New distributed processing approaches: from now on MapReduce 
is only one among other YARN applications. 
World's first 
DATA OS 
A 10,000-node computer... 
Recent technology changes are focused on 
higher scale. Better resource usage and 
control, lower MTTR, higher security, 
redundancy, fault tolerance. 
YARN 
This is how retail 
agents often 
work. 
YARN 
This is how it often works. 
[Diagram: the CPUs that YARN presents versus what can be reality] 
It's about reservation. Indeed, 
you could have no 
resources left because of a 
service that is not aware 
of YARN. 
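The gap between reservation and reality can be shown with simple bookkeeping. This is a minimal sketch, assuming a 64 GB node that advertises 48 GB to YARN (the `yarn.nodemanager.resource.memory-mb` setting) while a service outside YARN uses 24 GB. All numbers and function names are hypothetical.

```python
def yarn_view_free_mb(advertised_mb, reserved_by_containers_mb):
    """What YARN believes is still free on a node: pure bookkeeping."""
    return advertised_mb - reserved_by_containers_mb


def actually_free_mb(physical_mb, used_by_containers_mb, used_by_other_services_mb):
    """What the OS can really hand out; services outside YARN count too."""
    return physical_mb - used_by_containers_mb - used_by_other_services_mb


if __name__ == "__main__":
    physical = 64 * 1024    # node RAM in MB
    advertised = 48 * 1024  # assumed yarn.nodemanager.resource.memory-mb
    containers = 40 * 1024  # memory reserved by running containers
    other = 24 * 1024       # a service that YARN does not know about
    print("YARN thinks free:", yarn_view_free_mb(advertised, containers), "MB")
    print("Really free:     ", actually_free_mb(physical, containers, other), "MB")
```

YARN will happily schedule another container into memory that no longer exists, so memory used by services outside YARN has to be subtracted from what the node advertises.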
YOU 
ALWAYS 
HAVE 
OPTION 
Apache 
Spark 
● A better MapReduce, with at least some 
MapReduce elements that can be reused. 
● New job models. Not only Map and Reduce. 
● Scala and Python API in addition to Java. 
Functional model support. 
● Results, including the final one, can be 
passed through memory. 
● Works much better if it knows the size of the job to 
do. Streaming is just a sequence of small jobs. 
● Requires proper YARN tuning to use resources 
properly. No dynamic allocation of executors. 
● Persistence: int limitation of 2G. A HUGE 
amount of memory as of today. 
● You cannot partition data 'on the fly'. You should 
guess the right way up front. 
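Since sizing has to be guessed up front, a back-of-the-envelope calculation helps. This is a minimal sketch that reads the 2G figure as the largest size a signed 32-bit int can address and assumes evenly sized partitions. The safety factor, the per-executor overhead, and the function names are hypothetical.

```python
INT_MAX_BYTES = 2 ** 31 - 1  # largest size addressable with a signed 32-bit int


def min_partitions(dataset_bytes, safety=0.5):
    """Smallest partition count that keeps each partition under safety * 2G.

    Assumes data is spread evenly; skewed keys need a larger count.
    """
    limit = int(INT_MAX_BYTES * safety)
    return -(-dataset_bytes // limit)  # ceiling division


def executors_per_node(yarn_node_mb, executor_mb, overhead_mb):
    """How many fixed-size executors fit into the memory YARN manages on a node."""
    return yarn_node_mb // (executor_mb + overhead_mb)


if __name__ == "__main__":
    one_tb = 1024 ** 4
    print("partitions for 1 TB:", min_partitions(one_tb))
    # Assumed: 48 GB managed by YARN, 8 GB executors, 1 GB overhead each.
    print("executors per node:", executors_per_node(48 * 1024, 8 * 1024, 1024))
```

Without dynamic allocation the executor count is fixed when the job is submitted, so both numbers have to be chosen before the job starts.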
Your cluster is ready for next tasks 
MapReduce + Spark on YARN 
● Dynamic, faster to start up, 
resource reuse. 
● Unified management 
infrastructure such as 
logging. 
It is simply too 
good to wait... 
TRUST ME ;-) 
Share your knowledge! 
DO NOT 
HIDE YOUR 
EXPERIENCE 
Questions and discussion 

More Related Content

What's hot

DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?
DataStax
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
grepalex
 
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationInfosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
DataStax Academy
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Mark Rittman
 
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web ApplicationsWhat Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
Todd Hoff
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
Asim Jalis
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
Big Data Joe™ Rossi
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
DataStax Academy
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
Michelle Darling
 
ODI12c as your Big Data Integration Hub
ODI12c as your Big Data Integration HubODI12c as your Big Data Integration Hub
ODI12c as your Big Data Integration Hub
Mark Rittman
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streaming
Maturin BADO
 
Scalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and ApproachesScalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and Approaches
adunne
 
Life After Sharding: Monitoring and Management of a Complex Data Cloud
Life After Sharding: Monitoring and Management of a Complex Data CloudLife After Sharding: Monitoring and Management of a Complex Data Cloud
Life After Sharding: Monitoring and Management of a Complex Data Cloud
OSCON Byrum
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
Anant Corporation
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data Platforms
Mark Rittman
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
DataStax Academy
 
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data ArchivalGPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
ScyllaDB
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architecture
MartinStrycek
 
Deep Dive Into How To Monitor MySQL or MariaDB Galera Cluster / Percona XtraD...
Deep Dive Into How To Monitor MySQL or MariaDB Galera Cluster / Percona XtraD...Deep Dive Into How To Monitor MySQL or MariaDB Galera Cluster / Percona XtraD...
Deep Dive Into How To Monitor MySQL or MariaDB Galera Cluster / Percona XtraD...
Severalnines
 
Cassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceCassandra Lunch #88: Cadence
Cassandra Lunch #88: Cadence
Anant Corporation
 

What's hot (20)

DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
 
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationInfosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
 
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web ApplicationsWhat Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
 
ODI12c as your Big Data Integration Hub
ODI12c as your Big Data Integration HubODI12c as your Big Data Integration Hub
ODI12c as your Big Data Integration Hub
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streaming
 
Scalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and ApproachesScalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and Approaches
 
Life After Sharding: Monitoring and Management of a Complex Data Cloud
Life After Sharding: Monitoring and Management of a Complex Data CloudLife After Sharding: Monitoring and Management of a Complex Data Cloud
Life After Sharding: Monitoring and Management of a Complex Data Cloud
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data Platforms
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
 
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data ArchivalGPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architecture
 
Deep Dive Into How To Monitor MySQL or MariaDB Galera Cluster / Percona XtraD...
Deep Dive Into How To Monitor MySQL or MariaDB Galera Cluster / Percona XtraD...Deep Dive Into How To Monitor MySQL or MariaDB Galera Cluster / Percona XtraD...
Deep Dive Into How To Monitor MySQL or MariaDB Galera Cluster / Percona XtraD...
 
Cassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceCassandra Lunch #88: Cadence
Cassandra Lunch #88: Cadence
 

Similar to Big Data - Big Pitfalls.

Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!
Roman Nikitchenko
 
Big data: current technology scope.
Big data: current technology scope.Big data: current technology scope.
Big data: current technology scope.
Roman Nikitchenko
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
Roman Nikitchenko
 
Big data & frameworks: no book for you anymore
Big data & frameworks: no book for you anymoreBig data & frameworks: no book for you anymore
Big data & frameworks: no book for you anymore
Stfalcon Meetups
 
Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.
Roman Nikitchenko
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
Richard McDougall
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in Production
DataWorks Summit
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
solarisyougood
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
Inside Analysis
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
Mark Rittman
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
Edward Capriolo
 
HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.
Roman Nikitchenko
 
BIG DATA: From mammoth to elephant
BIG DATA: From mammoth to elephantBIG DATA: From mammoth to elephant
BIG DATA: From mammoth to elephant
Roman Nikitchenko
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
MLconf
 
Empowering the AWS DynamoDB™ application developer with Alternator
Empowering the AWS DynamoDB™ application developer with AlternatorEmpowering the AWS DynamoDB™ application developer with Alternator
Empowering the AWS DynamoDB™ application developer with Alternator
ScyllaDB
 
Idi2017 - Cloud DB: strengths and weaknesses
Idi2017 - Cloud DB: strengths and weaknessesIdi2017 - Cloud DB: strengths and weaknesses
Idi2017 - Cloud DB: strengths and weaknesses
Linuxaria.com
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Data
exponential-inc
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
jhugg
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Agile Testing Alliance
 

Similar to Big Data - Big Pitfalls. (20)

Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!
 
Big data: current technology scope.
Big data: current technology scope.Big data: current technology scope.
Big data: current technology scope.
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Big data & frameworks: no book for you anymore
Big data & frameworks: no book for you anymoreBig data & frameworks: no book for you anymore
Big data & frameworks: no book for you anymore
 
Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in Production
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.
 
BIG DATA: From mammoth to elephant
BIG DATA: From mammoth to elephantBIG DATA: From mammoth to elephant
BIG DATA: From mammoth to elephant
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
 
Empowering the AWS DynamoDB™ application developer with Alternator
Empowering the AWS DynamoDB™ application developer with AlternatorEmpowering the AWS DynamoDB™ application developer with Alternator
Empowering the AWS DynamoDB™ application developer with Alternator
 
Idi2017 - Cloud DB: strengths and weaknesses
Idi2017 - Cloud DB: strengths and weaknessesIdi2017 - Cloud DB: strengths and weaknesses
Idi2017 - Cloud DB: strengths and weaknesses
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Data
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 

Recently uploaded

HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 

Big Data - Big Pitfalls.

  • 2. Any real big data is just about DIGITAL LIFE FOOTPRINT www.vitech.com.ua 2
  • 3. NOT ALL THINGS IN OUR LIFE ARE NICE. THE SAME GOES FOR ...
  • 4. BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
  • 6. Basics: the most dangerous things in Big Data. Beware! Don't shoot yourself in the foot with a BIG GUN! A couple of specific notes: some aspects are more special.
  • 7. THE MOST SERIOUS BIG DATA FAILURE IS ... NO DATA
  • 8. NO DATA. The biggest mistake in a BIG DATA strategy is to limit the amount of data you collect. NO MONEY.
  • 9. WHERE ARE YOU?
  • 10. DATA LAKE: Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
  • 11. YOU ALWAYS HAVE AN OPTION ● We have developed our own online storage, which reduces maintenance and stores anything.
  • 12. The most serious errors in Big Data are about operations and infrastructure, not about algorithms or code. LIVE WITH IT.
  • 13. YOU ALWAYS HAVE AN OPTION ● We have a special engineering roadmap for big data infrastructure development.
  • 14. Use robust solutions. Why Hadoop? (Figure: graphic composed of repeated "BIG DATA" blocks.)
  • 15. What is HADOOP? ● Hadoop is an open source framework for big data: both distributed storage and processing. ● Hadoop is reliable and fault tolerant without relying on hardware for these properties. ● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
  • 16. Hadoop: don't do it yourself
  • 17. Option? Our experience is: ● Hortonworks are 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet. Some people LOVE them. ● Cloudera is stable enough but not stale. Hadoop 2.5 with YARN, HBase 0.98.x, Spark 1.x. A good balance as of late 2014. ● MapR focuses on performance per node, but they are slightly outdated in terms of functionality and their distribution costs money. For cases where node performance is a high priority.
  • 18. HBase motivation. Hadoop is... ● Designed for throughput, not for latency. ● HDFS blocks are expected to be large. There is an issue with lots of small files. ● Write once, read many times ideology. ● MapReduce is not so flexible, and neither is any database built on top of it. ● How about realtime?
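The small-files point on the slide above can be made concrete with a rough estimate. The sketch below assumes the often-quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block) and a 128 MiB block size; both numbers are assumptions for illustration, not figures from the talk, and the function names are hypothetical.

```python
# Rough NameNode memory estimate for the same data stored as small or large files.
# Assumption: ~150 bytes of heap per namespace object (rule of thumb, not exact).
BYTES_PER_NAMENODE_OBJECT = 150
# Assumption: 128 MiB HDFS block size.
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_objects(file_count, file_size):
    """Count namespace objects: one per file plus one per block of that file."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # integer ceiling
    return file_count * (1 + blocks_per_file)


def namenode_heap_bytes(file_count, file_size):
    """Estimated NameNode heap needed to track these files."""
    return namenode_objects(file_count, file_size) * BYTES_PER_NAMENODE_OBJECT


MIB = 1024 ** 2
GIB = 1024 ** 3
TOTAL = 1024 ** 4  # 1 TiB of data in both layouts

small_files_heap = namenode_heap_bytes(TOTAL // MIB, MIB)  # 1 MiB files
large_files_heap = namenode_heap_bytes(TOTAL // GIB, GIB)  # 1 GiB files

print(small_files_heap, large_files_heap)
```

Under these assumptions the same terabyte costs the NameNode a few hundred times more memory when it is stored as 1 MiB files, which is why a layer such as HBase is attractive for many small records.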
  • 19. Uses commodity hardware... The understanding of the word 'commodity' is growing ● 64G RAM is considered a pretty small amount. 128G is an increasingly common configuration. ● 2xCPU with 6 cores each is considered commodity. ● 4xHDD is a minimum. SSDs are used more and more often.
  • 20. VIRTUALIZATION: NOT SO REAL ELEPHANT
  • 21. CONCERNS ● Virtualization is possible for key nodes, but not for workers unless you are really big. ● Several nodes on a single physical host: what happens if this host fails? ● Loaded services on a VM: is it meaningful? Double duties?
  • 22. REAL EXAMPLE Virtualization: practical case ● Apache ZooKeeper is a QUORUM-based service. ● If the host with 2 ZK instances fails, everything fails, which breaks the tolerance to 1 failure. ● Can you guarantee equal performance for ZK service instances? ● DON'T PUT QUORUM SERVICES IN A VIRTUAL ENVIRONMENT!
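The quorum argument on the slide above can be checked with a few lines. This is a minimal sketch of majority arithmetic only; the instance and host names are hypothetical and nothing here talks to a real ZooKeeper ensemble.

```python
from typing import Dict


def quorum_size(ensemble_size: int) -> int:
    """Strict majority of instances needed for the ensemble to keep serving."""
    return ensemble_size // 2 + 1


def survives_host_failure(placement: Dict[str, str], failed_host: str) -> bool:
    """placement maps a ZK instance name to the physical host it runs on."""
    alive = sum(1 for host in placement.values() if host != failed_host)
    return alive >= quorum_size(len(placement))


def tolerates_any_single_host_failure(placement: Dict[str, str]) -> bool:
    """True only if losing any one physical host still leaves a majority."""
    return all(survives_host_failure(placement, host)
               for host in set(placement.values()))


# Two of three instances are virtual machines on the same physical host.
shared = {"zk1": "hostA", "zk2": "hostA", "zk3": "hostB"}
# One instance per physical host.
spread = {"zk1": "hostA", "zk2": "hostB", "zk3": "hostC"}

print(tolerates_any_single_host_failure(shared))
print(tolerates_any_single_host_failure(spread))
```

With three instances the quorum is two, so the shared layout loses its majority as soon as hostA goes down, even though only one physical failure occurred.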
  • 23. YOU ALWAYS HAVE AN OPTION ● Indeed, there are lots of options with virtualization. The only concern is the ability to use your own brains.
  • 24. HBase motivation: Need online storage for big data? LATENCY, SPEED and all Hadoop properties.
  • 25. NO SECONDARY INDEXES OUT OF THE BOX.
  • 26. YOU ALWAYS HAVE LOTS OF OPTIONS ● We have built our own search indexing technology.
  • 27. INDEX ALTERNATIVE: SOLR (diagram labels: INDEX UPDATE, INDEX QUERY, search responses). An index update request is analyzed, tokenized, transformed... and the same goes for queries. ● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX. ● But it can index ANYTHING. The search result is a document ID.
  • 28. ● HBase handles online user data change requests. ● The NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests. ● Indexes are built on SOLR, so HBase data is searchable.
  • 29. HBase: Data and search integration ● Client: the user just puts (or deletes) data (data update). ● HBase cluster (HBase regions): REPLICATION can be set up to column family level. ● Lily HBase NRT indexer: translates data changes into SOLR index updates. ● SOLR cloud: finally provides search (search requests over HTTP, search responses). ● HDFS: serves as the low level file system. ● Apache Zookeeper does all coordination.
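The split between storage and index described on the last three slides can be illustrated with a toy model. The sketch below is plain Python and uses no HBase, Lily or SOLR API; `ToyStore` and `ToyIndex` are hypothetical stand-ins, and the change forwarding is synchronous here, whereas the deck describes a replication-driven stream of changes.

```python
from collections import defaultdict


class ToyIndex:
    """Inverted index: token -> set of row keys. It never stores the documents."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._tokens_by_key = {}

    @staticmethod
    def analyze(text):
        # The same analysis is applied to index updates and to queries.
        tokens = (word.strip(".,!?").lower() for word in text.split())
        return [token for token in tokens if token]

    def update(self, key, text):
        self.delete(key)  # drop stale postings before re-indexing
        tokens = set(self.analyze(text))
        self._tokens_by_key[key] = tokens
        for token in tokens:
            self._postings[token].add(key)

    def delete(self, key):
        for token in self._tokens_by_key.pop(key, set()):
            self._postings[token].discard(key)

    def search(self, query):
        """Return the keys of rows containing every query token."""
        tokens = self.analyze(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


class ToyStore:
    """Primary storage; every change is also translated into an index update."""

    def __init__(self, index):
        self._rows = {}
        self._index = index

    def put(self, key, text):
        self._rows[key] = text
        self._index.update(key, text)

    def delete(self, key):
        self._rows.pop(key, None)
        self._index.delete(key)

    def get(self, key):
        return self._rows[key]


index = ToyIndex()
store = ToyStore(index)
store.put("row1", "Big Data pitfalls")
store.put("row2", "Big elephants")

# The search result is only a set of keys; the data is fetched from the store.
for key in sorted(index.search("big")):
    print(key, store.get(key))
```

The client only calls `put` and `delete` on the store, and a search returns keys rather than documents, which mirrors the division of labour the slides describe.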
  • 30. ETL: LOAD YOUR DATA WITH CARE
  • 31. ENTERPRISE DATA HUB: Don't ruin your existing data warehouse. Just extend it with a new, centralized big data storage through a data migration solution.
  • 32. ETL & BD: main stages (Diagram: SQL server tables Table1 to Table4; EXTRACT; TRANSFORM with JOIN, transform and partition; LOAD into BIG DATA shards.) ● SQL solutions are usually not as distributed as Big Data ones. How do you partition your data? ● Big data storages are mostly non-relational. You have to map table relations into objects. Where do you put this complexity?
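One common answer to the partitioning question on the slide above is to hash a row key and take it modulo the shard count. The sketch below shows that idea only; it is not the author's method, and the field names and shard count are hypothetical.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number with a stable hash (not Python's hash())."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(rows, key_field, shard_count):
    """Distribute rows over shards so that equal keys always land together."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[shard_for(row[key_field], shard_count)].append(row)
    return shards


# Hypothetical rows extracted from a SQL table.
rows = [
    {"customer_id": 1, "order_id": 10},
    {"customer_id": 2, "order_id": 11},
    {"customer_id": 1, "order_id": 12},
    {"customer_id": 3, "order_id": 13},
]
shards = partition(rows, "customer_id", shard_count=3)
for number, shard in enumerate(shards):
    print(number, shard)
```

Choosing the key matters more than the hash: partitioning by `customer_id` keeps all orders of one customer on one shard, which makes a later per-customer join local.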
  • 33. ETL & BD: complexity on SQL (Diagram: SQL server tables Table1 to Table4 are JOINed on the SQL server; a single ETL stream feeds the BIG DATA shards.) ● It's hard to transform SQL relationships into NoSQL objects: complex joins. ● Simple stream on the big data side, lower network traffic. HUGE load on SQL. ● What if you have several SQL servers and you need 2 times faster import? SQL dies on this.
  • 34. ETL & BD: complexity on BD side (Diagram: each SQL server table, Table1 to Table4, has its own ETL stream; the JOIN happens on the BIG DATA shards.) ● Simple streaming from SQL. Things like joins happen on the Big Data side. ● Even if you have 100 SQL servers, you have to scale a single cluster. ● Network load is more intensive. Much more scalable.
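Moving the join to the Big Data side, as the slide above recommends, means each table is streamed as a plain scan and the relation is resolved while building objects. A minimal sketch, assuming two hypothetical tables (`customers` and `orders`) and an in-memory join; a real cluster would do the same per shard.

```python
from collections import defaultdict


def denormalize(customers, orders):
    """Join on the receiving side: embed each customer's orders into one object."""
    orders_by_customer = defaultdict(list)
    for order in orders:  # stream from the orders table, no SQL join involved
        orders_by_customer[order["customer_id"]].append(
            {"order_id": order["order_id"], "total": order["total"]}
        )
    for customer in customers:  # stream from the customers table
        yield {
            "customer_id": customer["customer_id"],
            "name": customer["name"],
            "orders": orders_by_customer.get(customer["customer_id"], []),
        }


# Hypothetical rows, as they would arrive from two independent ETL streams.
customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25},
    {"order_id": 11, "customer_id": 1, "total": 40},
]

documents = list(denormalize(customers, orders))
for document in documents:
    print(document)
```

The SQL servers only serve sequential reads here, so adding more of them adds streams rather than join load, at the price of more network traffic into the cluster.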
  • 35. YARN: future of Hadoop ● YARN forms the resource management layer and completes a real distributed data OS, so heterogeneous clusters and multi-tenancy are real things. ● New distributed processing approaches: MapReduce is from now on only one among other YARN applications.
  • 36. World's first ever DATA OS: a 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
  • 37. YARN: This is how retail agents often work.
  • 38. YARN: what reality can be. This is how it often works. (Diagram: rows of CPU boxes, one labeled "YARN presents".) It's about reservation. Indeed, you could have no resources left because of a service that is not aware of YARN.
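The reservation point on the slide above can be shown with a toy model of one worker node. This is only a sketch of the bookkeeping gap, not of YARN's scheduler; the class and field names are hypothetical, and `yarn_vcores` stands for whatever capacity the NodeManager has been configured to offer (in Hadoop 2 that is the `yarn.nodemanager.resource.cpu-vcores` setting).

```python
from dataclasses import dataclass


@dataclass
class Node:
    physical_cores: int      # what the hardware really has
    yarn_vcores: int         # what the NodeManager is configured to offer
    outside_yarn_cores: int  # cores used by services YARN does not know about

    def yarn_can_grant(self, already_reserved: int, requested: int) -> bool:
        """YARN compares reservations with configured capacity, not real load."""
        return already_reserved + requested <= self.yarn_vcores

    def really_free(self, already_reserved: int) -> int:
        """Cores actually left; a negative value means the node is oversubscribed."""
        return self.physical_cores - self.outside_yarn_cores - already_reserved


# 8 physical cores, all offered to YARN, while another service already uses 4.
node = Node(physical_cores=8, yarn_vcores=8, outside_yarn_cores=4)

print(node.yarn_can_grant(already_reserved=4, requested=4))
print(node.really_free(already_reserved=8))
```

In this model YARN happily grants the request while the node ends up four cores short, which is the situation the slide warns about. One way out is to offer YARN fewer vcores than the hardware has.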
  • 39. YOU ALWAYS HAVE AN OPTION
  • 40. Apache Spark ● A better MapReduce, with at least some MapReduce elements able to be reused. ● New job models, not only Map and Reduce. ● Scala and Python APIs in addition to Java. Functional model support. ● Results can be passed through memory, including the final one.
  • 41. ● Works much better if it knows the size of the job to do. Streaming is just a sequence of small jobs. ● Requires proper YARN tuning to use resources properly. No dynamic allocation of executors. ● Persistence: int limitation of 2G. A HUGE amount of memory as of today. ● You cannot partition data 'on the fly'. You have to guess the right way.
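The "int limitation" on the slide above is read here as a single persisted block being bounded by what a signed 32-bit Java int can address; that reading is an assumption. Under it, the minimum partition count for a dataset is simple arithmetic, sketched below with a hypothetical helper that also assumes evenly sized partitions.

```python
# Largest value of a signed 32-bit Java int, just under 2 GiB when counted in bytes.
JAVA_INT_MAX = 2 ** 31 - 1


def min_partitions(dataset_bytes, block_limit=JAVA_INT_MAX):
    """Smallest partition count that keeps every evenly sized block under the limit."""
    return max(1, -(-dataset_bytes // block_limit))  # integer ceiling


GIB = 1024 ** 3
print(min_partitions(10 * GIB))
print(min_partitions(500 * GIB))
```

Because data cannot be repartitioned on the fly, the guess has to be made up front; skewed keys make some partitions larger than the average, so a real job would want a comfortable margin above this minimum.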
  • 42. Your cluster is ready for next tasks: MapReduce + Spark on YARN ● Dynamic, faster to start up, resource reuse. ● Unified management infrastructure such as logging.
  • 43. It is simply too good to wait...
  • 44. TRUST ME ;-)
  • 45. Share your knowledge! DO NOT HIDE YOUR EXPERIENCE
  • 46. Questions and discussion