Roman Nikitchenko, 04.12.2014
Any real big data is just a DIGITAL LIFE FOOTPRINT
www.vitech.com.ua 2
NOT ALL THINGS IN OUR LIFE ARE NICE
THE SAME GOES FOR ...
BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
YARN 
Basics: the most dangerous things in Big Data.
Beware! Don't shoot yourself in the foot with a BIG GUN!
A couple of specific notes: some aspects are more special.
MOST SERIOUS BIG DATA failure IS ... 
NO DATA 
NO DATA
The biggest mistake in a BIG DATA strategy is to limit the amount of data you collect.
NO MONEY
WHERE ARE YOU?
DATA LAKE
Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
YOU ALWAYS HAVE AN OPTION
● We have developed our own online storage, which lowers maintenance effort and stores anything.
The most serious errors in Big Data are about operations and infrastructure, not about algorithms or code.
LIVE WITH IT
YOU ALWAYS HAVE AN OPTION
● We have a dedicated engineering roadmap for big data infrastructure development.
Use robust solutions
Why Hadoop?
[Figure: equation-style graphic built from repeated "BIG DATA" labels, "=", "+" and "x MAX"]
What is HADOOP?
● Hadoop is an open source framework for big data, covering both distributed storage and processing.
● Hadoop is reliable and fault tolerant without relying on hardware for these properties.
● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
Hadoop: don't do it yourself 
Option? Our experience is:
● Hortonworks is 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet. Some people LOVE them.
● Cloudera is stable enough but not stale. Hadoop 2.5 with YARN, HBase 0.98.x, Spark 1.x. A good balance as of late 2014.
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality and their distribution costs money. For cases where node performance is a high priority.
HBase motivation
Hadoop is...
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large. Lots of small files are a problem.
● Built around a 'write once, read many times' ideology.
● MapReduce is not very flexible, and neither is any database built on top of it.
● How about realtime?
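A rough way to see why lots of small files hurt: the NameNode keeps every file and every block as an object in memory. The sketch below is only back-of-the-envelope arithmetic. The 128 MB block size and the figure of about 150 bytes per object are assumptions (a commonly quoted rule of thumb), not numbers from these slides, and `namenode_objects` is a hypothetical helper.

```python
import math

BLOCK_SIZE = 128 * 1024 ** 2   # assumed HDFS block size in bytes
BYTES_PER_OBJECT = 150         # assumed rule-of-thumb NameNode cost per object


def namenode_objects(file_count, file_size_bytes):
    """Count one object per file plus one object per block of that file."""
    blocks_per_file = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


def namenode_memory_bytes(file_count, file_size_bytes):
    """Estimate NameNode heap used by the files under the assumptions above."""
    return namenode_objects(file_count, file_size_bytes) * BYTES_PER_OBJECT


# Ten million 100 KB files versus the same volume packed into block-sized files.
small_files = namenode_memory_bytes(10_000_000, 100 * 1024)
total_bytes = 10_000_000 * 100 * 1024
packed_files = namenode_memory_bytes(math.ceil(total_bytes / BLOCK_SIZE), BLOCK_SIZE)
print(small_files, packed_files)
```

Under these assumptions the small-file layout costs the NameNode orders of magnitude more memory for the same data volume.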
Uses commodity hardware...
The meaning of the word 'commodity' keeps growing:
● 64G RAM is considered a pretty small amount. 128G is a more and more common configuration.
● 2xCPU with 6 cores each is considered commodity.
● 4xHDD is a minimum. SSDs are used more and more often.
VIRTUALIZATION: NOT SO REAL ELEPHANT
CONCERNS
● It is possible for key nodes. Not for workers, unless you are really big.
● Several nodes on a single physical host: what happens if this host fails?
● Loaded services on a VM: is it meaningful? Double duties?
REAL EXAMPLE
Virtualization: practical case
● Apache ZooKeeper is a QUORUM-based service.
● If a host with 2 ZK instances fails, everything fails, which breaks the tolerance to 1 failure.
● Can you guarantee equal performance for all ZK service instances?
● DON'T PUT QUORUM SERVICES IN A VIRTUAL ENVIRONMENT!
[Figure: two boxes labeled HOST]
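The quorum argument can be checked with a few lines of arithmetic. This is a minimal sketch, not ZooKeeper code: the host names, the 3-instance ensemble and the helper functions are illustrative assumptions. A quorum-based service stays available only while a strict majority of its instances is alive.

```python
def has_quorum(alive, ensemble_size):
    """A strict majority of the ensemble must be alive."""
    return alive > ensemble_size // 2


def survives_host_failure(placement, failed_host):
    """placement maps instance name -> physical host it runs on."""
    alive = sum(1 for host in placement.values() if host != failed_host)
    return has_quorum(alive, len(placement))


# Three ZK instances as VMs, two of them sharing one physical host.
virtualized = {"zk1": "hostA", "zk2": "hostA", "zk3": "hostB"}
# Three ZK instances, each on its own physical host.
dedicated = {"zk1": "hostA", "zk2": "hostB", "zk3": "hostC"}

print(survives_host_failure(virtualized, "hostA"))  # quorum is lost
print(all(survives_host_failure(dedicated, h) for h in ("hostA", "hostB", "hostC")))
```

With two instances on one host, a single host failure leaves 1 of 3 instances alive, so the ensemble no longer tolerates one failure even though it has three members.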
YOU ALWAYS HAVE AN OPTION
● Indeed, there are lots of options with virtualization. The only concern is your ability to use your own brain.
HBase motivation
Need online storage for big data?
LATENCY, SPEED and all Hadoop properties.
NO SECONDARY INDEXES OUT OF THE BOX.
YOU ALWAYS HAVE LOTS OF OPTIONS
● We have built our search indexing technology.
INDEX ALTERNATIVE: SOLR
[Figure labels: INDEX UPDATE, INDEX QUERY, search responses]
An index update request is analyzed, tokenized, transformed... and the same happens to queries.
● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.
● But it can index ANYTHING. The search result is a document ID.
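To make the "index, not storage" point concrete, here is a toy inverted index in plain Python. It is not the SOLR API: `analyze` and `ToyIndex` are hypothetical names, and the analysis chain is reduced to lowercasing and splitting on non-alphanumeric characters. It shows the two ideas from the slide: updates and queries go through the same analysis, and a search returns document IDs only.

```python
import re
from collections import defaultdict


def analyze(text):
    """Tiny stand-in for an analysis chain: tokenize and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


class ToyIndex:
    """Maps tokens to document IDs; the original text is not kept."""

    def __init__(self):
        self._postings = defaultdict(set)

    def update(self, doc_id, text):
        for token in analyze(text):
            self._postings[token].add(doc_id)

    def query(self, text):
        """Return IDs of documents containing every query token."""
        tokens = analyze(text)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = ToyIndex()
index.update("row-1", "Big Data pitfalls")
index.update("row-2", "big gun")
print(index.query("BIG"), index.query("data big"))
```

To show the user anything beyond the ID, the caller has to fetch the document from the real storage by that ID.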
● HBase handles online requests that change user data.
● The NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests.
● Indexes are built in SOLR, so HBase data becomes searchable.
HBase: Data and search integration
[Diagram, reconstructed from its labels]
● Client: the user just puts (or deletes) data (data update).
● HBase cluster (HBase regions).
● HDFS: serves as the low-level file system.
● REPLICATION: can be set up down to the column family level.
● Lily HBase NRT indexer: translates data changes into SOLR index updates.
● SOLR cloud: finally provides search; receives search requests (HTTP) and returns search responses.
● Apache ZooKeeper does all the coordination.
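The data flow in this diagram can be sketched as a change log consumed by an indexer. This is a simplified, in-memory illustration, not the HBase, Lily or SOLR API: `ToyStore`, `ToyIndexer` and their methods are hypothetical names, and the "replication stream" is just a Python list.

```python
from collections import defaultdict


class ToyStore:
    """Stands in for the storage side: keeps rows and logs every change."""

    def __init__(self):
        self.rows = {}
        self.change_log = []  # stands in for the replication stream

    def put(self, row_key, text):
        self.rows[row_key] = text
        self.change_log.append(("put", row_key, text))

    def delete(self, row_key):
        self.rows.pop(row_key, None)
        self.change_log.append(("delete", row_key, None))


class ToyIndexer:
    """Stands in for the indexer: turns changes into index updates."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.position = 0  # how much of the change log was already applied

    def catch_up(self, store):
        for op, row_key, text in store.change_log[self.position:]:
            # Drop any previous index entries for this row first.
            for ids in self.postings.values():
                ids.discard(row_key)
            if op == "put":
                for token in text.lower().split():
                    self.postings[token].add(row_key)
        self.position = len(store.change_log)

    def search(self, word):
        return set(self.postings.get(word.lower(), set()))


store = ToyStore()
indexer = ToyIndexer()
store.put("r1", "big data")
store.put("r2", "big gun")
indexer.catch_up(store)
store.delete("r2")
stale = indexer.search("big")   # the index lags until the next catch-up
indexer.catch_up(store)
fresh = indexer.search("big")
print(stale, fresh)
```

The client only ever talks to the store; the index follows with a delay, which is why the real setup is called near real time.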
ETL: LOAD YOUR DATA WITH CARE
ENTERPRISE DATA HUB
Don't ruin your existing data warehouse. Just extend it with a new, centralized big data storage through a data migration solution.
ETL & BD: main stages
[Diagram: SQL server (Table1 to Table4) → EXTRACT → TRANSFORM (JOIN, transform, partition) → LOAD → BIG DATA shards]
● SQL solutions are usually not as distributed as Big Data ones. How do you partition your data?
● Big data storages are mostly non-relational, so you have to map table relations into objects. Where do you put this complexity?
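One common answer to the partitioning question is to hash a key and take it modulo the number of shards. The sketch below assumes that approach; it is not taken from the slides, and `shard_for`, `partition_rows` and the `customer_id` field are hypothetical. CRC32 is used only because it is stable between runs, unlike Python's built-in `hash` for strings.

```python
import zlib


def shard_for(key, shard_count):
    """Pick a shard deterministically from the key."""
    return zlib.crc32(str(key).encode("utf-8")) % shard_count


def partition_rows(rows, key_field, shard_count):
    """Distribute extracted rows over shards by their key."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[shard_for(row[key_field], shard_count)].append(row)
    return shards


rows = [{"customer_id": i, "item": "item-%d" % i} for i in range(1000)]
shards = partition_rows(rows, "customer_id", 3)
print([len(shard) for shard in shards])
```

The choice of key matters more than the hash function: rows that have to be read together should share the key, otherwise every read touches several shards.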
ETL & BD: complexity on the SQL side
[Diagram: the SQL server joins Table1 to Table4 (JOIN) and sends a single ETL stream to the BIG DATA shards. Callout: SQL dies on this]
● It's hard to transform SQL relationships into NoSQL objects: it takes complex joins.
● Simple stream on the big data side and lower network traffic, but a HUGE load on SQL.
● What if you have several SQL servers and you need a 2 times faster import?
ETL & BD: complexity on the BD side
[Diagram: the SQL server sends Table1 to Table4 as separate ETL streams; the JOIN happens on the BIG DATA shards. Callout: much more scalable]
● Simple streaming from SQL. Things like joins happen on the Big Data side.
● Even if you have 100 SQL servers, you only have to scale a single cluster.
● Network load is more intensive.
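A sketch of what "joins on the Big Data side" can look like: each table is streamed as-is, and the receiving side groups rows by key into one object per key. This is plain Python for illustration only; `stream`, `join_on_big_data_side` and the sample tables are hypothetical, and a real cluster would do the grouping in a distributed way.

```python
from collections import defaultdict


def stream(table_name, rows, key_field):
    """Simple per-table ETL stream: no joins on the SQL side."""
    for row in rows:
        yield row[key_field], table_name, row


def join_on_big_data_side(*streams):
    """Group rows from all streams by key into one object per key."""
    grouped = defaultdict(lambda: defaultdict(list))
    for table_stream in streams:
        for key, table_name, row in table_stream:
            grouped[key][table_name].append(row)
    return {key: dict(tables) for key, tables in grouped.items()}


customers = [
    {"customer_id": "c1", "name": "Ann"},
    {"customer_id": "c2", "name": "Bob"},
]
orders = [
    {"customer_id": "c1", "item": "disk"},
    {"customer_id": "c1", "item": "ram"},
]

objects = join_on_big_data_side(
    stream("customers", customers, "customer_id"),
    stream("orders", orders, "customer_id"),
)
print(objects["c1"])
```

The SQL server only runs plain table scans here, which is the trade described above: more network traffic in exchange for load that lands on the side that scales.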
YARN: future of Hadoop
● YARN forms the resource management layer and completes a real distributed data OS, so heterogeneous clusters and multi-tenancy become real things.
● New distributed processing approaches: from now on MapReduce is only one among many YARN applications.
The world's first ever DATA OS
A 10,000-node computer...
Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
YARN
This is how retail agents often work.
YARN: what reality can be
This is how it often works.
[Figure: the CPUs that YARN presents versus the CPUs really available]
What YARN presents is about reservation. Indeed, you could have no real resources because of a service that is not aware of YARN.
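The gap between reservation and reality can be illustrated with simple bookkeeping. This is a toy model, not YARN code: `ToyNode` and its fields are hypothetical. The scheduler only counts what it has handed out itself, so load from a service outside its control is invisible to it.

```python
class ToyNode:
    """One worker node as a reservation-based scheduler sees it."""

    def __init__(self, cores):
        self.cores = cores
        self.reserved = 0        # cores handed out by the scheduler
        self.external_load = 0   # cores used by a service the scheduler ignores

    def reserve(self, cores):
        """Grant a container if the scheduler's own bookkeeping allows it."""
        if self.reserved + cores > self.cores:
            return False
        self.reserved += cores
        return True

    def oversubscribed_by(self):
        """Cores demanded beyond what the hardware really has."""
        return max(0, self.reserved + self.external_load - self.cores)


node = ToyNode(cores=8)
node.external_load = 6            # e.g. a heavy service not managed by the scheduler
granted = node.reserve(4)         # bookkeeping still says 8 cores are free
print(granted, node.oversubscribed_by())
```

The reservation succeeds on paper while the node is short of two cores, which is why every heavy service on a node should either run under the scheduler or have its share excluded from the resources the scheduler is told about.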
YOU ALWAYS HAVE AN OPTION
Apache Spark
● A better MapReduce, with at least some MapReduce elements that can be reused.
● New job models, not only Map and Reduce.
● Scala and Python APIs in addition to Java. Functional model support.
● Results can be passed through memory, including the final one.
● Works much better if it knows the size of the job to do. Streaming is just a sequence of small jobs.
● Requires proper YARN tuning to use resources properly. No dynamic allocation of executors.
● Persistence: int limitation of 2G. A HUGE amount of memory as of today.
● You cannot repartition data 'on the fly'. You should guess the right way upfront.
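The 2G figure follows from a signed 32-bit int: a single persisted block cannot exceed 2^31 - 1 bytes. The sketch below is only arithmetic for picking a lower bound on the partition count. `min_partitions` is a hypothetical helper, and even data distribution across partitions is an assumption that real, skewed data often breaks.

```python
INT_MAX = 2 ** 31 - 1  # largest value of a signed 32-bit int, just under 2 GiB


def min_partitions(dataset_bytes, limit=INT_MAX):
    """Smallest partition count that keeps every partition within the limit,
    assuming the data is spread evenly."""
    return -(-dataset_bytes // limit)  # ceiling division on integers


ten_gib = 10 * 2 ** 30
print(min_partitions(ten_gib))
```

Since partitions cannot be changed on the fly, it is safer to pick a count well above this lower bound before the job starts.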
Your cluster is ready for the next tasks
MapReduce + Spark on YARN
● Dynamic, faster to start up, resources are reused.
● Unified management infrastructure, such as logging.
It is simply too good to wait...
TRUST ME ;-) 
Share your knowledge!
DO NOT HIDE YOUR EXPERIENCE
Questions and discussion 

Java/Scala Lab: Roman Nikitchenko - Big Data - Big Pitfalls.
