Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.

Any real big data is just about the DIGITAL LIFE FOOTPRINT.
www.vitech.com.ua 2

NOT ALL THINGS IN OUR LIFE ARE NICE.
THE SAME IS TRUE ABOUT ...

BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.

Basics: the most dangerous things in Big Data.
Beware! Don't shoot yourself in the foot with a BIG GUN!
A couple of specific notes: some aspects are more special.

MOST SERIOUS BIG DATA failure IS ...
NO DATA

NO DATA
The biggest mistake in a BIG DATA strategy is to limit the amount of data you collect.
NO MONEY

WHERE
ARE
YOU?

DATA LAKE
Take as much data about your business processes as you can. The more data you have, the more value you can get from it.

YOU ALWAYS HAVE AN OPTION
● We have developed our own online storage, which lowers maintenance effort and stores anything.

The most serious errors in Big Data are about operations and infrastructure, not about algorithms or code.
LIVE WITH IT

YOU ALWAYS HAVE AN OPTION
● We have a special engineering roadmap for big data infrastructure development.

Use robust solutions
Why Hadoop?
[Figure: the words BIG DATA repeated across the slide, combined with "=", "+" and "x MAX" symbols]

What is
HADOOP?
● Hadoop is an open source framework for big data: both distributed storage and processing.
● Hadoop is reliable and fault tolerant without relying on hardware for these properties.
● Hadoop has unique horizontal scalability. Currently it scales from a single computer up to thousands of cluster nodes.

Hadoop: don't do it yourself

Option? Our experience is:
● Hortonworks are 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet. Some people LOVE them.
● Cloudera is stable enough but not stale. Hadoop 2.5 with YARN, HBase 0.98.x, Spark 1.x. A good balance as of late 2014.
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is a high priority.

HBase motivation
Hadoop is...
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large. There is an issue with lots of small files.
● Write once, read many times ideology.
● MapReduce is not very flexible, and neither is any database built on top of it.
● What about real time?
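
The small-files concern above can be made concrete with a little arithmetic. The sketch below counts how many file and block objects the HDFS NameNode would have to track for the same volume of data stored as many small files versus a few large ones. The 128 MB block size is a common default and the 150 bytes per object is only a frequently quoted rule of thumb; both are assumptions, not figures from this talk, and the helper name is hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size (a common default)
BYTES_PER_OBJECT = 150           # assumed rule-of-thumb NameNode heap cost per file or block


def namenode_objects(file_count, file_size):
    """Return the number of file + block objects for identically sized files."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


total_bytes = 1024 ** 4  # 1 TiB of data in both layouts

# Layout A: 1 MiB files. Layout B: 1 GiB files.
small = namenode_objects(total_bytes // 1024 ** 2, 1024 ** 2)
large = namenode_objects(total_bytes // 1024 ** 3, 1024 ** 3)

print("objects with 1 MiB files:", small)
print("objects with 1 GiB files:", large)
print("estimated heap for small files, MiB:", small * BYTES_PER_OBJECT / 1024 ** 2)
```

Under these assumptions the same terabyte costs the NameNode a couple of hundred times more bookkeeping when it arrives as small files, which is the practical reason to batch or compact before loading.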

Uses commodity hardware...
The meaning of the word 'commodity' keeps growing
● 64 GB of RAM is considered a pretty small amount. 128 GB is an increasingly common configuration.
● 2 CPUs with 6 cores each is considered commodity.
● 4 HDDs is a minimum. SSDs are used more and more often.

VIRTUALIZATION
Virtualization: NOT SO REAL ELEPHANT

CONCERNS
● Virtualization is possible for key nodes, but not for workers unless you are really big.
● Several nodes on a single physical host: what happens if this host fails?
● Heavily loaded services on a VM: is it meaningful? Double duty?

REAL EXAMPLE
Virtualization: practical case
● Apache ZooKeeper is a QUORUM based service.
● If a host running 2 ZK instances fails, everything fails, which breaks the tolerance to 1 failure.
● Can you guarantee equal performance for ZK service instances?
● DON'T PUT QUORUM SERVICES IN A VIRTUAL ENVIRONMENT!
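
The quorum argument above can be checked mechanically. The following is a minimal sketch, assuming a 3-instance ZooKeeper ensemble (the slide does not state the size) and hypothetical host names: it tests whether the ensemble keeps a strict majority after any single physical host fails.

```python
from collections import Counter


def survives_any_single_host_failure(placement):
    """placement maps ZooKeeper instance name -> physical host name.

    A quorum needs a strict majority of the whole ensemble, so the ensemble
    survives only if losing any one host still leaves more than half alive.
    """
    ensemble_size = len(placement)
    majority = ensemble_size // 2 + 1
    per_host = Counter(placement.values())
    return all(ensemble_size - lost >= majority for lost in per_host.values())


# Two of three instances are VMs on the same physical host (hypothetical names).
virtualized = {"zk1": "host-a", "zk2": "host-a", "zk3": "host-b"}
# One instance per physical host.
dedicated = {"zk1": "host-a", "zk2": "host-b", "zk3": "host-c"}

print(survives_any_single_host_failure(virtualized))  # False
print(survives_any_single_host_failure(dedicated))    # True
```

The same check can be run against any planned VM placement before deployment; it says nothing about the performance question, only about the failure arithmetic.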
[Diagram: ZooKeeper instances placed on two physical HOSTs]

YOU ALWAYS HAVE AN OPTION
● Indeed, there are lots of options with virtualization. The only concern is the ability to use your own brain.

HBase motivation
Need online storage for big data?
LATENCY, SPEED and all Hadoop properties.

NO SECONDARY INDEXES OUT OF THE BOX.

YOU ALWAYS HAVE LOTS OF OPTIONS
● We have built our own search indexing technology.

INDEX ALTERNATIVE: SOLR
[Diagram: INDEX UPDATE and INDEX QUERY requests go in, search responses come out]
An index update request is analyzed, tokenized, transformed... and the same goes for queries.
● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.
● But it can index ANYTHING. The search result is a document ID.
● HBase handles online user data change requests.
● NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests.
● Indexes are built on SOLR, so HBase data are searchable.
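
The flow described above (puts and deletes on the data side translated into index updates) can be sketched as follows. This does not use the Lily indexer or SOLR APIs; the event tuples, the function name and the dictionary standing in for the index are all hypothetical.

```python
def apply_changes(index, changes):
    """Translate a stream of (operation, row_key, fields) events into index updates.

    index maps row_key -> set of lowercase tokens and stands in for a search index.
    """
    for operation, row_key, fields in changes:
        if operation == "put":
            tokens = set()
            for value in fields.values():
                tokens.update(value.lower().split())
            index[row_key] = tokens        # re-index the whole row
        elif operation == "delete":
            index.pop(row_key, None)       # drop the row from the index
        else:
            raise ValueError(f"unknown operation: {operation}")
    return index


changes = [
    ("put", "user-1", {"name": "Alice Smith", "city": "Odessa"}),
    ("put", "user-2", {"name": "Bob Jones", "city": "Kyiv"}),
    ("delete", "user-2", {}),
]

search_index = apply_changes({}, changes)
print(search_index)
```

The client never talks to the index directly in this model: it only puts or deletes rows, and the index follows the change stream.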

HBase: Data and search integration
[Diagram, components and their captions:]
● Client: data update, search requests (HTTP), search responses. The user just puts (or deletes) data.
● HBase cluster: HBase regions.
● REPLICATION: replication can be set up to column family level.
● Lily HBase NRT indexer: translates data changes into SOLR index updates.
● SOLR cloud: finally provides search.
● Apache ZooKeeper: does all coordination.
● HDFS: serves as the low level file system.
ETL
LOAD YOUR DATA WITH CARE
ENTERPRISE DATA HUB
Don't ruin your existing data warehouse. Just extend it with a new, centralized big data storage through a data migration solution.

ETL & BD: main stages
[Diagram: SQL server (Table1 to Table4) -> EXTRACT -> TRANSFORM (JOIN, Transform, Partition) -> LOAD -> three BIG DATA shards]
● SQL solutions are usually not as distributed as Big Data ones. How do you partition your data?
● Big data storages are mostly non-relational. You have to map table relations onto objects. Where do you put this complexity?
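
The two questions above (how to partition, and where to map relations onto objects) can be illustrated with a small sketch. It joins two hypothetical tables into nested objects and assigns each object to a shard by hashing its key. The table contents, field names and shard count are made up for illustration, and CRC32 is used only because it gives a stable hash across runs; a real system would use its own partitioner.

```python
import zlib
from collections import defaultdict

SHARD_COUNT = 3  # hypothetical number of big data shards


def shard_for(key):
    """Stable hash partitioning: the same key always maps to the same shard."""
    return zlib.crc32(key.encode("utf-8")) % SHARD_COUNT


def join_to_objects(customers, orders):
    """Map a one-to-many table relation onto nested objects (a denormalizing join)."""
    orders_by_customer = defaultdict(list)
    for order in orders:
        orders_by_customer[order["customer_id"]].append(
            {"order_id": order["order_id"], "total": order["total"]}
        )
    return [
        {"customer_id": c["customer_id"], "name": c["name"],
         "orders": orders_by_customer.get(c["customer_id"], [])}
        for c in customers
    ]


customers = [{"customer_id": "c1", "name": "Alice"},
             {"customer_id": "c2", "name": "Bob"}]
orders = [{"order_id": "o1", "customer_id": "c1", "total": 10},
          {"order_id": "o2", "customer_id": "c1", "total": 25},
          {"order_id": "o3", "customer_id": "c2", "total": 7}]

shards = defaultdict(list)
for obj in join_to_objects(customers, orders):
    shards[shard_for(obj["customer_id"])].append(obj)

for shard_id in sorted(shards):
    print(shard_id, [o["customer_id"] for o in shards[shard_id]])
```

Whether `join_to_objects` runs on the SQL server or on the big data cluster is exactly the design choice the next two slides compare.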

ETL & BD: complexity on SQL
[Diagram: SQL server (Table1 to Table4), JOIN done on the SQL side -> single ETL stream -> three BIG DATA shards]
● It's hard to transform SQL relationships into NoSQL objects: complex joins.
● Simple stream on the big data side, lower network traffic. HUGE load on SQL.
● What if you have several SQL servers and you need 2 times faster import?
SQL dies on this

ETL & BD: complexity on BD side
[Diagram: SQL server (Table1 to Table4) -> one ETL stream per table -> JOIN done on the Big Data side -> three BIG DATA shards]
● Simple streaming from SQL. Things like joins happen on the Big Data side.
● Even if you have 100 SQL servers, you only have to scale a single cluster.
● Network load is more intensive.
Much more scalable

YARN: future of Hadoop
● YARN forms the resource management layer and completes a real distributed data OS, so heterogeneous clusters and multi-tenancy become real things.
● New distributed processing approaches: from now on, MapReduce is only one among other YARN applications.

The world's first ever DATA OS
A 10,000-node computer...
Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.

YARN
This is how retail agents often work.

YARN: what the reality can be
This is how it often works.
[Diagram: CPU slots as actually used vs. CPU slots as YARN presents them]
It's about reservation. Indeed, you could have no resources left because of a service that is not aware of YARN.
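
The reservation point above comes down to bookkeeping: the scheduler tracks what it has handed out, not what the machine is actually doing. A minimal sketch of that gap, with made-up numbers and hypothetical names (this is not the YARN API):

```python
def node_view(total_cores, yarn_reserved, non_yarn_busy):
    """Compare what a reservation-based scheduler believes with what is really free.

    total_cores: cores the node advertises to the scheduler
    yarn_reserved: cores already reserved by scheduler-managed containers
    non_yarn_busy: cores consumed by a service the scheduler does not know about
    """
    scheduler_free = total_cores - yarn_reserved
    really_free = max(0, total_cores - yarn_reserved - non_yarn_busy)
    return scheduler_free, really_free


# Hypothetical 8-core node: 4 cores reserved through the scheduler,
# 4 more eaten by a service running outside of it.
scheduler_free, really_free = node_view(total_cores=8, yarn_reserved=4, non_yarn_busy=4)
print(scheduler_free, really_free)  # 4 0
```

In this made-up case the scheduler would happily place four more cores' worth of containers on a node that has nothing left, which is why services outside the scheduler need to be accounted for when sizing what each node advertises.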

YOU ALWAYS HAVE AN OPTION

Apache
Spark
● A better MapReduce, with at least some MapReduce elements that can be reused.
● New job models. Not only Map and Reduce.
● Scala and Python APIs in addition to Java. Functional model support.
● Results can be passed through memory, including the final one.

● Works much better if it knows the size of the job to do. Streaming is just a sequence of small jobs.
● Requires proper YARN tuning to use resources properly. No dynamic allocation of executors.
● Persistence: an int-based limitation of 2G. A HUGE amount of memory as of today.
● You cannot partition data 'on the fly'. You have to guess the right way up front.
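
The 2G limit and the need to guess partitioning up front can be turned into a quick estimate. The sketch below computes the smallest partition count that keeps an average partition under the limit. It assumes the limit is the largest signed 32-bit value (2^31 - 1 bytes), that data is spread evenly, and a hypothetical safety factor for skew; none of these numbers come from the talk.

```python
import math

INT_MAX_BYTES = 2 ** 31 - 1  # assumed meaning of the "int limitation with 2G"


def min_partitions(dataset_bytes, skew_factor=2.0):
    """Smallest partition count keeping skew_factor * average partition size within the limit."""
    if dataset_bytes <= 0:
        return 1
    return max(1, math.ceil(dataset_bytes * skew_factor / INT_MAX_BYTES))


one_tib = 1024 ** 4
print(min_partitions(one_tib, skew_factor=1.0))  # 513
print(min_partitions(one_tib))                   # 1025
```

This is only a lower bound for staying under the limit; a partition count chosen for good parallelism is usually much higher.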

Your cluster is ready for next tasks
[Diagram: MapReduce + Spark on top of YARN]
● Dynamic, faster to start up, resource reuse.
● Unified management infrastructure such as logging.

It is simply too
good to wait...

TRUST ME ;-)

Share your knowledge!
DO NOT
HIDE YOUR
EXPERIENCE

Questions and discussion
