Mariusz Gil

BIG data ecosystem
/ ABOUT ME /
This talk is about

BIG DATA
What is...

BIG DATA?
VOLUME: large amounts of data
VELOCITY: needs to be analyzed quickly
VARIETY: different types of structured and unstructured data
Big Data is data that is too large,
complex and dynamic for any conventional data tools
to capture, store, manage and analyze.
30 billion pieces of content were added in the past month
more than 2 billion videos were watched yesterday
more than 58 million messages were sent yesterday
/ MAIN QUESTIONS /
WHY?
49% IMPROVED RISK MANAGEMENT
32% INCREASED SALES FIGURES
36% IMPROVED MANAGEMENT CONTROL
40% IT ANALYSIS
43% MARKET-ORIENTED PRODUCT DEVELOPMENT
27% FINANCES AND ECONOMICS
690-node Hadoop cluster for predictions and analytics
HOW?
HDFS: HADOOP DISTRIBUTED FILE SYSTEM
YARN / MapReduce v2: DISTRIBUTED PROCESSING FRAMEWORK
HBASE: COLUMNAR STORAGE
HIVE: SQL DATA WAREHOUSE ENGINE
AVRO: DATA SERIALIZATION
MAHOUT: SCALABLE MACHINE LEARNING
PIG: SCRIPTING FOR LARGE DATA SETS
OOZIE: WORKFLOWS ORCHESTRATION
AMBARI: PROVISIONING, MANAGING AND MONITORING CLUSTERS
SQOOP: DATA EXCHANGE
ZOOKEEPER: DISTRIBUTED COORDINATION SERVICE
FLUME: LOG COLLECTOR
WHIRR: RUNNING CLOUD SERVICES
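
To make the "distributed processing framework" entry (YARN / MapReduce v2) more concrete, here is a minimal single-process sketch of the MapReduce programming model as a word count. It is not Hadoop code and does not use any Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, chosen only for illustration. A real cluster would run the map and reduce steps in parallel on many nodes, reading input from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    On a cluster the framework does this between map and reduce;
    here it is an in-memory dictionary (an illustrative assumption).
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data ecosystem", "big data tools"]
    print(word_count(sample))
```

The map and reduce functions share no state, which is the property that lets a framework spread them across many machines.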
We can choose from multiple

VENDORS
like Cloudera, Hortonworks or Amazon
Even from...
Can we get results

FASTER?
Cloudera Impala
Storm
Apache Drill
thanks
