Big data landscape

Big data
The technology landscape and its applications.

Natalino Busa - 12 Feb. 2013

Outline

● Big Data: Who art thou?
● Big Data: The technology landscape

● Hadoop: Overview
● Analytics & Machine Learning
● Opportunities


Hype cycle on new IT technologies

Gartner 2012


What is big data?

DATA (structured and un-structured, Logs, ETL, social)

Velocity Diversity Volume

BIG DATA

Hardware Software Services

Infrastructure Marketing (e.g. Unica) RDBMS
(Private) Cloud Analytics (Tableau) OLAP
Networking Modeling (SAS) Messaging


Big Data Heat map


How big is big?

Skytree (tm) defines the Analytics Requirements Index (ARI):

ARI = (# Rows × # Columns) / Time (secs)

Where # Rows = Number of records being analyzed

# Columns = Number of variables captured in each record

Time (secs) = The timeframe within which to complete the analysis

Example: for each page view (1000 views/sec), produce a personalized banner.
This requires analyzing 100 variables on 1000 records (historic data) every 1 ms.

ARI = (1000*100)/0.001 = 100 M values/sec
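
The banner example can be checked with a few lines of Python. This is a minimal sketch; the function name analytics_requirements_index is an assumption made for illustration, not part of any Skytree tooling.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """Return the ARI: number of values to analyze per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows * columns / seconds


# Banner example: 1000 historic records x 100 variables, every 1 ms.
ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"ARI = {ari / 1e6:.0f} M values/sec")
```

Halving the time budget or doubling the number of variables doubles the ARI, which is why real-time personalization ends up in "big data" territory even with modest record counts.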


What data?

Big Data can imply:

● Complex Data refactoring in Batch (lots of rows)
● Real-Time Event Processing (high-speed responses)
● Multidimensional analysis (lots of parameters)

● ... or any combination of those three
(Figure: three axes labeled Response time, Parameters, Entities)


More data

customers +
customers + products +
customers + products + surveys +
customers + products + surveys + transactions +
customers products surveys transactions social messages

Database Databases Federated Data Aggregated Data Linked Data Just Data

Structured Unstructured

● In today's IT environments there is a gradual shift
from structured data to unstructured data

RDBMSs are well suited to structured data ->
but: ETL becomes more extensive and complex; how do we deal with new data (structures)?

Map-Reduce and NoSQL systems are good with unstructured data ->
but: how do we query and analyze this data?


Big Data: how to deal with it

● Big Data at rest (storage, access)
● Big Data in motion (streaming, dataflows)

● Big Data analytics (OLAP, OTAP, BI)
● Big Data modeling (predictive, machine learning)


Big Data at rest

Analytical RDBMSs (EDW) Oracle, IBM, and various MPP's

Hadoop Distributed Systems HDFS (distributed file system)
HBase (Bigtable)

Batch Real-time

Cassandra HBase Analytics

Logs HDFS EDW EDW EDW

● Traditional EDW and distributed Big Data / NoSQL solutions are complementary to each other.
● These systems do not exclude each other and can coexist to form a full enterprise-level solution.


Big Data at rest

No need to take everything from the Hadoop ecosystem:

NoSQL DBMSs: Couchbase ( ++ reads, caching)
Cassandra ( ++ writes, OLAP)

... hybrid solutions are also possible:

HDFS + Cassandra : in-memory analytics + large DFS
HDFS + Solr/Lucene: fast text search on a distributed file system


Big Data in motion

Stream processing // Dataflow architectures

Used to support the automatic analysis of data-in-motion in real-time or near real-time.

- Identify meaningful patterns
- Trigger action to respond to them as quickly as possible.

- Storm (from Twitter)
dataflow processing framework
++ multi-language

- Akka (from Typesafe)
dataflow actor framework
++ speed

Both are:
Distributed, fault-tolerant, streaming
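
The "identify a pattern, then trigger an action" idea can be sketched in plain Python. This is only an illustration on a single machine: it does not use the Storm or Akka APIs, and the event shape, the detect_bursts function, and the burst rule (three events from one user within one second) are all assumptions. Events are assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, user id); hypothetical shape


def detect_bursts(events: Iterable[Event], window_secs: float, threshold: int,
                  on_match: Callable[[str, float], None]) -> None:
    """Call on_match(user, ts) when a user reaches `threshold` events in the window."""
    recent: Dict[str, Deque[float]] = {}  # user id -> timestamps inside the window
    for ts, user in events:
        window = recent.setdefault(user, deque())
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and ts - window[0] > window_secs:
            window.popleft()
        if len(window) >= threshold:
            on_match(user, ts)  # trigger the action as soon as the pattern appears
            window.clear()      # start counting afresh after triggering


alerts: List[Tuple[str, float]] = []
stream = [(0.0, "alice"), (0.2, "bob"), (0.4, "alice"), (0.5, "alice"), (5.0, "bob")]
detect_bursts(stream, window_secs=1.0, threshold=3,
              on_match=lambda user, ts: alerts.append((user, ts)))
print(alerts)
```

A distributed framework would add what this sketch lacks: partitioning the stream by key across workers and recovering state after failures.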


Big Data Landscape

(Diagram: the big data technology landscape. Labels recovered from the figure:)

● Machine Learning on Big Data: SAS, R over HDFS, Mahout
● Data Interfaces: REST
● Batch Analytics, Visualization, BI, Monitoring, Marketing
● Real-Time Analytics, Streaming
● Other labels: Unstructured Logs, Unstructured FS, EDW, OLAP, OTAP, flume, scribe, sqoop, hiho, HDFS, MapR, HBase, Cassandra, Hive, Pig, Impala, STORM

Lambda Architecture

Logic layer
Software as a Service
e.g. real-time predictor

from http://www.manning.com/marz/

Why do machine learning on big data

http://www.skytree.net/why-do-machine-learning-on-big-data/


Machine Learning: What?
SIMILARITY SEARCH
Similarity search provides a way to find the
objects that are the most similar, in an overall
sense, to the object(s) of interest.

PREDICTIVE ANALYTICS
Predictive analytics is the science of analyzing current and
historical facts/data to make predictions about future events.

CLUSTERING AND SEGMENTATION
Cluster analysis and segmentation represents a purely data
driven approach to grouping similar objects, behaviors, or
whatever is represented by the data.

From http://www.skytree.net/why-do-machine-learning-on-big-data/use-cases/
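
As an illustration of similarity search, the sketch below ranks items by cosine similarity to a query vector. Cosine similarity is one common choice among many, and the catalog, its feature vectors, and the function names are hypothetical. A brute-force scan like this only suits small data sets.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query: List[float], items: Dict[str, List[float]],
                 k: int = 1) -> List[Tuple[str, float]]:
    """Return the k items most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical catalog: each product is described by three numeric features.
catalog = {
    "laptop": [1.0, 0.0, 1.0],
    "tablet": [1.0, 0.2, 0.8],
    "novel": [0.0, 1.0, 0.0],
}
top = most_similar([1.0, 0.0, 0.9], catalog, k=2)
print([name for name, _ in top])
```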

Word Counting on Map Reduce
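
Word counting is the usual first example for Map-Reduce. The sketch below imitates the three phases (map, shuffle, reduce) in a single Python process. It is not Hadoop code, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the framework does this in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["big data at rest", "big data in motion"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both phases can run in parallel on different machines.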


Machine learning on Map Reduce

From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011



Machine Learning: Use Cases

E-Commerce / E-Tailing
● Product Recommendation Engines
● Cross Channel Analytics
● Events/Activity Behavior Segmentation

Product Marketing
● Campaign management and optimization
● Market and consumer segmentations
● Pricing Optimization

Customer Marketing
● Customer Churn Management
● (Mobile) User Behavior Prediction
● Offer Personalization


Big Data: Opportunities

Unstructured Data
● Clustering
● Distributed processing
● Distributed Storage

Modeling & Analytics
● Distributed Machine Learning
● Fast Online Analytics Cubes

Streaming and Real-Time processing
● Build RT profiles
● Decision trees and Predictions
● Offer Personalization


Thanks

linkedin:
www.linkedin.com/in/natalinobusa

blog:
www.natalinobusa.com
