An overview of several technologies that contribute to the Big Data landscape.
An introduction to the technology challenges of Big Data, followed by key open-source components that help with various Big Data aspects such as OLAP, real-time online
analytics, and machine learning on Map-Reduce. I conclude with an enumeration of the key areas where those technologies are most likely to unlock new opportunities for businesses.
6. How big is big?
SkyTree (tm) defines: Analytics Requirements Index (ARI)
ARI = (# Rows × # Columns) / Time (secs)
Where # Rows = Number of records being analyzed
# Columns = Number of variables captured in each record
Time (secs) = The timeframe within which to complete the analysis
Example: For each view (1000 views/sec) produce a personalized banner
I need to analyze 100 variables on 1000 records (historic data) every 1 ms
ARI = (1000*100)/0.001 = 100 M values/sec
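The banner example above can be checked with a few lines of Python. This is a minimal sketch of the ARI formula as stated on the slide; the function name `analytics_requirements_index` and its parameter names are illustrative assumptions, not part of any Skytree tooling.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """Return ARI = (rows * columns) / seconds, in values per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows * columns / seconds


# Banner example from the slide: 1000 records, 100 variables, 1 ms budget.
ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"ARI = {ari / 1e6:.0f} M values/sec")
```

Halving the time budget doubles the ARI, which is why response time matters as much as data volume.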
Natalino Busa - 12 Feb. 2013
7. What data?
Big Data can imply:
● Complex Data refactoring in Batch (lots of rows)
● Real-Time Event Processing (high-speed responses)
● Multidimensional analysis (lots of parameters)
● ... or any combination of those three
[Figure: three axes labeled Response time, Parameters, and Entities]
8. More data
[Figure: data sources accumulate step by step (customers, products, surveys, transactions, social messages) along a scale from Structured to Unstructured, labeled: Database, Databases, Federated Data, Aggregated Data, Linked Data, Just Data]
● In today's IT environments there is a gradual shift
from structured data to unstructured data.
RDBMSs are well suited to structured data ->
but: more, and more complex, ETL; how do we deal with new data (structures)?
Map-Reduce and NoSQL systems are good with unstructured data ->
but: how do we query and analyze this data?
9. Big Data: how to deal with it
● Big Data at rest (storage, access)
● Big Data in motion (streaming, dataflows)
● Big Data analytics (OLAP, OTAP, BI)
● Big Data modeling (predictive, machine learning)
10. Big Data at rest
Analytical RDBMSs (EDW): Oracle, IBM, and various MPPs
Hadoop distributed systems: HDFS (distributed file system), HBase (Big Table)
[Figure: batch and real-time paths; labels shown: Logs, HDFS, Cassandra, HBase, EDW, Analytics]
● Traditional EDW and distributed Big Data / NoSQL solutions are complementary to each other.
● These systems do not exclude one another and can coexist to form a full enterprise-level solution.
11. Big Data at rest
No need to get everything out of the Hadoop ecosystem:
NoSQL DBMSs: Couchbase (++ reads, caching)
Cassandra (++ writes, OLAP)
... hybrid solutions are also possible:
HDFS + Cassandra : in-memory analytics + large DFS
HDFS + Solr/Lucene: fast text search on a distributed file system
12. Big Data in motion
Stream processing // Dataflow architectures
Used to support the automatic analysis of data-in-motion in real-time or near real-time.
- Identify meaningful patterns
- Trigger action to respond to them as quickly as possible.
- Storm (from Twitter)
dataflow processing framework
++ multi-language
- Akka (from Typesafe)
dataflow actor framework
++ speed
Both are:
Distributed, fault-tolerant, streaming
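The "identify a pattern, then trigger an action" idea can be sketched in plain Python. This is not Storm or Akka code and it is neither distributed nor fault-tolerant; it only illustrates the dataflow shape. The event fields (`id`, `status`), the window size, and the threshold are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List


def sliding_windows(events: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield the most recent `size` events each time a new event arrives."""
    window = deque(maxlen=size)
    for event in events:
        window.append(event)
        yield list(window)


def detect_and_trigger(events: Iterable[dict], size: int, threshold: int,
                       action: Callable[[dict], None]) -> None:
    """Call `action` on the newest event when the window holds >= threshold failures."""
    for window in sliding_windows(events, size):
        failures = sum(1 for e in window if e["status"] == "fail")
        if failures >= threshold:
            action(window[-1])


# Hypothetical event stream: events 3, 4 and 5 are failures.
stream = [{"id": i, "status": "fail" if i in (3, 4, 5) else "ok"} for i in range(8)]
alerts: List[dict] = []
detect_and_trigger(stream, size=3, threshold=2, action=alerts.append)
print([a["id"] for a in alerts])
```

Because `sliding_windows` is a generator, each event is processed as it arrives instead of after the whole stream has been collected.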
13. Big Data Landscape
[Figure: Machine Learning on Big Data landscape. Labels shown: Logs, unstructured FS, REST, data interfaces, Flume, Scribe, Sqoop, HIHO, HDFS, MapR, HBase, Cassandra, EDW, Hive, Pig, Impala, Storm, OLAP, OTAP, SAS, R over HDFS, Mahout.
Uses shown: batch analytics, visualization, BI, monitoring, marketing, real-time analytics, streaming]
14. Lambda Architecture
Logic layer
Software as a Service
e.g. real-time predictor
from http://www.manning.com/marz/
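As a rough illustration of the Lambda Architecture idea (a batch view recomputed from the full dataset, a real-time view over recent events, and a query that merges the two), here is a toy sketch. The page-view counting use case, the function names, and the in-memory lists are assumptions for illustration only; a real deployment would use distributed batch and stream systems.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page-view counts from the full dataset."""
    return Counter(event["page"] for event in master_dataset)


def realtime_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, realtime):
    """Serving step: merge the batch and real-time views to answer a query."""
    return batch[page] + realtime[page]


# Hypothetical data: three events already covered by the batch run, two newer ones.
master = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
recent = [{"page": "home"}, {"page": "checkout"}]

batch = batch_view(master)
speed = realtime_view(recent)
print(query("home", batch, speed))
```

The batch view can be slow to rebuild because the real-time view covers the gap until the next batch run completes.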
15. Why do machine learning on big data
http://www.skytree.net/why-do-machine-learning-on-big-data/
16. Machine Learning: What?
SIMILARITY SEARCH
Similarity search provides a way to find the
objects that are the most similar, in an overall
sense, to the object(s) of interest.
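One common way to make "most similar in an overall sense" concrete is cosine similarity over feature vectors. The sketch below is a brute-force illustration, not the method used by any particular product; the customer names and feature values are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, items, k=1):
    """Return the names of the k items whose vectors are closest to `query`."""
    ranked = sorted(items, key=lambda name: cosine_similarity(query, items[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical customers described by three numeric features each.
customers = {
    "alice": [5.0, 1.0, 0.0],
    "bob": [4.0, 0.0, 1.0],
    "carol": [0.0, 5.0, 4.0],
}
nearest = most_similar([5.0, 0.5, 0.0], customers, k=2)
print(nearest)
```

This scans every item per query, which is fine for a sketch but is exactly the cost that grows with the number of rows and columns.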
PREDICTIVE ANALYTICS
Predictive analytics is the science of analyzing current and
historical facts/data to make predictions about future events.
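A very small instance of prediction from historical data is a least-squares line fit. The sketch below assumes a single numeric input and made-up monthly sales figures; real predictive models are usually far richer.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: sales observed in months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # prediction for month 6
print(forecast)
```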
CLUSTERING AND SEGMENTATION
Cluster analysis and segmentation represent a purely data-driven
approach to grouping similar objects, behaviors, or
whatever is represented by the data.
From http://www.skytree.net/why-do-machine-learning-on-big-data/use-cases/
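Clustering as a data-driven grouping can be illustrated with k-means (Lloyd's algorithm), one of several possible clustering methods. The sketch below uses fixed starting centroids and made-up 2-D points so that the result is deterministic; the function name and data are illustrative assumptions.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # New centroid is the mean of its cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
```

No labels are supplied: the two groups emerge only from the distances between points.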