The three generations of Big Data processing


Big Data is often characterized by the 3 “Vs”: variety, volume and velocity. While variety refers to the nature of the information (multiple sources, schema-less data, etc), both volume and velocity refer to processing issues that have to be addressed by different processing paradigms.

Assuming that the volumes of data are larger than those conventional relational database infrastructures can cope with, the processing solutions break down broadly into massively parallel processing (batch processing). Batch processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time: data is collected, entered and processed, and then the batch results are produced.

Several applications require real-time processing of data streams from heterogeneous sources, in contrast with the batch processing approach. Real-time processing involves continuous input, processing and output of data, and the data must be processed in a small time period (in or near real time). Domains of application include smart cities, entertainment and disaster management. Low latency is the main goal of this processing paradigm.

Batch processing provides strong results since it can use more data and, for example, perform better training of predictive models. But it is not feasible for domains where a low response time is a critical issue. Real-time processing solves this issue, but the analyzed information is limited in order to achieve low latency. Many domains require the benefits of both the batch and the real-time processing approaches, so a new processing paradigm is needed: the hybrid model. To obtain a complete result, the batch and real-time results must be queried and the results merged together. Synchronization, result composition and other non-trivial issues have to be addressed at this stage, which could be considered a key element of the hybrid model.

This talk will overview the time evolution of Big Data processing techniques, identify the main milestones (both technologies and scientific publications) and give an introduction to the key technologies needed to understand the complex Big Data processing domain.


Transcript of "The three generations of Big Data processing"

  1. 1. The three generations of Big Data processing Rubén Casado ruben.casado@treelogic.com
  2. 2. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  3. 3. About me :-)
  4. 4.    PhD in Software Engineering MSc in Computer Science BSc in Computer Science Work Experience Academics
  5. 5. About Treelogic
  6. 6. Treelogic is an R&D intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life
  7. 7. TREELOGIC – Distributor and Sales
  8. 8. International Projects National Projects Research Lines Computer Vision Regional Projects Solutions Security & Safety Big Data Teraherzt technology R&D Manag. System Justice Health Data science Social Media Analysis Semantics Internal Projects R&D Transport Financial services ICT tailored solutions
  9. 9. 7 ongoing FP7 projects ICT, SEC, OCEAN Coordinating 5 of them 3 ongoing Eurostars projects Coordinating all of them
  10. 10. 7 years’ experience in R&D projects Research & INNOVATION
  11. 11. www.datadopter.com
  12. 12. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  13. 13. What is Big Data? A massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques
  14. 14. How is Big Data? Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization - Gartner IT Glossary -
  15. 15. 3 problems Volume Variety Velocity
  16. 16. 3 solutions Batch processing Real-time NoSQL processing
  17. 17. 3 solutions Batch processing Real-time NoSQL processing
  18. 18. Batch processing • Scalable • Large amount of static data • Distributed • Parallel • Fault tolerant • High latency Volume
  19. 19. Real-time processing • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant Velocity
  20. 20. Hybrid computation model • Low latency • Massive data + Streaming data • Scalable • Combine batch and real-time results Volume Velocity
  21. 21. Hybrid computation model All data Batch processing Batch results Final results Combination New data Real-time processing Stream results
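The "Combination" step in the slide above is where the batch view and the real-time view are merged. A minimal Java sketch of that merge under stated assumptions: the per-station aggregates, class and field names are illustrative, not from the slides. Carrying the counts along with the sums makes the merge a correct weighted average rather than an average of averages.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the "Combination" step: merge a batch view (aggregate over
    // all historical data) with a real-time view (aggregate over data that
    // arrived after the last batch run).
    public class AverageMerger {

        public static class AvgCount {
            final double sum;   // accumulated SO2
            final long count;   // number of readings
            AvgCount(double sum, long count) { this.sum = sum; this.count = count; }
            double avg() { return count == 0 ? 0.0 : sum / count; }
        }

        // batchView: station id -> aggregate computed by the batch layer
        // streamView: station id -> aggregate computed by the real-time layer
        public static Map<String, Double> merge(Map<String, AvgCount> batchView,
                                                Map<String, AvgCount> streamView) {
            Map<String, Double> result = new HashMap<>();
            for (String station : batchView.keySet()) {
                AvgCount b = batchView.get(station);
                AvgCount s = streamView.getOrDefault(station, new AvgCount(0, 0));
                result.put(station, new AvgCount(b.sum + s.sum, b.count + s.count).avg());
            }
            return result;
        }
    }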
  22. 22. Processing Paradigms Inception: 2003. 1st Generation, Batch processing (2006): large amounts of static data, scalable solutions (Volume). 2nd Generation, Real-time processing (2010): computing streaming data, low latency (Velocity). 3rd Generation, Hybrid computation (2014): Lambda Architecture (Volume + Velocity).
  23. 23. 10 years of Big Data processing technologies. Batch era (2003-2009): The Google File System (2003); MapReduce: Simplified Data Processing on Large Clusters (2004); Doug Cutting starts developing Hadoop (2005); Yahoo! starts working on Hadoop (2006); Apache Hadoop is in production (2008); Facebook creates Hive; Yahoo! creates Pig. Real-time era (2010-2012): Yahoo! creates S4; Cloudera presents Flume; LinkedIn presents Kafka; Nathan Marz creates Storm. Hybrid era (2013-): Nathan Marz defines the Lambda Architecture; LinkedIn presents Samza; Google publishes MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
  24. 24. Processing Pipeline DATA DATA DATA ACQUISITION STORAGE ANALYSIS RESULTS
  25. 25. Air Quality case study  Static stations and mobile sensors in Asturias sending streaming data  Historical data of > 10 years  Monitoring, trends identification, predictions
  26. 26. Agenda 1. Big Data processing overview 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  27. 27. Batch processing technologies DATA DATA DATA ACQUISITION STORAGE ANALYSIS o HDFS commands o Sqoop o Flume o Scribe o HDFS o MapReduce o HBase o Hive o Pig o Cascading o Spark o Shark RESULTS
  28. 28. HDFS commands (BATCH / DATA ACQUISITION) • Import to HDFS: hadoop dfs -copyFromLocal <path-to-local> <path-to-remote> hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/
  29. 29. Sqoop (BATCH / DATA ACQUISITION) • Tool designed for transferring data between HDFS/HBase and structured datastores • Based on MapReduce • Includes connectors for multiple databases: MySQL, PostgreSQL, Oracle, SQL Server and DB2 • Generic JDBC connector • Java API
  30. 30. Sqoop (BATCH / DATA ACQUISITION) 1) Import data from the database to HDFS: sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --username user1 --password pass1 --warehouse-dir hdfs://rootHDFS/testDatabase -m 1 2) Analyze the data (Hadoop) 3) Export the results to the database: sqoop export --connect jdbc:mysql://localhost/testDatabase --username user1 --password pass1 --table <results-table> --export-dir hdfs://rootHDFS/testDatabase -m 1
  31. 31. Flume (BATCH / DATA ACQUISITION) • Service for collecting, aggregating, and moving large amounts of log data • Simple and flexible architecture based on streaming data flows • Reliability, scalability, extensibility, manageability • Supported log stream types: Avro, Syslog, NetCat
  32. 32. Flume (BATCH / DATA ACQUISITION) • Architecture: Source (waits for incoming events), Channel (stores the information until it is consumed by the sink), Sink (sends the information towards another agent or system). Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom. Channels: Memory, JDBC, File. Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom.
  33. 33. Flume (BATCH / DATA ACQUISITION) Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis.  Air quality syslogs: Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  34. 34. Scribe B A T C H DATA ACQUISITION • Server for aggregating log data streamed in real time from a large number of servers • There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. • The central scribe server(s) can write the messages to the files that are their final destination
  35. 35. Scribe (BATCH / DATA ACQUISITION) Sending a sensor message to a Scribe server:
    category = 'mobile'
    # '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …'
    message = sensor_log.readLine()
    log_entry = scribe.LogEntry(category, message)
    # Create a Scribe client
    client = scribe.Client(iprot=protocol, oprot=protocol)
    transport.open()
    result = client.Log(messages=[log_entry])
    transport.close()
  36. 36. HDFS (BATCH / DATA STORAGE) • Distributed file system for Hadoop • Master-Slave architecture (NameNode / DataNodes): the NameNode manages the directory tree and regulates access to files by clients; the DataNodes store the data • Files are split into blocks of the same size, and these blocks are stored and replicated in a set of DataNodes
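Besides the shell commands of slide 28, HDFS is also accessible programmatically. A minimal sketch of reading a file through the Hadoop FileSystem Java API; the path is a hypothetical file under the /hdfs/AirQuality/ directory used earlier:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Read a file from HDFS line by line through the FileSystem API.
    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/hdfs/AirQuality/part-00000"); // hypothetical path
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }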
  37. 37. HBase (BATCH / DATA STORAGE) • Open-source non-relational distributed column-oriented database modeled after Google's BigTable • Random, real-time read/write access to the data • Not a relational database: very light «schema» • Rows are stored in sorted order
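A minimal sketch of the random read/write access the slide mentions, using the classic (pre-1.0) HTable client API; the table name, the column family "m" and the row-key layout (station id plus date, so a station's rows sort together) are assumptions for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "air_quality"); // hypothetical table

            // Write one measurement cell
            Put put = new Put(Bytes.toBytes("1#2001-01-01"));
            put.add(Bytes.toBytes("m"), Bytes.toBytes("SO2"), Bytes.toBytes("7"));
            table.put(put);

            // Random read of the same cell
            Get get = new Get(Bytes.toBytes("1#2001-01-01"));
            Result result = table.get(get);
            byte[] so2 = result.getValue(Bytes.toBytes("m"), Bytes.toBytes("SO2"));
            System.out.println("SO2 = " + Bytes.toString(so2));

            table.close();
        }
    }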
  38. 38. MapReduce (BATCH / DATA ANALYTICS) • Framework for processing large amounts of data in parallel across a distributed cluster • Slightly inspired by the classic Divide and Conquer (D&C) strategy • The developer has to implement the Map and Reduce functions: Map takes the input, partitions it into smaller sub-problems and distributes them to worker nodes, parsed into the <K, V> format; Reduce collects the <K, List(V)> pairs and generates the results
  39. 39. MapReduce (BATCH / DATA ANALYTICS) • Design patterns: Joins (Replicated join, Reduce-side join, Semi join); Statistics (AVG, VAR, Count, …); Sorting (Secondary sort, Total Order Sort); Filtering (Top-K, Binning, …)
  40. 40. MapReduce (BATCH / DATA ANALYTICS) Obtain the SO2 average of each station. Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  41. 41. MapReduce (BATCH / DATA ANALYTICS) The mappers read the input records and emit the SO2 value as <Station_ID, SO2_value> pairs, e.g. <1, 6>, <1, 2>, <3, 1>, <1, 9>, <3, 9>, <2, 6>, <2, 0>, <2, 8>, …; the pairs are then shuffled to the reducers.
  42. 42. MapReduce (BATCH / DATA ANALYTICS) Each reducer receives <Station_ID, List<SO2_value>> pairs and computes the average for the station (sum, then divide), emitting <Station_ID, AVG_SO2> results such as <1, 2.695>, <2, 2.013>, <3, 3.562>.
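The two preceding slides can be made concrete in the Hadoop Java API. A minimal sketch, assuming the semicolon-separated format of slide 40 (station id in field 0, SO2 in field 5, quoted values):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SO2Average {

        // Emits <Station_ID, SO2_value> for every input record
        public static class SO2Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(";");
                if (fields.length < 6) return; // skip header or malformed lines
                try {
                    String station = fields[0].replace("\"", "").trim();
                    int so2 = Integer.parseInt(fields[5].replace("\"", "").trim());
                    context.write(new Text(station), new IntWritable(so2));
                } catch (NumberFormatException ignored) { }
            }
        }

        // Receives <Station_ID, List<SO2_value>> and computes the average
        public static class AvgReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0, count = 0;
                for (IntWritable v : values) { sum += v.get(); count++; }
                context.write(key, new DoubleWritable((double) sum / count));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "so2-average");
            job.setJarByClass(SO2Average.class);
            job.setMapperClass(SO2Mapper.class);
            job.setReducerClass(AvgReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }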
  43. 43. Hive (BATCH / DATA ANALYTICS) • Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets • Abstraction layer on top of MapReduce • SQL-like language called HiveQL • Metastore: central repository of Hive metadata
  44. 44. Hive (BATCH / DATA ANALYTICS) Obtain the SO2 average of each station:
    CREATE TABLE air_quality (Estacion int, Titulo string, latitud double, longitud double, Fecha string, SO2 int, NO int, CO float, …)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality;
    SELECT Titulo, avg(SO2) FROM air_quality GROUP BY Titulo;
  45. 45. Pig B A T C H • Platform for analyzing large data sets • High-level language for expressing data analysis programs. Pig Latin. Data flow programming language. • Abstraction layer on top of MapReduce • Procedural language DATA ANALYTICS
  46. 46. Pig (BATCH / DATA ANALYTICS) Obtain the SO2 average of each station:
    air_quality = load '/CalidadAire_Gijon' using PigStorage(';') AS (estacion:chararray, titulo:chararray, latitud:chararray, longitud:chararray, fecha:chararray, so2:int, no:chararray, co:chararray, pm10:chararray, o3:chararray, dd:chararray, vv:chararray, tmp:chararray, hr:chararray, prb:chararray, rs:chararray, ll:chararray, ben:chararray, tol:chararray, mxil:chararray, pm25:chararray);
    grouped = GROUP air_quality BY estacion;
    avg = FOREACH grouped GENERATE group, AVG(air_quality.so2);
    dump avg;
  47. 47. Cascading (BATCH / DATA ANALYTICS) • Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows • Makes development of complex Hadoop MapReduce workflows easy, in the same way as Pig does
  48. 48. Cascading (BATCH / DATA ANALYTICS) Obtain the SO2 average of each station:
    // define source and sink Taps
    Scheme sourceScheme = new TextLine();
    Tap source = new Hfs( sourceScheme, inputPath );
    Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
    Pipe assembly = new Pipe( "avgSO2" );
    assembly = new GroupBy( assembly, new Fields( "Estacion" ) );
    // for every tuple group, compute the SO2 average
    Aggregator avg = new Average( new Fields( "SO2" ) );
    assembly = new Every( assembly, avg );
    // connect the flow and execute it, blocking until complete
    Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly );
    flow.complete();
  49. 49. Spark (BATCH / DATA ANALYTICS) • Cluster computing system for faster data analytics • Not a modified version of Hadoop • Compatible with HDFS • In-memory data storage for very fast iterative processing • MapReduce-like engine • API in Scala, Java and Python
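A minimal sketch of the same SO2 average on Spark's Java API (Java 8 lambdas), assuming the HDFS path of the Pig example and the field layout of slide 40; sums and counts travel together so the average is computed in a single reduceByKey:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkSO2Average {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("so2-average");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile("hdfs:///CalidadAire_Gijon");

            // (station, (so2, 1)) pairs, then one reduce summing both fields
            JavaPairRDD<String, Tuple2<Integer, Integer>> sums = lines
                .mapToPair(line -> {
                    String[] f = line.split(";");
                    String station = f[0].replace("\"", "").trim();
                    int so2 = Integer.parseInt(f[5].replace("\"", "").trim());
                    return new Tuple2<>(station, new Tuple2<>(so2, 1));
                })
                .reduceByKey((a, b) -> new Tuple2<>(a._1 + b._1, a._2 + b._2));

            sums.mapValues(t -> (double) t._1 / t._2)
                .collect()
                .forEach(t -> System.out.println(t._1 + " -> " + t._2));

            sc.stop();
        }
    }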
  50. 50. Spark • B A T C H DATA ANALYTICS Hadoop is slow due to replication, serialization and IO tasks
  51. 51. Spark • 10x-100x faster B A T C H DATA ANALYTICS
  52. 52. Shark B A T C H • Large-scale data warehouse system for Spark • SQL on top of Spark • Actually Hive QL over Spark • Up to 100 x faster than Hive DATA ANALYTICS
  53. 53. Spark / Shark B A T C H DATA ANALYTICS Pros • Faster than Hadoop ecosystem • Easier to develop new applications o (Scala, Java and Python API) Cons • Not tested in extremely large clusters yet • Problems when Reducer’s data does not fit in memory
  54. 54. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  55. 55. Real-time processing technologies DATA DATA DATA ACQUISITION STORAGE ANALYSIS o Flume o Kafka o Flume o Kestrel o Storm o Trident o S4 o Spark Streaming RESULTS
  56. 56. Flume R E A L DATA ACQUISITION
  57. 57. Kafka • R E A L DATA STORAGE Kafka is a distributed, partitioned, replicated commit log service o Producer/Consumer model o Kafka maintains feeds of messages in categories called topics o Kafka is run as a cluster
  58. 58. Kafka (REAL-TIME / DATA STORAGE) Insert the AirQuality sensor log file into a Kafka cluster and consume the info:
    // new Producer
    Producer<String, String> producer = new Producer<String, String>(config);
    // open sensor log file
    BufferedReader br = …
    String line;
    while (true) {
        line = br.readLine();
        if (line == null)
            … // wait
        else
            producer.send(new KeyedMessage<String, String>(topic, line));
    }
  59. 59. Kafka (REAL-TIME / DATA STORAGE) AirQuality consumer:
    ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put(topic, new Integer(1));
    Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap);
    KafkaMessageStream stream = consumerMap.get(topic).get(0);
    ConsumerIterator it = stream.iterator();
    while (it.hasNext()) {
        // consume it.next()
    }
  60. 60. Kestrel (REAL-TIME / DATA STORAGE) • Simple distributed message queue • A single Kestrel server has a set of queues (strictly-ordered FIFO) • On a cluster of Kestrel servers, they don't know about each other and don't do any cross communication • Kestrel vs Kafka: Kafka consumers are cheaper (basically just the bandwidth usage); Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation; Kafka has significantly better throughput; Kestrel does not support ordered consumption
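Since Kestrel speaks the memcached text protocol (default port 22133), a standard memcached client is enough to enqueue and dequeue items. A minimal sketch using the spymemcached client, which is an assumption not mentioned on the slide; the queue name is hypothetical:

    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class KestrelExample {
        public static void main(String[] args) throws Exception {
            MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("localhost", 22133));

            // Enqueue a sensor reading (expiration 0 = never expires)
            client.set("air_quality", 0, "1;43.5298;-5.6734;2001-01-01;7;...");

            // Dequeue the next item, or null if the queue is empty
            Object item = client.get("air_quality");
            System.out.println(item);

            client.shutdown();
        }
    }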
  61. 61. Flume Interceptor (REAL-TIME / DATA ANALYTICS) • Interface org.apache.flume.interceptor.Interceptor • Can modify or even drop events based on any criteria • Flume supports chaining of interceptors • Types: Timestamp interceptor, Host interceptor, Static interceptor, UUID interceptor, Morphline interceptor, Regex Filtering interceptor, Regex Extractor interceptor
  62. 62. Flume (REAL-TIME / DATA ANALYTICS) • The sensors' information must be filtered by "Station 2": an interceptor will filter the information between Source and Channel. Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";
    "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";
    "3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";
    "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  63. 63. Flume (REAL-TIME / DATA ANALYTICS)
    class StationFilter implements Interceptor {
        …
        if (!station.equals("2"))
            discard the event;
        else
            keep the event;
        …
    }
    # Write format can be text or writable
    …
    # Defining channel (memory type)
    …
    # Defining source (syslog)
    …
    # Defining sink (HDFS)
    …
    # Defining interceptor
    agent.sources.source.interceptors = i1
    agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter
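The slide's pseudocode can be fleshed out into a working interceptor. A minimal sketch against the org.apache.flume.interceptor.Interceptor interface, keeping only events whose first CSV field is station "2" (field layout as in slide 62); returning null from intercept() drops the event:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    public class StationFilter implements Interceptor {

        @Override
        public void initialize() { }

        @Override
        public Event intercept(Event event) {
            String line = new String(event.getBody());
            String station = line.split(";")[0].replace("\"", "").trim();
            return "2".equals(station) ? event : null; // null drops the event
        }

        @Override
        public List<Event> intercept(List<Event> events) {
            List<Event> out = new ArrayList<Event>();
            for (Event e : events) {
                Event kept = intercept(e);
                if (kept != null) out.add(kept);
            }
            return out;
        }

        @Override
        public void close() { }

        public static class Builder implements Interceptor.Builder {
            @Override
            public Interceptor build() { return new StationFilter(); }
            @Override
            public void configure(Context context) { }
        }
    }

Note that in Flume's configuration the interceptor type normally points at the nested Builder class (here StationFilter$Builder) rather than at the interceptor class itself.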
  64. 64. Storm (REAL-TIME / DATA ANALYTICS) • Distributed and scalable real-time computation system • Doing for real-time processing what Hadoop did for batch processing • Topology: processing graph; each node contains processing logic (spouts and bolts), and links between nodes are streams of data. Spout: source of streams; reads a data source and emits the data into the topology as a stream. Bolt: processing unit; reads data from several streams, does some processing and possibly emits new streams. Stream: unbounded sequence of tuples; tuples can contain any serializable object • Hadoop / Storm equivalences: JobTracker / Nimbus, TaskTracker / Supervisor, Job / Topology
  65. 65. Storm (REAL-TIME / DATA ANALYTICS) • AirQuality average values, Step 1: build the topology. CAReader (Spout) -> LineProcessor (Bolt) -> AvgValues (Bolt)
  66. 66. Storm (REAL-TIME / DATA ANALYTICS) • AirQuality average values, Step 1: build the topology.
    TopologyBuilder AirAVG = new TopologyBuilder();
    AirAVG.setSpout("ca-reader", new CAReader(), 1);
    // shuffleGrouping -> even distribution
    AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3)
          .shuffleGrouping("ca-reader");
    // fieldsGrouping -> tuples with the same field value go to the same task
    AirAVG.setBolt("ca-avg-values", new AvgValues(), 2)
          .fieldsGrouping("ca-line-processor", new Fields("id"));
  67. 67. Storm (REAL-TIME / DATA ANALYTICS) • AirQuality average values, Step 2: CAReader implementation (IRichSpout interface).
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // initialize the file reader
        BufferedReader br = new …
        …
    }
    public void nextTuple() {
        String line = br.readLine();
        if (line == null) {
            return;
        } else
            collector.emit(new Values(line));
    }
  68. 68. Storm (REAL-TIME / DATA ANALYTICS) • AirQuality average values, Step 3: LineProcessor implementation (IBasicBolt interface).
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "stationName", "lat", …
    }
    public void execute(Tuple input, BasicOutputCollector collector) {
        collector.emit(new Values(input.getString(0).split(";")));
    }
  69. 69. Storm (REAL-TIME / DATA ANALYTICS) • AirQuality average values, Step 4: AvgValues implementation (IBasicBolt interface).
    public void execute(Tuple input, BasicOutputCollector collector) {
        // totals and counts are hashmaps with the accumulated values of each station
        if (totals.containsKey(id)) {
            item = totals.get(id);
            count = counts.get(id);
        } else {
            // create a new item
        }
        // update values
        item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2")));
        item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no")));
        …
    }
  70. 70. Trident (REAL-TIME / DATA ANALYTICS) • High-level abstraction on top of Storm: provides high-level operations (joins, filters, projections, aggregations, functions, …). Pros: easy, powerful and flexible; incremental topology development; exactly-once semantics. Cons: very few built-in functions; lower performance and higher latency than Storm
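A minimal sketch of the high-level style the slide describes, counting readings per station with Trident's built-in operations; the spout is assumed to emit tuples with a single "station" field, and the package names follow the pre-Apache (backtype/storm.trident) releases:

    import storm.trident.TridentTopology;
    import storm.trident.operation.builtin.Count;
    import storm.trident.spout.IBatchSpout;
    import storm.trident.testing.MemoryMapState;
    import backtype.storm.generated.StormTopology;
    import backtype.storm.tuple.Fields;

    public class TridentStationCount {
        // caSpout is assumed to emit tuples with a single "station" field
        public static StormTopology build(IBatchSpout caSpout) {
            TridentTopology topology = new TridentTopology();
            topology.newStream("ca-stream", caSpout)
                    .groupBy(new Fields("station"))
                    .persistentAggregate(new MemoryMapState.Factory(),
                                         new Count(), new Fields("count"));
            return topology.build();
        }
    }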
  71. 71. S4    Simple Scalable Streaming System R E A L DATA ANALYTICS Distributed, Scalable, Fault-tolerant platform for processing continuous unbounded streams of data Inspired by MapReduce and Actor models of computation o Data processing is based on Processing Elements (PE) o Messages are transmitted between PEs in the form of events (Key, Attributes) o Processing Nodes are the logical hosts to PEs
  72. 72. S4 • AirQuality average values … <bean id="split" class="SplitPE"> <property name="dispatcher" ref="dispatcher"/> <property name="keys"> <!-- Listen for both words and sentences --> <list> <value>LogLines *</value> </list> </property> </bean> <bean id="average" class="AveragePE"> <property name="keys"> <list> <value>CAItem stationId</value> </list> </property> </bean> … R E A L DATA ANALYTICS
  73. 73. Spark Streaming R E A L DATA ANALYTICS • Spark for real-time processing • Streaming computation as a series of very short batch jobs (windows) • Keep state in memory • API similar to Spark
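A minimal sketch of the micro-batch model the slide describes, counting station "2" readings in one-second batches on Spark Streaming's Java API; the socket source and port are assumptions for the sketch:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingStationCount {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("station-count");
            JavaStreamingContext jssc =
                new JavaStreamingContext(conf, new Duration(1000)); // 1 s batches

            JavaReceiverInputDStream<String> lines =
                jssc.socketTextStream("localhost", 9999);

            // Each batch is processed like a very short Spark job
            JavaDStream<Long> station2Count =
                lines.filter(line -> line.startsWith("\"2\"")).count();
            station2Count.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }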
  74. 74. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  75. 75. Hybrid Computation Model • We are in the beginning of this generation • Short-term Big Data processing goal • Abstraction layer over the Lambda Architecture • Promising technologies o SummingBird o Lambdoop
  76. 76. SummingBird (HYBRID COMPUTATION MODEL) • Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model • Scala syntax • The same logic can be executed in batch, real-time and hybrid batch/real-time mode
  77. 77. SummingBird HYBRID COMPUTATION MODEL
  78. 78. SummingBird (HYBRID COMPUTATION MODEL) Pros • Hybrid computation model • Same programming model for all processing paradigms • Extensible Cons • MapReduce-like programming • Scala • Not as abstract as some users would like
  79. 79. Lambdoop  Software abstraction layer over Open Source technologies o   HYBRID COMPUTATION MODEL Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process Same single API for the three processing paradigms o Batch processing similar to Pig / Cascading o Real time processing using built-in functions easier than Trident o Hybrid computation model transparent for the developer
  80. 80. Lambdoop (HYBRID COMPUTATION MODEL) Workflow model: streaming data or static data is wrapped as Data; a Workflow chains Data -> Operation -> Data
  81. 81. Lambdoop (HYBRID COMPUTATION MODEL)
    DataInput db_historical = new StaticCSVInput(URI_db);
    Data historical = new Data(db_historical);
    Workflow batch = new Workflow(historical);
    Operation filter = new Filter("Station", "=", 2);
    Operation select = new Select("Titulo", "SO2");
    Operation group = new Group("Titulo");
    Operation average = new Average("SO2");
    batch.add(filter);
    batch.add(select);
    batch.add(group);
    batch.add(average);
    batch.run();
    Data results = batch.getResults();
    …
  82. 82. Lambdoop (HYBRID COMPUTATION MODEL)
    DataInput stream_sensor = new StreamXMLInput(URI_sensor);
    Data sensor = new Data(stream_sensor);
    Workflow streaming = new Workflow(sensor, new WindowsTime(100));
    Operation filter = new Filter("Station", "=", 2);
    Operation select = new Select("Titulo", "SO2");
    Operation group = new Group("Titulo");
    Operation average = new Average("SO2");
    streaming.add(filter);
    streaming.add(select);
    streaming.add(group);
    streaming.add(average);
    streaming.run();
    while (true) {
        Data live_results = streaming.getResults();
        …
    }
  83. 83. Lambdoop (HYBRID COMPUTATION MODEL)
    DataInput historical = new StaticCSVInput(URI_folder);
    DataInput stream_sensor = new StreamXMLInput(URI_sensor);
    Data all_info = new Data(historical, stream_sensor);
    Workflow hybrid = new Workflow(all_info, new WindowsTime(1000));
    Operation filter = new Filter("Station", "=", 2);
    Operation select = new Select("Titulo", "SO2");
    Operation group = new Group("Titulo");
    Operation average = new Average("SO2");
    hybrid.add(filter);
    hybrid.add(select);
    hybrid.add(group);
    hybrid.add(average);
    hybrid.run();
    Data updated_results = hybrid.getResults();
  84. 84. Lambdoop (HYBRID COMPUTATION MODEL) Pros • High abstraction layer for all processing models • All steps in the data processing pipeline • Same Java API for all programming paradigms • Extensible Cons • Ongoing project • Not open-source yet • Not tested in large clusters yet
  85. 85. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  86. 86. Conclusions • Big Data is not only Hadoop • Identify the processing requirements of your project • Analyze the alternatives for all steps in the data pipeline • The battle for real-time processing is open • Stay tuned for the hybrid computation model
  87. 87. Thanks for your attention! Contact us: ruben.casado@treelogic.com info@datadopter.com www.datadopter.com www.treelogic.com MADRID Avda. de Manoteras, 38 Oficina D507 28050 Madrid · España ASTURIAS Parque Tecnológico de Asturias Parcela 30 33428 Llanera - Asturias · España 902 286 386