Dr. Rubén Casado 
ruben.casado@treelogic.com 
ruben_casado 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades (Processing paradigms in Big Data: current state, trends and opportunities) 
Universidad Complutense de Madrid 
19 November 2014
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Academics 
• PhD in Software Engineering 
• MSc in Computer Science 
• BSc in Computer Science 
Work Experience
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
What is Big Data? 
A massive volume of both structured and unstructured data that is so large it is difficult to process with traditional database and software techniques.
"Big Data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." 
- Gartner IT Glossary -
3 problems: Volume, Variety, Velocity 
3 solutions: Batch processing → Volume · NoSQL → Variety · Streaming processing → Velocity
Science or Engineering? 
• The Vs: Volume, Variety, Velocity + Value 
• Software Engineering vs. Data Science
NoSQL 
Relational Databases 
• Schema based 
• ACID (Atomicity, Consistency, Isolation, Durability) 
• Performance penalty 
• Scalability issues 
NoSQL 
• Not Only SQL 
• Families of solutions 
• Google BigTable, Amazon Dynamo 
• BASE = Basically Available, Soft state, Eventually consistent 
• CAP = Consistency, Availability, Partition tolerance
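The BASE trade-off above gives up immediate consistency for availability: a write may land on one replica before the others, so a read from another replica can briefly return stale data until background synchronization converges them. A minimal Python sketch of that behavior (illustrative only, not modeled on any particular NoSQL store):

```python
# Illustrative sketch of eventual consistency (BASE); toy code, no real database.
class Replica:
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = value

    def read(self, key):
        return self.store.get(key)

def anti_entropy(source, target):
    """Background sync: copy keys from one replica to another."""
    target.store.update(source.store)

r1, r2 = Replica(), Replica()
r1.write("CR7", "Cristiano Ronaldo")   # write accepted by one replica only
stale = r2.read("CR7")                 # other replica has not seen it yet -> None
anti_entropy(r1, r2)                   # replicas converge eventually
fresh = r2.read("CR7")                 # now consistent
```

Real stores converge via mechanisms such as anti-entropy or read repair; the point here is only that `stale` and `fresh` can differ for the same key.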
NoSQL 
• Key-value 
o Key: ID; Value: associated data (a dictionary) 
o LinkedIn Voldemort; Riak, Redis; Memcache, Membase 
o Example: CR7: 'Cristiano Ronaldo' 
• Document 
o More complex than K-V; documents are indexed by ID; multiple indexes 
o MongoDB, CouchDB 
o Example: CR7: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29} 
• Column 
o Tables with predefined families of fields; fields within families are flexible; vertical and horizontal partitioning 
o HBase, Cassandra 
o Example: CR7: [Personal: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}, Job: {Team: 'R. Madrid', Salary: 20,000,000}] 
• Graph 
o Nodes and relationships 
o Neo4j, FlockDB, OrientDB 
o Example: [Cristiano] -is_called-> [CR], [Cristiano] -plays_for-> [R.Madrid]
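To make the four families concrete, the same player record from the slide can be shaped for each model using plain Python structures as stand-ins (the field names follow the slide's example; nothing here is a real database API):

```python
# One record, four NoSQL shapes (plain Python stand-ins).

# Key-value: an opaque value addressed by a key
kv = {"CR7": "Cristiano Ronaldo"}

# Document: a nested, self-describing structure indexed by ID
doc = {"CR7": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29}}

# Column: predefined column families, flexible fields inside each family
col = {"CR7": {"Personal": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29},
               "Job": {"Team": "R. Madrid", "Salary": 20_000_000}}}

# Graph: nodes plus labelled relationships
nodes = {"CR", "Cristiano", "R.Madrid"}
edges = [("Cristiano", "is_called", "CR"),
         ("Cristiano", "plays_for", "R.Madrid")]
```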
Batch processing → Volume 
• Scalable 
• Large amounts of static data 
• Distributed 
• Parallel 
• Fault tolerant 
• High latency
Streaming processing → Velocity 
• Low latency 
• Continuous unbounded streams of data 
• Distributed 
• Parallel 
• Fault tolerant
Hybrid computation model → Volume + Velocity 
• Low latency: real-time 
• Massive data-at-rest + data-in-motion 
• Scalable 
• Combines batch and streaming results
Hybrid computation model 
• All data → Batch processing → Batch results 
• New data → Streaming processing → Stream results 
• Batch results + Stream results → Combination → Final results
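The flow above can be sketched as a query that merges a precomputed batch view with an incrementally maintained stream view. A toy Python sketch of the Lambda idea, with made-up numbers and a sum standing in for the real computation:

```python
# Toy Lambda-style merge: batch view over all absorbed data plus an
# incremental stream view over data that arrived since the last batch run.
def batch_view(all_data):
    """Recomputed periodically over the complete, static dataset."""
    return sum(all_data)

def stream_view(new_data):
    """Maintained incrementally over recent, not-yet-batched data."""
    return sum(new_data)

def query(all_data, new_data):
    # Final result = combination of batch results and stream results
    return batch_view(all_data) + stream_view(new_data)

historical = [10, 20, 30]   # already absorbed by the batch layer
recent = [5, 5]             # still only in the speed layer
total = query(historical, recent)
```

Once the next batch run absorbs the recent data, the stream view is reset and the query result stays the same: that is the correctness argument behind the model.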
• Batch processing: large amounts of static data, scalable solutions → Volume 
• Streaming processing: computing over streaming data, low latency → Velocity 
• Hybrid computation (Lambda Architecture) → Volume + Velocity
Processing Paradigms 
Inception: 2003 · 1st Generation (Batch): 2006 · 2nd Generation (Streaming): 2010 · 3rd Generation (Hybrid): 2014
+10 years of Big Data processing technologies (Batch · Streaming · Hybrid) 
2003: The Google File System 
2004: MapReduce: Simplified Data Processing on Large Clusters 
2005: Doug Cutting starts developing Hadoop 
2006: Yahoo! starts working on Hadoop 
2008: Apache Hadoop is in production 
2009: Facebook creates Hive; Yahoo! creates Pig 
2010: Yahoo! creates S4; Cloudera presents Flume 
2011: Nathan Marz creates Storm; LinkedIn presents Kafka 
2012: Nathan Marz defines the Lambda Architecture 
2013: MillWheel: Fault-Tolerant Stream Processing at Internet Scale; LinkedIn presents Samza 
2014: Spark stack is open sourced; Lambdoop & Summingbird first steps; Stratosphere becomes Apache Flink
Processing Pipeline: DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS
Air Quality case study 
• Static stations and mobile sensors in Asturias sending streaming data 
• Historical data of more than 10 years 
• Monitoring, trend identification, predictions
Agenda 
1. Big Data processing overview 
2. Batch processing 
3. Real-time processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Batch processing technologies 
• DATA ACQUISITION: HDFS commands, Sqoop, Flume, Scribe 
• DATA STORAGE: HDFS, HBase 
• DATA ANALYSIS: MapReduce, Hive, Pig, Cascading, Spark, Spark SQL (Shark) 
• RESULTS
HDFS commands (DATA ACQUISITION · BATCH) 
• Import to HDFS: 
hadoop dfs -copyFromLocal <path-to-local> <path-to-remote> 
hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/
Sqoop (DATA ACQUISITION · BATCH) 
• Tool designed for transferring data between HDFS/HBase and structured datastores 
• Based on MapReduce 
• Includes connectors for multiple databases: MySQL, PostgreSQL, Oracle, SQL Server, DB2, plus a generic JDBC connector 
• Java API
Sqoop (DATA ACQUISITION · BATCH) 
1) Import data from database to HDFS: 
sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 
2) Analyze data (Hadoop) 
3) Export results to database: 
sqoop export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1
Flume (DATA ACQUISITION · BATCH) 
• Service for collecting, aggregating, and moving large amounts of log data 
• Simple and flexible architecture based on streaming data flows 
• Reliability, scalability, extensibility, manageability 
• Supported log stream types: Avro, Syslog, Netcat
Flume (DATA ACQUISITION · BATCH) 
• Architecture 
o Source: waits for events 
o Channel: stores the information until it is consumed by the sink 
o Sink: sends the information towards another agent or system 
• Built-in types 
o Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom 
o Channels: Memory, JDBC, File 
o Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom
Flume (DATA ACQUISITION · BATCH) 
• Air quality syslogs 
Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis. 
Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
Scribe (DATA ACQUISITION · BATCH) 
• Server for aggregating log data streamed in real time from a large number of servers 
• A Scribe server runs on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups 
• The central Scribe server(s) write the messages to the files that are their final destination
Scribe (DATA ACQUISITION · BATCH) 
• Sending a sensor message to a Scribe server 
category = 'mobile' 
# '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …' 
message = sensor_log.readline() 
log_entry = scribe.LogEntry(category, message) 
# create a Scribe client 
client = scribe.Client(iprot=protocol, oprot=protocol) 
transport.open() 
result = client.Log(messages=[log_entry]) 
transport.close()
HDFS (DATA STORAGE · BATCH) 
• Distributed file system for Hadoop 
• Master-slave architecture (NameNode - DataNodes) 
o NameNode: manages the directory tree and regulates access to files by clients 
o DataNodes: store the data 
• Files are split into blocks of the same size; the blocks are stored and replicated across a set of DataNodes
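The block-splitting and replication idea can be sketched in a few lines of Python. Toy values and a hypothetical round-robin placement are used here; real HDFS uses 64/128 MB blocks, a default replication factor of 3, and rack-aware placement decided by the NameNode:

```python
# Toy sketch of HDFS-style block splitting and replica placement.
BLOCK_SIZE = 4          # bytes, for illustration only (HDFS uses 64/128 MB)
REPLICATION = 2         # illustrative (HDFS default is 3)
datanodes = ["dn1", "dn2", "dn3"]

def split_blocks(data, size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks (last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    """NameNode-style bookkeeping: map block index -> list of DataNodes.
    Round-robin stands in for the real rack-aware policy."""
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

blocks = split_blocks(b"0123456789")
placement = place_replicas(blocks, datanodes)
```

Losing one DataNode then leaves every block readable from its other replica, which is the fault-tolerance property the slide refers to.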
HBase (DATA STORAGE · BATCH) 
• Open-source, non-relational, distributed, column-oriented database modeled after Google's BigTable 
• Random, real-time read/write access to the data 
• Not a relational database 
o Very light "schema" 
• Rows are stored in sorted order
MapReduce (DATA ANALYTICS · BATCH) 
• Framework for processing large amounts of data in parallel across a distributed cluster 
• Loosely inspired by the classic Divide and Conquer (D&C) strategy 
• The developer has to implement Map and Reduce functions: 
o Map: takes the input, partitions it into smaller sub-problems and distributes them to worker nodes as <K, V> pairs 
o Reduce: collects the <K, List(V)> pairs and generates the results
MapReduce design patterns (DATA ANALYTICS · BATCH) 
• Joins: reduce-side join, replicated join, semi-join 
• Sorting: secondary sort, total order sort 
• Filtering 
• Statistics: AVG, VAR, count, … 
• Top-K 
• Binning 
• …
MapReduce (DATA ANALYTICS · BATCH) 
• Obtain the SO2 average of each station 
Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
MapReduce (DATA ANALYTICS · BATCH) 
• Mappers read the input records and emit the SO2 value as <Station_ID, SO2_value> pairs, e.g. <1, 6>, <1, 2>, <1, 9>, <2, 6>, <2, 0>, <2, 8>, <3, 1>, <3, 9>, … 
• Shuffling groups the emitted pairs by Station_ID before they reach the reducers.
MapReduce (DATA ANALYTICS · BATCH) 
• Each reducer receives <Station_ID, List(SO2_value)> and computes the average for the station (sum, then divide): 
Station_ID, AVG_SO2 
1, 2.013 
2, 2.695 
3, 3.562
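The map → shuffle → reduce flow of this example can be simulated in plain Python, with made-up sample rows following the slide's field order (no Hadoop involved; this only illustrates the data movement):

```python
# Minimal simulation of map -> shuffle -> reduce for the per-station SO2
# average. Sample rows are invented; field positions follow the slide's CSV.
from collections import defaultdict

lines = [
    '"1";"Estacion A";"43.52";"-5.67";"2001-01-01";"7"',
    '"1";"Estacion A";"43.52";"-5.67";"2001-01-01";"5"',
    '"2";"Estacion B";"43.54";"-5.66";"2001-01-01";"4"',
]

def mapper(line):
    fields = [f.strip('"') for f in line.split(";")]
    yield fields[0], int(fields[5])          # emit <Station_ID, SO2_value>

def shuffle(pairs):
    groups = defaultdict(list)               # group by key: <ID, List(SO2)>
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values) / len(values)    # sum, then divide

pairs = [p for line in lines for p in mapper(line)]
results = dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the mappers and reducers run on different nodes and the shuffle happens over the network; the data movement, however, is exactly this.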
Hive (DATA ANALYTICS · BATCH) 
• Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets 
• Abstraction layer on top of MapReduce 
• SQL-like language called HiveQL 
• Metastore: central repository of Hive metadata
CREATE TABLE air_quality (Estacion INT, Titulo STRING, latitud DOUBLE, longitud DOUBLE, Fecha STRING, SO2 INT, NO INT, CO FLOAT, …) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE; 

LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality; 

Hive (DATA ANALYTICS · BATCH) 
• Obtain the SO2 average of each station: 
SELECT Titulo, avg(SO2) 
FROM air_quality 
GROUP BY Estacion, Titulo;
Pig (DATA ANALYTICS · BATCH) 
• Platform for analyzing large data sets 
• High-level language for expressing data analysis programs: Pig Latin, a data-flow programming language 
• Abstraction layer on top of MapReduce 
• Procedural language
Pig (DATA ANALYTICS · BATCH) 
• Obtain the SO2 average of each station: 
air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';') AS (estacion:chararray, titulo:chararray, latitud:chararray, longitud:chararray, fecha:chararray, so2:double, no:chararray, co:chararray, pm10:chararray, o3:chararray, dd:chararray, vv:chararray, tmp:chararray, hr:chararray, prb:chararray, rs:chararray, ll:chararray, ben:chararray, tol:chararray, mxil:chararray, pm25:chararray); 
grouped = GROUP air_quality BY estacion; 
avg_so2 = FOREACH grouped GENERATE group, AVG(air_quality.so2); 
DUMP avg_so2;
Cascading (DATA ANALYTICS · BATCH) 
• Cascading is a data processing API and query planner used for defining, sharing, and executing data-processing workflows 
• Makes development of complex Hadoop MapReduce workflows easy 
• Similar in purpose to Pig
Cascading (DATA ANALYTICS · BATCH) 
• Obtain the SO2 average of each station: 
// define source and sink Taps 
Tap source = new Hfs(sourceScheme, inputPath); 
Scheme sinkScheme = new TextLine(new Fields("Estacion", "SO2")); 
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE); 
Pipe assembly = new Pipe("avgSO2"); 
assembly = new GroupBy(assembly, new Fields("Estacion")); 
// for every Tuple group, compute the SO2 average 
Aggregator avg = new Average(new Fields("SO2")); 
assembly = new Every(assembly, avg); 
// tell Hadoop which jar file to use 
Flow flow = flowConnector.connect("avg-SO2", source, sink, assembly); 
// execute the flow, block until complete 
flow.complete();
Spark (DATA ANALYTICS · BATCH) 
• Cluster computing system for faster data analytics 
• Not a modified version of Hadoop 
• Compatible with HDFS 
• In-memory data storage for very fast iterative processing 
• MapReduce-like engine 
• APIs in Scala, Java and Python
Spark (DATA ANALYTICS · BATCH) 
• Hadoop is slow due to replication, serialization and I/O tasks 
• Spark is 10x-100x faster
Spark SQL (DATA ANALYTICS · BATCH) 
• Large-scale data warehouse system for Spark 
• SQL on top of Spark (aka Shark) 
• Essentially HiveQL over Spark 
• Up to 100x faster than Hive
Spark (DATA ANALYTICS · BATCH) 
Pros 
• Faster than the Hadoop ecosystem 
• Easier to develop new applications (Scala, Java and Python APIs) 
Cons 
• Not tested in extremely large clusters yet 
• Problems when a reducer's data does not fit in memory
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Real-time processing technologies 
• DATA ACQUISITION: Flume 
• DATA STORAGE: Kafka, Kestrel 
• DATA ANALYSIS: Flume (interceptors), Storm, Trident, S4, Spark Streaming 
• RESULTS
Flume (DATA ACQUISITION · STREAM)
Kafka (DATA STORAGE · STREAM) 
• Kafka is a distributed, partitioned, replicated commit log service 
o Producer/Consumer model 
o Kafka maintains feeds of messages in categories called topics 
o Kafka is run as a cluster
Kafka (DATA STORAGE · STREAM) 
Insert the AirQuality sensor log file into a Kafka cluster and consume the info. 
// new Producer 
Producer<String, String> producer = new Producer<String, String>(config); 
// open sensor log file 
BufferedReader br = … 
String line; 
while (true) { 
    line = br.readLine(); 
    if (line == null) 
        … // wait 
    else 
        producer.send(new KeyedMessage<String, String>(topic, line)); 
}
Kafka (DATA STORAGE · STREAM) 
AirQuality consumer 
ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config); 
Map<String, Integer> topicCountMap = new HashMap<String, Integer>(); 
topicCountMap.put(topic, new Integer(1)); 
Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap); 
KafkaMessageStream stream = consumerMap.get(topic).get(0); 
ConsumerIterator it = stream.iterator(); 
while (it.hasNext()) { 
    // consume it.next() 
}
Kestrel (DATA STORAGE · STREAM) 
• Simple distributed message queue 
• A single Kestrel server has a set of queues (strictly-ordered FIFO) 
• In a cluster of Kestrel servers, the servers don't know about each other and do no cross communication 
• Kestrel vs. Kafka 
o Kafka consumers are cheaper (basically just the bandwidth usage) 
o Kestrel does not depend on ZooKeeper, so it is operationally less complex if you don't already have a ZooKeeper installation 
o Kafka has significantly better throughput 
o Kestrel does not support ordered consumption
Flume interceptors (DATA ANALYTICS · STREAM) 
• Interface org.apache.flume.interceptor.Interceptor 
• Can modify or even drop events based on any criteria 
• Flume supports chaining of interceptors 
• Types: Timestamp, Host, Static, UUID, Morphline, Regex Filtering, Regex Extractor
Flume (DATA ANALYTICS · STREAM) 
• The sensors' information must be filtered to keep only Station 2 
o An interceptor filters events between the Source and the Channel. 
Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; 
"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35";"981"; "23"; 
"3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; 
"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62";"983"; "23";
Flume (DATA ANALYTICS · STREAM) 
# Write format can be text or writable 
… 
# Defining channel - memory type 
… 
# Defining source - syslog 
… 
# Defining sink - HDFS 
… 
# Defining interceptor 
agent.sources.source.interceptors = i1 
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter 

class StationFilter implements Interceptor 
… 
if (!station.equals("2")) 
    discard event; 
else 
    keep event;
Storm (DATA ANALYTICS · STREAM) 
• Distributed and scalable real-time computation system 
• Doing for real-time processing what Hadoop did for batch processing 
• Hadoop ↔ Storm: JobTracker ↔ Nimbus; TaskTracker ↔ Supervisor; Job ↔ Topology 
• Topology: processing graph. Each node contains processing logic (spouts and bolts); links between nodes are streams of data 
o Spout: source of streams. Reads a data source and emits the data into the topology as a stream 
o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams 
o Stream: unbounded sequence of tuples. Tuples can contain any serializable object
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 1: build the topology 
CAReader (Spout) → LineProcessor (Bolt) → AvgValues (Bolt)
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 1: build the topology 
TopologyBuilder AirAVG = new TopologyBuilder(); 
AirAVG.setSpout("ca-reader", new CAReader(), 1); 
// shuffleGrouping -> even distribution 
AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3) 
    .shuffleGrouping("ca-reader"); 
// fieldsGrouping -> tuples with the same field value go to the same task 
AirAVG.setBolt("ca-avg-values", new AvgValues(), 2) 
    .fieldsGrouping("ca-line-processor", new Fields("id"));
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 2: CAReader implementation (IRichSpout interface) 
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { 
    // initialize the input file 
    BufferedReader br = new … 
    … 
} 

public void nextTuple() { 
    String line = br.readLine(); 
    if (line == null) { 
        return; 
    } else { 
        collector.emit(new Values(line)); 
    } 
}
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 3: LineProcessor implementation (IBasicBolt interface) 
public void declareOutputFields(OutputFieldsDeclarer declarer) { 
    declarer.declare(new Fields("id", "stationName", "lat", … 
} 

public void execute(Tuple input, BasicOutputCollector collector) { 
    collector.emit(new Values(input.getString(0).split(";"))); 
}
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 4: AvgValues implementation (IBasicBolt interface) 
public void execute(Tuple input, BasicOutputCollector collector) { 
    // totals and counts are hashmaps with each station's accumulated values 
    if (totals.containsKey(id)) { 
        item = totals.get(id); 
        count = counts.get(id); 
    } else { 
        // create new item 
    } 
    // update values 
    item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2"))); 
    item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no"))); 
    … 
}
Trident (DATA ANALYTICS · STREAM) 
• High-level abstraction on top of Storm 
o Provides high-level operations (joins, filters, projections, aggregations, functions…) 
Pros 
o Easy, powerful and flexible 
o Incremental topology development 
o Exactly-once semantics 
Cons 
o Very few built-in functions 
o Lower performance and higher latency than raw Storm
S4 (DATA ANALYTICS · STREAM) 
• Simple Scalable Streaming System 
• Distributed, scalable, fault-tolerant platform for processing continuous unbounded streams of data 
• Inspired by MapReduce and the Actor model of computation 
o Data processing is based on Processing Elements (PEs) 
o Messages are transmitted between PEs in the form of events (Key, Attributes) 
o Processing Nodes are the logical hosts of PEs
S4 (DATA ANALYTICS · STREAM) 
• AirQuality average values 
… 
<bean id="split" class="SplitPE"> 
  <property name="dispatcher" ref="dispatcher"/> 
  <property name="keys"> 
    <!-- Listen for incoming log lines --> 
    <list> 
      <value>LogLines *</value> 
    </list> 
  </property> 
</bean> 
<bean id="average" class="AveragePE"> 
  <property name="keys"> 
    <list> 
      <value>CAItem stationId</value> 
    </list> 
  </property> 
</bean> 
…
Spark Streaming (DATA ANALYTICS · STREAM) 
• Spark for real-time processing 
• Streaming computation as a series of very short batch jobs (windows) 
• Keeps state in memory 
• API similar to Spark's
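The micro-batch idea can be sketched in plain Python: chop the incoming stream into fixed-size windows and run the same "batch" computation on each one (illustrative values; real Spark Streaming windows are time-based DStreams, not element counts):

```python
# Toy sketch of micro-batch streaming: an unbounded stream is processed as a
# series of small batches, each handled by an ordinary batch computation.
def micro_batches(stream, window_size):
    """Yield consecutive fixed-size windows of the stream."""
    for i in range(0, len(stream), window_size):
        yield stream[i:i + window_size]

def batch_job(values):
    """The per-window computation: here, a simple average."""
    return sum(values) / len(values)

so2_stream = [7, 5, 6, 4, 9, 3]           # made-up SO2 readings
window_averages = [batch_job(w) for w in micro_batches(so2_stream, 2)]
```

This is why latency in Spark Streaming is bounded below by the window length: a result only appears once its micro-batch closes.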
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Hybrid Computation Model 
• We are at the beginning of this generation 
• The short-term Big Data processing goal 
• An abstraction layer over the Lambda Architecture 
• Promising technologies: Summingbird, Lambdoop
Summingbird (HYBRID COMPUTATION MODEL) 
• Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model 
• Scala syntax 
• The same logic can be executed in batch, real-time and hybrid batch/real-time mode
Summingbird (HYBRID COMPUTATION MODEL) 
Pros 
• Hybrid computation model 
• Same programming model for all processing paradigms 
• Extensible 
Cons 
• MapReduce-like programming 
• Scala 
• Not as abstract as some users would like
Lambdoop (HYBRID COMPUTATION MODEL) 
• Software abstraction layer over open-source technologies 
o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident 
• Common patterns and operations (aggregation, filtering, statistics…) already implemented: no MapReduce-like processes 
• The same single API for the three processing paradigms 
o Batch processing similar to Pig / Cascading 
o Real-time processing using built-in functions, easier than Trident 
o Hybrid computation model transparent for the developer
Lambdoop (HYBRID COMPUTATION MODEL) 
• Building blocks: Data (static or streaming), Operation, Workflow
DataInput db_historical = new StaticCSVInput(URI_db); 
Data historical = new Data(db_historical); 
Workflow batch = new Workflow(historical); 
Operation filter = new Filter("Station", "=", 2); 
Operation select = new Select("Titulo", "SO2"); 
Operation group = new Group("Titulo"); 
Operation average = new Average("SO2"); 
batch.add(filter); 
batch.add(select); 
batch.add(group); 
batch.add(average); 
batch.run(); 
Data results = batch.getResults(); 
… 
Lambdoop (HYBRID COMPUTATION MODEL)
DataInput stream_sensor = new StreamXMLInput(URI_sensor); 
Data sensor = new Data(stream_sensor); 
Workflow streaming = new Workflow(sensor, new WindowsTime(100)); 
Operation filter = new Filter("Station", "=", 2); 
Operation select = new Select("Titulo", "SO2"); 
Operation group = new Group("Titulo"); 
Operation average = new Average("SO2"); 
streaming.add(filter); 
streaming.add(select); 
streaming.add(group); 
streaming.add(average); 
streaming.run(); 
while (true) { 
    Data live_results = streaming.getResults(); 
    … 
} 
Lambdoop (HYBRID COMPUTATION MODEL)
DataInput historical = new StaticCSVInput(URI_folder); 
DataInput stream_sensor = new StreamXMLInput(URI_sensor); 
Data all_info = new Data(historical, stream_sensor); 
Workflow hybrid = new Workflow(all_info, new WindowsTime(1000)); 
Operation filter = new Filter("Station", "=", 2); 
Operation select = new Select("Titulo", "SO2"); 
Operation group = new Group("Titulo"); 
Operation average = new Average("SO2"); 
hybrid.add(filter); 
hybrid.add(select); 
hybrid.add(group); 
hybrid.add(average); 
hybrid.run(); 
Data updated_results = hybrid.getResults(); 
Lambdoop (HYBRID COMPUTATION MODEL)
Lambdoop (HYBRID COMPUTATION MODEL) 
Pros 
• High abstraction layer for all processing models 
• Covers all steps in the data processing pipeline 
• Same Java API for all programming paradigms 
• Extensible 
Cons 
• Ongoing project 
• Not open-source yet 
• Not tested in large clusters yet
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Open Issues 
• Interoperability between well-known techniques/technologies (SQL, R) and Big Data platforms (Hadoop, Spark) 
• European technologies (Stratosphere / Apache Flink) 
• Massive streaming machine learning 
• Real-time interactive visual analytics 
• Vertical (domain-driven) solutions
Conclusions 
Casado R., Younas M. Emerging trends and technologies in big data processing. Concurrency and Computation: Practice and Experience, 2014.
Conclusions 
• Big Data is not only Hadoop 
• Identify the processing requirements of your project 
• Analyze the alternatives for all steps in the data pipeline 
• The battle for real-time processing is open 
• Stay tuned for the hybrid computation model
Thanks for your attention! Questions? 
ruben.casado@treelogic.com 
ruben_casado

HUG France - Apache Drill
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R Workshop
 
Lecture1
Lecture1Lecture1
Lecture1
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
 
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data WarehouseReal-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 

More from Facultad de Informática UCM

More from Facultad de Informática UCM (20)

¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?
 
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
 
DRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersDRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation Computers
 
uElectronics ongoing activities at ESA
uElectronics ongoing activities at ESAuElectronics ongoing activities at ESA
uElectronics ongoing activities at ESA
 
Tendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmTendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura Arm
 
Formalizing Mathematics in Lean
Formalizing Mathematics in LeanFormalizing Mathematics in Lean
Formalizing Mathematics in Lean
 
Introduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingIntroduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented Computing
 
Computer Design Concepts for Machine Learning
Computer Design Concepts for Machine LearningComputer Design Concepts for Machine Learning
Computer Design Concepts for Machine Learning
 
Inteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroInteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuro
 
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 Design Automation Approaches for Real-Time Edge Computing for Science Applic... Design Automation Approaches for Real-Time Edge Computing for Science Applic...
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
 
Fault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error Correction
 
Cómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoCómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intento
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPC
 
Type and proof structures for concurrency
Type and proof structures for concurrencyType and proof structures for concurrency
Type and proof structures for concurrency
 
Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...
 
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
 
Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?
 
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
 
Challenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windChallenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore wind
 

Recently uploaded

Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
AnaAcapella
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Recently uploaded (20)

How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learning
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111
 
Simple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdfSimple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdf
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 

Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades

  • 1. Dr. Rubén Casado ruben.casado@treelogic.com ruben_casado Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades UNIVERSIDAD COMPLUTENSE DE MADRID 19 November 2014
  • 2. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 3.  PhD in Software Engineering  MSc in Computer Science  BSc in Computer Science Academics Work Experience
  • 4. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 5. A massive volume of both structured and unstructured data that is too large to process with traditional database and software techniques What is Big Data?
  • 6. Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization What is Big Data? -Gartner IT Glossary -
  • 7. 3 problems Volume Variety Velocity
  • 8. 3 solutions Batch processing NoSQL Streaming processing
  • 9. 3 solutions Batch processing NoSQL Streaming processing
  • 10. Volume Variety Velocity Science or Engineering?
  • 11. Science or Engineering? Volume Variety Value Velocity
  • 12. Science or Engineering? Volume Variety Value Velocity Software Engineering Data Science
  • 13.  Relational Databases  Schema based  ACID (Atomicity, Consistency, Isolation, Durability)  Performance penalty  Scalability issues  NoSQL  Not Only SQL  Families of solutions  Google BigTable, Amazon Dynamo  BASE = Basically Available, Soft state, Eventually consistent  CAP = Consistency, Availability, Partition tolerance NoSQL
  • 14.  Key-value  Key: ID  Value: associated data  Dictionary-like model  LinkedIn Voldemort  Riak, Redis  Memcache, Membase  Document  More complex than K-V  Documents are indexed by ID  Multiple indexes  MongoDB  CouchDB  Column  Tables with predefined families of fields  Fields within families are flexible  Vertical and horizontal partitioning  HBase  Cassandra  Graph  Nodes  Relationships  Neo4j  FlockDB  OrientDB CR7: 'Cristiano Ronaldo' CR7: {Name: 'Cristiano' Surname: 'Ronaldo' Age: 29} CR7: [Personal: {Name: 'Cristiano' Surname: 'Ronaldo' Age: 29} Job: {Team: 'R. Madrid' Salary: 20.000.000}] NoSQL [CR] is_named [Cristiano], [CR] plays_for [R.Madrid]
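The four data models above can be sketched with plain Python structures. This is an illustrative analogue only, not tied to any of the listed products; all names are taken from the slide's own example.

```python
# The same player record expressed in the four NoSQL data models.

# Key-value: an opaque value indexed by a single key
kv_store = {"CR7": "Cristiano Ronaldo"}

# Document: the value is a structured, indexable document
doc_store = {"CR7": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29}}

# Column-family: predefined families of fields, flexible fields inside each family
column_store = {
    "CR7": {
        "Personal": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29},
        "Job": {"Team": "R. Madrid", "Salary": 20_000_000},
    }
}

# Graph: nodes plus typed relationships
nodes = {"CR", "Cristiano", "R.Madrid"}
edges = [("CR", "is_named", "Cristiano"), ("CR", "plays_for", "R.Madrid")]

print(doc_store["CR7"]["Age"])             # 29
print(column_store["CR7"]["Job"]["Team"])  # R. Madrid
```

The point of the comparison: each model trades query flexibility against structure, from opaque values (key-value) up to explicit relationships (graph).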
  • 15. • Scalable • Large amount of static data • Distributed • Parallel • Fault tolerant • High latency Batch processing Volume
  • 16. • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant Streaming processing Velocity
  • 17. • Low latency: real-time • Massive data-at-rest + data-in-motion • Scalable • Combine batch and streaming results Hybrid computation model Volume Velocity
  • 18. All data New data Batch processing Streaming processing Batch results Stream results Combination Final results Hybrid computation model
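The data flow on this slide can be sketched in a few lines of plain Python: a batch view recomputed over all historical data, an incremental stream view over new data, and a query that merges both. Function names here are illustrative, not a real framework API.

```python
def batch_view(all_data):
    """Recomputed periodically over the full (static) dataset."""
    return {"count": len(all_data), "total": sum(all_data)}

def stream_view(view, new_value):
    """Updated incrementally for each event arriving after the last batch run."""
    view["count"] += 1
    view["total"] += new_value
    return view

def query(batch, stream):
    """Combination step: merge both views into the final result (a global average)."""
    count = batch["count"] + stream["count"]
    total = batch["total"] + stream["total"]
    return total / count if count else 0.0

historical = [7, 7, 7, 6, 6]          # data-at-rest, handled by the batch layer
batch = batch_view(historical)
stream = {"count": 0, "total": 0}
for event in [6, 9]:                  # data-in-motion, handled by the streaming layer
    stream = stream_view(stream, event)
print(query(batch, stream))           # 6.857142857142857
```

Each batch run would reset the stream view, which is exactly the Lambda Architecture idea developed later in the deck.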
  • 19.  Batch processing  Large amount of static data  Scalable solution  Volume  Streaming processing  Computing streaming data  Low latency  Velocity  Hybrid computation  Lambda Architecture  Volume + Velocity 2006 2010 2014 1st Generation 2nd Generation 3rd Generation Inception 2003 Processing Paradigms
  • 20. Batch +10 years of Big Data processing technologies 2003 2004 2005 2013 2011 2010 2008 The Google File System MapReduce: Simplified Data Processing on Large Clusters Doug Cutting starts developing Hadoop 2006 Yahoo! starts working on Hadoop Apache Hadoop is in production Nathan Marz creates Storm Yahoo! creates S4 2009 Facebook creates Hive Yahoo! creates Pig MillWheel: Fault-Tolerant Stream Processing at Internet Scale LinkedIn presents Samza LinkedIn presents Kafka Cloudera presents Flume 2012 Nathan Marz defines the Lambda Architecture Streaming Hybrid 2014 Spark stack is open sourced Lambdoop & Summingbird first steps Stratosphere becomes Apache Flink
  • 21. Processing Pipeline DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS
  • 22.  Static stations and mobile sensors in Asturias sending streaming data  Historical data of > 10 years  Monitoring, trends identification, predictions Air Quality case study
  • 23. 1. Big Data processing overview 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 24. Batch processing technologies DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS o HDFS commands o Sqoop o Flume o Scribe o HDFS o HBase o MapReduce o Hive o Pig o Cascading o Spark o SparkSQL (Shark)
  • 25. • Import to HDFS hadoop dfs -copyFromLocal <path-to-local> <path-to-remote> hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/ HDFS commands DATA ACQUISITION BATCH
  • 26. • Tool designed for transferring data between HDFS/HBase and structured datastores • Based on MapReduce • Includes connectors for multiple databases o MySQL, o PostgreSQL, o Oracle, o SQL Server and o DB2 o Generic JDBC connector • Java API Sqoop DATA ACQUISITION BATCH
  • 27. sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 1) Import data from database to HDFS sqoop export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 3) Export results to database 2) Analyze data (Hadoop) Sqoop DATA ACQUISITION BATCH
  • 28. • Service for collecting, aggregating, and moving large amounts of log data • Simple and flexible architecture based on streaming data flows • Reliability, scalability, extensibility, manageability • Supported log stream types o Avro o Syslog o Netcat Flume DATA ACQUISITION BATCH
  • 29. Sources Channels Sinks Avro Memory HDFS Thrift JDBC Logger Exec File Avro JMS Thrift NetCat IRC Syslog TCP/UDP File Roll HTTP Null HBase Custom Custom • Architecture o Source • Waiting for events. o Sink • Sends the information towards another agent or system. o Channel • Stores the information until it is consumed by the sink. Flume DATA ACQUISITION BATCH
  • 30. Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis  Air quality syslogs Flume DATA ACQUISITION BATCH Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  • 31. • Server for aggregating log data streamed in real time from a large number of servers • There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. • The central scribe server(s) can write the messages to the files that are their final destination Scribe DATA ACQUISITION BATCH
  • 32. category = 'mobile'; // '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …' message = sensor_log.readLine(); log_entry = scribe.LogEntry(category, message) // Create a Scribe client client = scribe.Client(iprot=protocol, oprot=protocol) transport.open() result = client.Log(messages=[log_entry]) transport.close() • Sending a sensor message to a Scribe Server Scribe DATA ACQUISITION BATCH
  • 33. • Distributed File System for Hadoop • Master-Slave architecture (NameNode - DataNodes) o NameNode: manages the directory tree and regulates access to files by clients o DataNodes: store the data • Files are split into blocks of the same size and these blocks are stored and replicated in a set of DataNodes HDFS DATA STORAGE BATCH
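The block/replica idea above can be sketched in plain Python. This is a simplified illustration: the block size is arbitrary and the placement is round-robin, whereas real HDFS uses much larger blocks and rack-aware placement.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file into fixed-size blocks (the last one may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` DataNodes, round-robin style."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

data = b"x" * 300
blocks = split_into_blocks(data, block_size=128)
print(len(blocks))   # 3 blocks: 128 + 128 + 44 bytes
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

The NameNode's role in the slide corresponds to keeping the `placement` map; the DataNodes hold the actual `blocks`.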
  • 34. • Open-source non-relational distributed column-oriented database modeled after Google's BigTable. • Random, realtime read/write access to the data. • Not a relational database. o Very light «schema» • Rows are stored in sorted order. DATA STORAGE BATCH HBase
  • 35. • Framework for processing large amounts of data in parallel across a distributed cluster • Slightly inspired by the classic Divide and Conquer (D&C) strategy • The developer has to implement Map and Reduce functions: o Map: takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes, parsed to the format <K, V> o Reduce: collects the <K, List(V)> pairs and generates the results MapReduce DATA ANALYTICS BATCH
  • 36. • Design Patterns o Joins o Reduce side Join o Replicated join o Semi join o Sorting: o Secondary sort o Total Order Sort o Filtering MapReduce o Statistics o AVG o VAR o Count o … o Top-K o Binning o … DATA ANALYTICS BATCH
  • 37. • Obtain the SO2 average of each station MapReduce Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23"; DATA ANALYTICS BATCH
  • 38. [Figure: mappers read the input records and emit <Station_ID, SO2_value> pairs, which the shuffle phase groups by station] • Maps get records and produce the SO2 value as <Station_Id, SO2_value> MapReduce DATA ANALYTICS BATCH
  • 39. [Figure: each reducer sums and divides the grouped values, producing Station_ID, AVG_SO2 pairs: 1, 2.013; 2, 2.695; 3, 3.562] • The Reducer receives <Station_Id, List<SO2_value>> and computes the average for the station MapReduce DATA ANALYTICS BATCH
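The map/shuffle/reduce flow of the two slides above can be simulated in plain Python. The records are shortened versions of the slide's sample data (station id in field 0, SO2 in field 5); this is a teaching analogue, not Hadoop code.

```python
from collections import defaultdict

records = [
    "1;Estacion Avenida Constitucion;43.52;-5.67;2001-01-01;7",
    "1;Estacion Avenida Constitucion;43.52;-5.67;2001-01-01;6",
    "2;Estacion Otra;43.53;-5.66;2001-01-01;9",
]

def map_fn(line):
    """Map: parse one record and emit a <Station_ID, SO2_value> pair."""
    fields = line.split(";")
    yield fields[0], int(fields[5])

# Shuffle: group all values emitted for the same key
grouped = defaultdict(list)
for line in records:
    for station, so2 in map_fn(line):
        grouped[station].append(so2)

def reduce_fn(station, values):
    """Reduce: receive <Station_ID, List<SO2_value>> and compute the average."""
    return station, sum(values) / len(values)

averages = dict(reduce_fn(k, v) for k, v in grouped.items())
print(averages)   # {'1': 6.5, '2': 9.0}
```

In a real job the shuffle is performed by the framework between the Map and Reduce phases; here it is the explicit `grouped` dictionary.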
  • 40. Hive • Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets • Abstraction layer on top of MapReduce • SQL-like language called HiveQL. • Metastore: Central repository of Hive metadata. DATA ANALYTICS BATCH
  • 41. CREATE TABLE air_quality (Estacion int, Titulo string, latitud double, longitud double, Fecha string, SO2 int, NO int, CO float, …) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n' STORED AS TEXTFILE; LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality; Hive • Obtain the SO2 average of each station SELECT Estacion, avg(SO2) FROM air_quality GROUP BY Estacion DATA ANALYTICS BATCH
  • 42. • Platform for analyzing large data sets • High-level language for expressing data analysis programs. Pig Latin. Data flow programming language. • Abstraction layer on top of MapReduce • Procedural language Pig DATA ANALYTICS BATCH
  • 43. Pig DATA ANALYTICS BATCH • Obtain the SO2 average of each station air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';') AS (estacion:chararray, titulo:chararray, latitud:chararray, longitud:chararray, fecha:chararray, so2:chararray, no:chararray, co:chararray, pm10:chararray, o3:chararray, dd:chararray, vv:chararray, tmp:chararray, hr:chararray, prb:chararray, rs:chararray, ll:chararray, ben:chararray, tol:chararray, mxil:chararray, pm25:chararray); grouped = GROUP air_quality BY estacion; avg = FOREACH grouped GENERATE group, AVG((double)air_quality.so2); DUMP avg;
  • 44. • Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows • Makes development of complex Hadoop MapReduce workflows easy • In the same way as Pig DATA ANALYTICS BATCH Cascading
  • 45. // define source and sink Taps. Tap source = new Hfs( sourceScheme, inputPath ); Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) ); Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE ); Pipe assembly = new Pipe( "avgSO2" ); assembly = new GroupBy( assembly, new Fields( "Estacion" ) ); // For every Tuple group Aggregator avg = new Average( new Fields( "SO2" ) ); assembly = new Every( assembly, avg ); // Tell Hadoop which jar file to use Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly ); // execute the flow, block until complete flow.complete(); DATA ANALYTICS BATCH • Obtain the SO2 average of each station Cascading
  • 46. Spark • Cluster computing system for faster data analytics • Not a modified version of Hadoop • Compatible with HDFS • In-memory data storage for very fast iterative processing • MapReduce-like engine • API in Scala, Java and Python DATA ANALYTICS BATCH
  • 47. Spark DATA ANALYTICS BATCH • Hadoop is slow due to replication, serialization and IO tasks
  • 48. Spark DATA ANALYTICS BATCH • 10x-100x faster
  • 49. Spark SQL • Large-scale data warehouse system for Spark • SQL on top of Spark (aka Shark) • Actually HiveQL over Spark • Up to 100x faster than Hive DATA ANALYTICS BATCH
  • 50. Pros • Faster than the Hadoop ecosystem • Easier to develop new applications o (Scala, Java and Python APIs) Cons • Not tested in extremely large clusters yet • Problems when the Reducer's data does not fit in memory DATA ANALYTICS BATCH Spark
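The reason Spark's in-memory model beats chained MapReduce jobs (the "10x-100x" slides above) can be sketched with a tiny plain-Python analogue of a lazy, recomputable dataset versus a cached one. This is not the Spark API; the `Dataset` class and its methods are made up for illustration.

```python
from functools import reduce as _reduce

class Dataset:
    """Toy analogue of an RDD: lazy by default, optionally cached in memory."""
    def __init__(self, compute):
        self._compute = compute      # lazy: recomputed on every action
        self._cache = None

    def map(self, f):
        return Dataset(lambda: [f(x) for x in self._collect()])

    def cache(self):
        self._cache = self._collect()  # materialize once, keep in memory
        return self

    def _collect(self):
        return self._cache if self._cache is not None else self._compute()

    def reduce(self, f):
        return _reduce(f, self._collect())

base = Dataset(lambda: list(range(1, 5)))
squared = base.map(lambda x: x * x).cache()   # kept in memory for reuse
print(squared.reduce(lambda a, b: a + b))     # 30
print(max(squared._collect()))                # 16, served from the cache
```

Without `cache()`, every action would rerun the whole `map` chain, which is the plain-Python equivalent of MapReduce re-reading intermediate results from disk between jobs.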
  • 51. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 52. Real-time processing technologies DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS o Flume o Kafka o Kestrel o Flume o Storm o Trident o S4 o Spark Streaming
  • 54. • Kafka is a distributed, partitioned, replicated commit log service o Producer/Consumer model o Kafka maintains feeds of messages in categories called topics o Kafka is run as a cluster Kafka DATA STORAGE STREAM
  • 55. Insert the AirQuality sensor log file into a Kafka cluster and consume the info. // new Producer Producer<String, String> producer = new Producer<String, String>(config); // Open sensor log file BufferedReader br … String line; while (true) { line = br.readLine(); if (line == null) … // wait else producer.send(new KeyedMessage<String, String>(topic, line)); } Kafka DATA STORAGE STREAM
  • 56. AirQuality Consumer ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config); Map<String, Integer> topicCountMap = new HashMap<String, Integer>(); topicCountMap.put(topic, new Integer(1)); Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap); KafkaMessageStream stream = consumerMap.get(topic).get(0); ConsumerIterator it = stream.iterator(); while (it.hasNext()) { // consume it.next() Kafka DATA STORAGE STREAM
  • 57. • Simple distributed message queue • A single Kestrel server has a set of queues (strictly-ordered FIFO) • On a cluster of Kestrel servers, they don't know about each other and don't do any cross communication • Kestrel vs Kafka o Kafka consumers are cheaper (basically just the bandwidth usage) o Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation o Kafka has significantly better throughput o Kestrel does not support ordered consumption Kestrel DATA STORAGE STREAM
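The Kestrel model described above is easy to sketch: each server holds independent strictly-ordered FIFO queues, and servers do not coordinate, so per-server order is preserved but cluster-wide order is not. The class below is an illustrative toy, not the Kestrel protocol.

```python
from collections import deque

class KestrelServer:
    """Toy model: a set of named, strictly-ordered FIFO queues on one server."""
    def __init__(self):
        self.queues = {}

    def put(self, queue, item):
        self.queues.setdefault(queue, deque()).append(item)

    def get(self, queue):
        q = self.queues.get(queue)
        return q.popleft() if q else None

# Two servers in a "cluster": no cross-communication between them
s1, s2 = KestrelServer(), KestrelServer()
s1.put("air", "event-1")
s2.put("air", "event-2")
s1.put("air", "event-3")

# Per-server FIFO order is preserved...
print(s1.get("air"), s1.get("air"))   # event-1 event-3
# ...but a consumer draining both servers gets no global ordering guarantee
```

This is the trade-off the slide names: operational simplicity (no coordination service) in exchange for no ordered consumption across the cluster.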
  • 58. Interceptor • Interface org.apache.flume.interceptor.Interceptor • Can modify or even drop events based on any criteria • Flume supports chaining of interceptors • Types: o Timestamp interceptor o Host interceptor o Static interceptor o UUID interceptor o Morphline interceptor o Regex Filtering interceptor o Regex Extractor interceptor DATA ANALYTICS STREAM Flume
  • 59. • The sensors' information must be filtered by "Station 2" o An interceptor will filter information between Source and Channel. Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35";"981"; "23"; "3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62";"983"; "23"; Flume DATA ANALYTICS STREAM
  • 60. # Write format can be text or writable … # Defining channel - Memory type … # Defining source - Syslog … # Defining sink - HDFS … # Defining interceptor agent.sources.source.interceptors = i1 agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter class StationFilter implements Interceptor … if (!"2".equals(stationId)) discard event; else keep event; Flume DATA ANALYTICS STREAM
  • 61. Hadoop Storm JobTracker Nimbus TaskTracker Supervisor Job Topology • Distributed and scalable realtime computation system • Doing for real-time processing what Hadoop did for batch processing • Topology: processing graph. Each node contains processing logic (spouts and bolts). Links between nodes are streams of data o Spout: Source of streams. Reads a data source and emits the data into the topology as a stream o Bolts: Processing units. Read data from several streams, do some processing and possibly emit new streams o Stream: Unbounded sequence of tuples. Tuples can contain any serializable object Storm DATA ANALYTICS STREAM
  • 62. CAReader LineProcessor AvgValues • AirQuality average values o Step 1: build the topology Storm Spout Bolt Bolt DATA ANALYTICS STREAM
• 63. • AirQuality average values o Step 1: build the topology TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("ca-reader", new CAReader(), 1); // shuffleGrouping -> even distribution builder.setBolt("ca-line-processor", new LineProcessor(), 3) .shuffleGrouping("ca-reader"); // fieldsGrouping -> tuples with the same field value go to the same task builder.setBolt("ca-avg-values", new AvgValues(), 2) .fieldsGrouping("ca-line-processor", new Fields("id")); Storm DATA ANALYTICS STREAM
• 64. public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { // Initialize file BufferedReader br = new … … } public void nextTuple() { String line = br.readLine(); if (line == null) { return; } else { collector.emit(new Values(line)); } } Storm • AirQuality average values o Step 2: CAReader implementation (IRichSpout interface) DATA ANALYTICS STREAM
• 65. public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("id", "stationName", "lat", … } public void execute(Tuple input, BasicOutputCollector collector) { collector.emit(new Values(input.getString(0).split(";"))); } Storm • AirQuality average values o Step 3: LineProcessor implementation (IBasicBolt interface) DATA ANALYTICS STREAM
• 66. public void execute(Tuple input, BasicOutputCollector collector) { // totals and counts are hashmaps with each station's accumulated values if (totals.containsKey(id)) { item = totals.get(id); count = counts.get(id); } else { // Create new item } // update values item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2"))); item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no"))); … } Storm • AirQuality average values o Step 4: AvgValues implementation (IBasicBolt interface) DATA ANALYTICS STREAM
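The accumulation elided above can be shown on its own: the bolt keeps a running total and count per station, and the average so far is total divided by count. This is a minimal sketch with the Storm Tuple plumbing removed; the field names (station id, SO2) follow the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Per-station running average, the core of what AvgValues accumulates.
public class RunningAvg {
    private final Map<String, Double> totals = new HashMap<>();
    private final Map<String, Integer> counts = new HashMap<>();

    // Called once per tuple: accumulate this station's SO2 reading.
    public void update(String stationId, int so2) {
        totals.merge(stationId, (double) so2, Double::sum);
        counts.merge(stationId, 1, Integer::sum);
    }

    // Average so far = accumulated total / number of readings.
    public double average(String stationId) {
        return totals.get(stationId) / counts.get(stationId);
    }
}
```

With the sample rows above, station "1" reports SO2 values 7 and 6, so its running average after two tuples is 6.5.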
  • 67. • High level abstraction on top of Storm o Provides high level operations (joins, filters, projections, aggregations, functions…) Pros o Easy, powerful and flexible o Incremental topology development o Exactly-once semantics Cons o Very few built-in functions o Lower performance and higher latency than Storm Trident DATA ANALYTICS STREAM
  • 68.  Simple Scalable Streaming System  Distributed, Scalable, Fault-tolerant platform for processing continuous unbounded streams of data  Inspired by MapReduce and Actor models of computation o Data processing is based on Processing Elements (PE) o Messages are transmitted between PEs in the form of events (Key, Attributes) o Processing Nodes are the logical hosts to PEs S4 DATA ANALYTICS STREAM
• 69. … <bean id="split" class="SplitPE"> <property name="dispatcher" ref="dispatcher"/> <property name="keys"> <!-- Listen for log lines --> <list> <value>LogLines *</value> </list> </property> </bean> <bean id="average" class="AveragePE"> <property name="keys"> <list> <value>CAItem stationId</value> </list> </property> </bean> … • AirQuality average values S4 DATA ANALYTICS STREAM
  • 70. Spark Streaming • Spark for real-time processing • Streaming computation as a series of very short batch jobs (windows) • Keep state in memory • API similar to Spark DATA ANALYTICS STREAM
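The micro-batch idea behind Spark Streaming can be illustrated without Spark itself: cut the unbounded stream into short fixed-length windows and run one small batch computation (here, a sum) per window. This is a toy model of the concept only, not the Spark Streaming API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy micro-batching: each event is a {timestampMillis, value} pair.
public class MicroBatch {
    // Assign every event to its window, then produce one summed
    // "batch result" per window index.
    public static Map<Long, Integer> process(List<long[]> events, long windowMillis) {
        Map<Long, Integer> perWindow = new TreeMap<>();
        for (long[] e : events) {
            long window = e[0] / windowMillis;   // window index for this event
            perWindow.merge(window, (int) e[1], Integer::sum);
        }
        return perWindow;
    }
}
```

With a 100 ms window, events at t=0 and t=50 land in window 0 and are processed together as one small batch, while an event at t=150 is processed in the next batch.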
  • 71. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 72. • We are in the beginning of this generation • Short-term Big Data processing goal • Abstraction layer over the Lambda Architecture • Promising technologies o SummingBird o Lambdoop Hybrid Computation Model
• 73. SummingBird • Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model • Scala syntax • Same logic can be executed in batch, real-time and hybrid batch/real-time mode HYBRID COMPUTATION MODEL
• 75. Pros • Hybrid computation model • Same programming model for all processing paradigms • Extensible Cons • MapReduce-like programming • Scala • Not as abstract as some users would like SummingBird HYBRID COMPUTATION MODEL
• 76.  Software abstraction layer over Open Source technologies o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident  Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process  Same single API for the three processing paradigms o Batch processing similar to Pig / Cascading o Real-time processing using built-in functions, easier than Trident o Hybrid computation model transparent for the developer Lambdoop HYBRID COMPUTATION MODEL
• 77. Lambdoop Data Operation Data Workflow Streaming data Static data HYBRID COMPUTATION MODEL
• 78. DataInput db_historical = new StaticCSVInput(URI_db); Data historical = new Data(db_historical); Workflow batch = new Workflow(historical); Operation filter = new Filter("Station", "=", 2); Operation select = new Select("Titulo", "SO2"); Operation group = new Group("Titulo"); Operation average = new Average("SO2"); batch.add(filter); batch.add(select); batch.add(group); batch.add(average); batch.run(); Data results = batch.getResults(); … Lambdoop HYBRID COMPUTATION MODEL
• 79. DataInput stream_sensor = new StreamXMLInput(URI_sensor); Data sensor = new Data(stream_sensor); Workflow streaming = new Workflow(sensor, new WindowsTime(100)); Operation filter = new Filter("Station", "=", 2); Operation select = new Select("Titulo", "SO2"); Operation group = new Group("Titulo"); Operation average = new Average("SO2"); streaming.add(filter); streaming.add(select); streaming.add(group); streaming.add(average); streaming.run(); while (true) { Data live_results = streaming.getResults(); … } Lambdoop HYBRID COMPUTATION MODEL
• 80. DataInput historical = new StaticCSVInput(URI_folder); DataInput stream_sensor = new StreamXMLInput(URI_sensor); Data all_info = new Data(historical, stream_sensor); Workflow hybrid = new Workflow(all_info, new WindowsTime(1000)); Operation filter = new Filter("Station", "=", 2); Operation select = new Select("Titulo", "SO2"); Operation group = new Group("Titulo"); Operation average = new Average("SO2"); hybrid.add(filter); hybrid.add(select); hybrid.add(group); hybrid.add(average); hybrid.run(); Data updated_results = hybrid.getResults(); Lambdoop HYBRID COMPUTATION MODEL
• 81. Pros • High abstraction layer for all processing models • All steps in the data processing pipeline • Same Java API for all programming paradigms • Extensible Cons • Ongoing project • Not open-source yet • Not tested in large clusters yet Lambdoop HYBRID COMPUTATION MODEL
  • 82. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
• 83. Open Issues • Interoperability between well-known techniques / technologies (SQL, R) and Big Data platforms (Hadoop, Spark) • European technologies (Stratosphere / Apache Flink) • Massive Streaming Machine Learning • Real-time Interactive Visual Analytics • Vertical (domain-driven) solutions
• 84. Conclusions Casado R., Younas M. Emerging trends and technologies in big data processing. Concurrency Computat.: Pract. Exper. 2014
• 85. Conclusions • Big Data is not only Hadoop • Identify the processing requirements of your project • Analyze the alternatives for all steps in the data pipeline • The battle for real-time processing is open • Stay tuned for the hybrid computation model
  • 86. Thanks for your attention! Questions? ruben.casado@treelogic.com ruben_casado