Dr. Rubén Casado 
ruben.casado@treelogic.com 
ruben_casado 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades (Processing paradigms in Big Data: current state, trends and opportunities) 
Universidad Complutense de Madrid 
19 November 2014
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Academics 
• PhD in Software Engineering 
• MSc in Computer Science 
• BSc in Computer Science 
Work Experience
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
What is Big Data? 
A massive volume of both structured and unstructured data that is so large it is difficult to process with traditional database and software techniques.
"Big Data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." 
- Gartner IT Glossary -
3 problems: Volume, Variety, Velocity 
3 solutions: Batch processing → Volume · NoSQL → Variety · Streaming processing → Velocity
Science or Engineering? 
• The Vs: Volume, Variety, Velocity + Value 
• Software Engineering vs. Data Science
NoSQL 
Relational Databases 
• Schema based 
• ACID (Atomicity, Consistency, Isolation, Durability) 
• Performance penalty 
• Scalability issues 
NoSQL 
• Not Only SQL 
• Families of solutions 
• Google BigTable, Amazon Dynamo 
• BASE = Basically Available, Soft state, Eventually consistent 
• CAP = Consistency, Availability, Partition tolerance
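The BASE trade-off above gives up immediate consistency for availability: a write may land on one replica before the others, so a read from another replica can briefly return stale data until background synchronization converges them. A minimal Python sketch of that behavior (illustrative only, not modeled on any particular NoSQL store):

```python
# Illustrative sketch of eventual consistency (BASE); toy code, no real database.
class Replica:
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = value

    def read(self, key):
        return self.store.get(key)

def anti_entropy(source, target):
    """Background sync: copy keys from one replica to another."""
    target.store.update(source.store)

r1, r2 = Replica(), Replica()
r1.write("CR7", "Cristiano Ronaldo")   # write accepted by one replica only
stale = r2.read("CR7")                 # other replica has not seen it yet -> None
anti_entropy(r1, r2)                   # replicas converge eventually
fresh = r2.read("CR7")                 # now consistent
```

Real stores converge via mechanisms such as anti-entropy or read repair; the point here is only that `stale` and `fresh` can differ for the same key.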
NoSQL 
• Key-value 
o Key: ID; Value: associated data (a dictionary) 
o LinkedIn Voldemort; Riak, Redis; Memcache, Membase 
o Example: CR7: 'Cristiano Ronaldo' 
• Document 
o More complex than K-V; documents are indexed by ID; multiple indexes 
o MongoDB, CouchDB 
o Example: CR7: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29} 
• Column 
o Tables with predefined families of fields; fields within families are flexible; vertical and horizontal partitioning 
o HBase, Cassandra 
o Example: CR7: [Personal: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}, Job: {Team: 'R. Madrid', Salary: 20,000,000}] 
• Graph 
o Nodes and relationships 
o Neo4j, FlockDB, OrientDB 
o Example: [Cristiano] -is_called-> [CR], [Cristiano] -plays_for-> [R.Madrid]
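To make the four families concrete, the same player record from the slide can be shaped for each model using plain Python structures as stand-ins (the field names follow the slide's example; nothing here is a real database API):

```python
# One record, four NoSQL shapes (plain Python stand-ins).

# Key-value: an opaque value addressed by a key
kv = {"CR7": "Cristiano Ronaldo"}

# Document: a nested, self-describing structure indexed by ID
doc = {"CR7": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29}}

# Column: predefined column families, flexible fields inside each family
col = {"CR7": {"Personal": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29},
               "Job": {"Team": "R. Madrid", "Salary": 20_000_000}}}

# Graph: nodes plus labelled relationships
nodes = {"CR", "Cristiano", "R.Madrid"}
edges = [("Cristiano", "is_called", "CR"),
         ("Cristiano", "plays_for", "R.Madrid")]
```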
Batch processing → Volume 
• Scalable 
• Large amounts of static data 
• Distributed 
• Parallel 
• Fault tolerant 
• High latency
Streaming processing → Velocity 
• Low latency 
• Continuous unbounded streams of data 
• Distributed 
• Parallel 
• Fault tolerant
Hybrid computation model → Volume + Velocity 
• Low latency: real-time 
• Massive data-at-rest + data-in-motion 
• Scalable 
• Combines batch and streaming results
Hybrid computation model 
• All data → Batch processing → Batch results 
• New data → Streaming processing → Stream results 
• Batch results + Stream results → Combination → Final results
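The flow above can be sketched as a query that merges a precomputed batch view with an incrementally maintained stream view. A toy Python sketch of the Lambda idea, with made-up numbers and a sum standing in for the real computation:

```python
# Toy Lambda-style merge: batch view over all absorbed data plus an
# incremental stream view over data that arrived since the last batch run.
def batch_view(all_data):
    """Recomputed periodically over the complete, static dataset."""
    return sum(all_data)

def stream_view(new_data):
    """Maintained incrementally over recent, not-yet-batched data."""
    return sum(new_data)

def query(all_data, new_data):
    # Final result = combination of batch results and stream results
    return batch_view(all_data) + stream_view(new_data)

historical = [10, 20, 30]   # already absorbed by the batch layer
recent = [5, 5]             # still only in the speed layer
total = query(historical, recent)
```

Once the next batch run absorbs the recent data, the stream view is reset and the query result stays the same: that is the correctness argument behind the model.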
• Batch processing: large amounts of static data, scalable solutions → Volume 
• Streaming processing: computing over streaming data, low latency → Velocity 
• Hybrid computation (Lambda Architecture) → Volume + Velocity
Processing Paradigms 
Inception: 2003 · 1st Generation (Batch): 2006 · 2nd Generation (Streaming): 2010 · 3rd Generation (Hybrid): 2014
+10 years of Big Data processing technologies (Batch · Streaming · Hybrid) 
2003: The Google File System 
2004: MapReduce: Simplified Data Processing on Large Clusters 
2005: Doug Cutting starts developing Hadoop 
2006: Yahoo! starts working on Hadoop 
2008: Apache Hadoop is in production 
2009: Facebook creates Hive; Yahoo! creates Pig 
2010: Yahoo! creates S4; Cloudera presents Flume 
2011: Nathan Marz creates Storm; LinkedIn presents Kafka 
2012: Nathan Marz defines the Lambda Architecture 
2013: MillWheel: Fault-Tolerant Stream Processing at Internet Scale; LinkedIn presents Samza 
2014: Spark stack is open sourced; Lambdoop & Summingbird first steps; Stratosphere becomes Apache Flink
Processing Pipeline: DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS
Air Quality case study 
• Static stations and mobile sensors in Asturias sending streaming data 
• Historical data of more than 10 years 
• Monitoring, trend identification, predictions
Agenda 
1. Big Data processing overview 
2. Batch processing 
3. Real-time processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Batch processing technologies 
• DATA ACQUISITION: HDFS commands, Sqoop, Flume, Scribe 
• DATA STORAGE: HDFS, HBase 
• DATA ANALYSIS: MapReduce, Hive, Pig, Cascading, Spark, Spark SQL (Shark) 
• RESULTS
HDFS commands (DATA ACQUISITION · BATCH) 
• Import to HDFS: 
hadoop dfs -copyFromLocal <path-to-local> <path-to-remote> 
hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/
Sqoop (DATA ACQUISITION · BATCH) 
• Tool designed for transferring data between HDFS/HBase and structured datastores 
• Based on MapReduce 
• Includes connectors for multiple databases: MySQL, PostgreSQL, Oracle, SQL Server, DB2, plus a generic JDBC connector 
• Java API
Sqoop (DATA ACQUISITION · BATCH) 
1) Import data from database to HDFS: 
sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 
2) Analyze data (Hadoop) 
3) Export results to database: 
sqoop export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1
Flume (DATA ACQUISITION · BATCH) 
• Service for collecting, aggregating, and moving large amounts of log data 
• Simple and flexible architecture based on streaming data flows 
• Reliability, scalability, extensibility, manageability 
• Supported log stream types: Avro, Syslog, Netcat
Flume (DATA ACQUISITION · BATCH) 
• Architecture 
o Source: waits for events 
o Channel: stores the information until it is consumed by the sink 
o Sink: sends the information towards another agent or system 
• Built-in types 
o Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom 
o Channels: Memory, JDBC, File 
o Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom
Flume (DATA ACQUISITION · BATCH) 
• Air quality syslogs 
Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis. 
Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
Scribe (DATA ACQUISITION · BATCH) 
• Server for aggregating log data streamed in real time from a large number of servers 
• A Scribe server runs on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups 
• The central Scribe server(s) write the messages to the files that are their final destination
Scribe (DATA ACQUISITION · BATCH) 
• Sending a sensor message to a Scribe server 
category = 'mobile' 
# '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …' 
message = sensor_log.readline() 
log_entry = scribe.LogEntry(category, message) 
# create a Scribe client 
client = scribe.Client(iprot=protocol, oprot=protocol) 
transport.open() 
result = client.Log(messages=[log_entry]) 
transport.close()
HDFS (DATA STORAGE · BATCH) 
• Distributed file system for Hadoop 
• Master-slave architecture (NameNode - DataNodes) 
o NameNode: manages the directory tree and regulates access to files by clients 
o DataNodes: store the data 
• Files are split into blocks of the same size; the blocks are stored and replicated across a set of DataNodes
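The block-splitting and replication idea can be sketched in a few lines of Python. Toy values and a hypothetical round-robin placement are used here; real HDFS uses 64/128 MB blocks, a default replication factor of 3, and rack-aware placement decided by the NameNode:

```python
# Toy sketch of HDFS-style block splitting and replica placement.
BLOCK_SIZE = 4          # bytes, for illustration only (HDFS uses 64/128 MB)
REPLICATION = 2         # illustrative (HDFS default is 3)
datanodes = ["dn1", "dn2", "dn3"]

def split_blocks(data, size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks (last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    """NameNode-style bookkeeping: map block index -> list of DataNodes.
    Round-robin stands in for the real rack-aware policy."""
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

blocks = split_blocks(b"0123456789")
placement = place_replicas(blocks, datanodes)
```

Losing one DataNode then leaves every block readable from its other replica, which is the fault-tolerance property the slide refers to.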
HBase (DATA STORAGE · BATCH) 
• Open-source, non-relational, distributed, column-oriented database modeled after Google's BigTable 
• Random, real-time read/write access to the data 
• Not a relational database 
o Very light "schema" 
• Rows are stored in sorted order
MapReduce (DATA ANALYTICS · BATCH) 
• Framework for processing large amounts of data in parallel across a distributed cluster 
• Loosely inspired by the classic Divide and Conquer (D&C) strategy 
• The developer has to implement Map and Reduce functions: 
o Map: takes the input, partitions it into smaller sub-problems and distributes them to worker nodes as <K, V> pairs 
o Reduce: collects the <K, List(V)> pairs and generates the results
MapReduce design patterns (DATA ANALYTICS · BATCH) 
• Joins: reduce-side join, replicated join, semi-join 
• Sorting: secondary sort, total order sort 
• Filtering 
• Statistics: AVG, VAR, count, … 
• Top-K 
• Binning 
• …
MapReduce (DATA ANALYTICS · BATCH) 
• Obtain the SO2 average of each station 
Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
MapReduce (DATA ANALYTICS · BATCH) 
• Mappers read the input records and emit the SO2 value as <Station_ID, SO2_value> pairs, e.g. <1, 6>, <1, 2>, <1, 9>, <2, 6>, <2, 0>, <2, 8>, <3, 1>, <3, 9>, … 
• Shuffling groups the emitted pairs by Station_ID before they reach the reducers.
MapReduce (DATA ANALYTICS · BATCH) 
• Each reducer receives <Station_ID, List(SO2_value)> and computes the average for the station (sum, then divide): 
Station_ID, AVG_SO2 
1, 2.013 
2, 2.695 
3, 3.562
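The map → shuffle → reduce flow of this example can be simulated in plain Python, with made-up sample rows following the slide's field order (no Hadoop involved; this only illustrates the data movement):

```python
# Minimal simulation of map -> shuffle -> reduce for the per-station SO2
# average. Sample rows are invented; field positions follow the slide's CSV.
from collections import defaultdict

lines = [
    '"1";"Estacion A";"43.52";"-5.67";"2001-01-01";"7"',
    '"1";"Estacion A";"43.52";"-5.67";"2001-01-01";"5"',
    '"2";"Estacion B";"43.54";"-5.66";"2001-01-01";"4"',
]

def mapper(line):
    fields = [f.strip('"') for f in line.split(";")]
    yield fields[0], int(fields[5])          # emit <Station_ID, SO2_value>

def shuffle(pairs):
    groups = defaultdict(list)               # group by key: <ID, List(SO2)>
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values) / len(values)    # sum, then divide

pairs = [p for line in lines for p in mapper(line)]
results = dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the mappers and reducers run on different nodes and the shuffle happens over the network; the data movement, however, is exactly this.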
Hive (DATA ANALYTICS · BATCH) 
• Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets 
• Abstraction layer on top of MapReduce 
• SQL-like language called HiveQL 
• Metastore: central repository of Hive metadata
CREATE TABLE air_quality (Estacion INT, Titulo STRING, latitud DOUBLE, longitud DOUBLE, Fecha STRING, SO2 INT, NO INT, CO FLOAT, …) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE; 

LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality; 

Hive (DATA ANALYTICS · BATCH) 
• Obtain the SO2 average of each station: 
SELECT Titulo, avg(SO2) 
FROM air_quality 
GROUP BY Estacion, Titulo;
Pig (DATA ANALYTICS · BATCH) 
• Platform for analyzing large data sets 
• High-level language for expressing data analysis programs: Pig Latin, a data-flow programming language 
• Abstraction layer on top of MapReduce 
• Procedural language
Pig (DATA ANALYTICS · BATCH) 
• Obtain the SO2 average of each station: 
air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';') AS (estacion:chararray, titulo:chararray, latitud:chararray, longitud:chararray, fecha:chararray, so2:double, no:chararray, co:chararray, pm10:chararray, o3:chararray, dd:chararray, vv:chararray, tmp:chararray, hr:chararray, prb:chararray, rs:chararray, ll:chararray, ben:chararray, tol:chararray, mxil:chararray, pm25:chararray); 
grouped = GROUP air_quality BY estacion; 
avg_so2 = FOREACH grouped GENERATE group, AVG(air_quality.so2); 
DUMP avg_so2;
Cascading (DATA ANALYTICS · BATCH) 
• Cascading is a data processing API and query planner used for defining, sharing, and executing data-processing workflows 
• Makes development of complex Hadoop MapReduce workflows easy 
• Similar in purpose to Pig
Cascading (DATA ANALYTICS · BATCH) 
• Obtain the SO2 average of each station: 
// define source and sink Taps 
Tap source = new Hfs(sourceScheme, inputPath); 
Scheme sinkScheme = new TextLine(new Fields("Estacion", "SO2")); 
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE); 
Pipe assembly = new Pipe("avgSO2"); 
assembly = new GroupBy(assembly, new Fields("Estacion")); 
// for every Tuple group, compute the SO2 average 
Aggregator avg = new Average(new Fields("SO2")); 
assembly = new Every(assembly, avg); 
// tell Hadoop which jar file to use 
Flow flow = flowConnector.connect("avg-SO2", source, sink, assembly); 
// execute the flow, block until complete 
flow.complete();
Spark (DATA ANALYTICS · BATCH) 
• Cluster computing system for faster data analytics 
• Not a modified version of Hadoop 
• Compatible with HDFS 
• In-memory data storage for very fast iterative processing 
• MapReduce-like engine 
• APIs in Scala, Java and Python
Spark (DATA ANALYTICS · BATCH) 
• Hadoop is slow due to replication, serialization and I/O tasks 
• Spark is 10x-100x faster
Spark SQL (DATA ANALYTICS · BATCH) 
• Large-scale data warehouse system for Spark 
• SQL on top of Spark (aka Shark) 
• Essentially HiveQL over Spark 
• Up to 100x faster than Hive
Spark (DATA ANALYTICS · BATCH) 
Pros 
• Faster than the Hadoop ecosystem 
• Easier to develop new applications (Scala, Java and Python APIs) 
Cons 
• Not tested in extremely large clusters yet 
• Problems when a reducer's data does not fit in memory
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Real-time processing technologies 
• DATA ACQUISITION: Flume 
• DATA STORAGE: Kafka, Kestrel 
• DATA ANALYSIS: Flume (interceptors), Storm, Trident, S4, Spark Streaming 
• RESULTS
Flume (DATA ACQUISITION · STREAM)
Kafka (DATA STORAGE · STREAM) 
• Kafka is a distributed, partitioned, replicated commit log service 
o Producer/Consumer model 
o Kafka maintains feeds of messages in categories called topics 
o Kafka is run as a cluster
Kafka (DATA STORAGE · STREAM) 
Insert the AirQuality sensor log file into a Kafka cluster and consume the info. 
// new Producer 
Producer<String, String> producer = new Producer<String, String>(config); 
// open sensor log file 
BufferedReader br = … 
String line; 
while (true) { 
    line = br.readLine(); 
    if (line == null) 
        … // wait 
    else 
        producer.send(new KeyedMessage<String, String>(topic, line)); 
}
Kafka (DATA STORAGE · STREAM) 
AirQuality consumer 
ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config); 
Map<String, Integer> topicCountMap = new HashMap<String, Integer>(); 
topicCountMap.put(topic, new Integer(1)); 
Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap); 
KafkaMessageStream stream = consumerMap.get(topic).get(0); 
ConsumerIterator it = stream.iterator(); 
while (it.hasNext()) { 
    // consume it.next() 
}
Kestrel (DATA STORAGE · STREAM) 
• Simple distributed message queue 
• A single Kestrel server has a set of queues (strictly-ordered FIFO) 
• In a cluster of Kestrel servers, the servers don't know about each other and do no cross communication 
• Kestrel vs. Kafka 
o Kafka consumers are cheaper (basically just the bandwidth usage) 
o Kestrel does not depend on ZooKeeper, so it is operationally less complex if you don't already have a ZooKeeper installation 
o Kafka has significantly better throughput 
o Kestrel does not support ordered consumption
Flume interceptors (DATA ANALYTICS · STREAM) 
• Interface org.apache.flume.interceptor.Interceptor 
• Can modify or even drop events based on any criteria 
• Flume supports chaining of interceptors 
• Types: Timestamp, Host, Static, UUID, Morphline, Regex Filtering, Regex Extractor
Flume (DATA ANALYTICS · STREAM) 
• The sensors' information must be filtered to keep only Station 2 
o An interceptor filters events between the Source and the Channel. 
Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; 
"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35";"981"; "23"; 
"3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; 
"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62";"983"; "23";
Flume (DATA ANALYTICS · STREAM) 
# Write format can be text or writable 
… 
# Defining channel - memory type 
… 
# Defining source - syslog 
… 
# Defining sink - HDFS 
… 
# Defining interceptor 
agent.sources.source.interceptors = i1 
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter 

class StationFilter implements Interceptor 
… 
if (!station.equals("2")) 
    discard event; 
else 
    keep event;
Storm (DATA ANALYTICS · STREAM) 
• Distributed and scalable real-time computation system 
• Doing for real-time processing what Hadoop did for batch processing 
• Hadoop ↔ Storm: JobTracker ↔ Nimbus; TaskTracker ↔ Supervisor; Job ↔ Topology 
• Topology: processing graph. Each node contains processing logic (spouts and bolts); links between nodes are streams of data 
o Spout: source of streams. Reads a data source and emits the data into the topology as a stream 
o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams 
o Stream: unbounded sequence of tuples. Tuples can contain any serializable object
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 1: build the topology 
CAReader (Spout) → LineProcessor (Bolt) → AvgValues (Bolt)
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 1: build the topology 
TopologyBuilder AirAVG = new TopologyBuilder(); 
AirAVG.setSpout("ca-reader", new CAReader(), 1); 
// shuffleGrouping -> even distribution 
AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3) 
    .shuffleGrouping("ca-reader"); 
// fieldsGrouping -> tuples with the same field value go to the same task 
AirAVG.setBolt("ca-avg-values", new AvgValues(), 2) 
    .fieldsGrouping("ca-line-processor", new Fields("id"));
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 2: CAReader implementation (IRichSpout interface) 
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { 
    // initialize the input file 
    BufferedReader br = new … 
    … 
} 

public void nextTuple() { 
    String line = br.readLine(); 
    if (line == null) { 
        return; 
    } else { 
        collector.emit(new Values(line)); 
    } 
}
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 3: LineProcessor implementation (IBasicBolt interface) 
public void declareOutputFields(OutputFieldsDeclarer declarer) { 
    declarer.declare(new Fields("id", "stationName", "lat", … 
} 

public void execute(Tuple input, BasicOutputCollector collector) { 
    collector.emit(new Values(input.getString(0).split(";"))); 
}
Storm (DATA ANALYTICS · STREAM) 
• AirQuality average values 
o Step 4: AvgValues implementation (IBasicBolt interface) 
public void execute(Tuple input, BasicOutputCollector collector) { 
    // totals and counts are hashmaps with each station's accumulated values 
    if (totals.containsKey(id)) { 
        item = totals.get(id); 
        count = counts.get(id); 
    } else { 
        // create new item 
    } 
    // update values 
    item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2"))); 
    item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no"))); 
    … 
}
Trident (DATA ANALYTICS · STREAM) 
• High-level abstraction on top of Storm 
o Provides high-level operations (joins, filters, projections, aggregations, functions…) 
Pros 
o Easy, powerful and flexible 
o Incremental topology development 
o Exactly-once semantics 
Cons 
o Very few built-in functions 
o Lower performance and higher latency than raw Storm
S4 (DATA ANALYTICS · STREAM) 
• Simple Scalable Streaming System 
• Distributed, scalable, fault-tolerant platform for processing continuous unbounded streams of data 
• Inspired by MapReduce and the Actor model of computation 
o Data processing is based on Processing Elements (PEs) 
o Messages are transmitted between PEs in the form of events (Key, Attributes) 
o Processing Nodes are the logical hosts of PEs
S4 (DATA ANALYTICS · STREAM) 
• AirQuality average values 
… 
<bean id="split" class="SplitPE"> 
  <property name="dispatcher" ref="dispatcher"/> 
  <property name="keys"> 
    <!-- Listen for incoming log lines --> 
    <list> 
      <value>LogLines *</value> 
    </list> 
  </property> 
</bean> 
<bean id="average" class="AveragePE"> 
  <property name="keys"> 
    <list> 
      <value>CAItem stationId</value> 
    </list> 
  </property> 
</bean> 
…
Spark Streaming (DATA ANALYTICS · STREAM) 
• Spark for real-time processing 
• Streaming computation as a series of very short batch jobs (windows) 
• Keeps state in memory 
• API similar to Spark's
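The micro-batch idea can be sketched in plain Python: chop the incoming stream into fixed-size windows and run the same "batch" computation on each one (illustrative values; real Spark Streaming windows are time-based DStreams, not element counts):

```python
# Toy sketch of micro-batch streaming: an unbounded stream is processed as a
# series of small batches, each handled by an ordinary batch computation.
def micro_batches(stream, window_size):
    """Yield consecutive fixed-size windows of the stream."""
    for i in range(0, len(stream), window_size):
        yield stream[i:i + window_size]

def batch_job(values):
    """The per-window computation: here, a simple average."""
    return sum(values) / len(values)

so2_stream = [7, 5, 6, 4, 9, 3]           # made-up SO2 readings
window_averages = [batch_job(w) for w in micro_batches(so2_stream, 2)]
```

This is why latency in Spark Streaming is bounded below by the window length: a result only appears once its micro-batch closes.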
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Hybrid Computation Model 
• We are at the beginning of this generation 
• The short-term Big Data processing goal 
• An abstraction layer over the Lambda Architecture 
• Promising technologies: Summingbird, Lambdoop
Summingbird (HYBRID COMPUTATION MODEL) 
• Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model 
• Scala syntax 
• The same logic can be executed in batch, real-time and hybrid batch/real-time mode
Summingbird (HYBRID COMPUTATION MODEL) 
Pros 
• Hybrid computation model 
• Same programming model for all processing paradigms 
• Extensible 
Cons 
• MapReduce-like programming 
• Scala 
• Not as abstract as some users would like
Lambdoop (HYBRID COMPUTATION MODEL) 
• Software abstraction layer over open-source technologies 
o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident 
• Common patterns and operations (aggregation, filtering, statistics…) already implemented: no MapReduce-like processes 
• The same single API for the three processing paradigms 
o Batch processing similar to Pig / Cascading 
o Real-time processing using built-in functions, easier than Trident 
o Hybrid computation model transparent for the developer
Lambdoop (HYBRID COMPUTATION MODEL) 
• Building blocks: Data (static or streaming), Operation, Workflow
DataInput db_historical = new StaticCSVInput(URI_db); 
Data historical = new Data(db_historical); 
Workflow batch = new Workflow(historical); 
Operation filter = new Filter("Station", "=", 2); 
Operation select = new Select("Titulo", "SO2"); 
Operation group = new Group("Titulo"); 
Operation average = new Average("SO2"); 
batch.add(filter); 
batch.add(select); 
batch.add(group); 
batch.add(average); 
batch.run(); 
Data results = batch.getResults(); 
… 
Lambdoop (HYBRID COMPUTATION MODEL)
DataInput stream_sensor = new StreamXMLInput(URI_sensor); 
Data sensor = new Data(stream_sensor); 
Workflow streaming = new Workflow(sensor, new WindowsTime(100)); 
Operation filter = new Filter("Station", "=", 2); 
Operation select = new Select("Titulo", "SO2"); 
Operation group = new Group("Titulo"); 
Operation average = new Average("SO2"); 
streaming.add(filter); 
streaming.add(select); 
streaming.add(group); 
streaming.add(average); 
streaming.run(); 
while (true) { 
    Data live_results = streaming.getResults(); 
    … 
} 
Lambdoop (HYBRID COMPUTATION MODEL)
DataInput historical = new StaticCSVInput(URI_folder); 
DataInput stream_sensor = new StreamXMLInput(URI_sensor); 
Data all_info = new Data(historical, stream_sensor); 
Workflow hybrid = new Workflow(all_info, new WindowsTime(1000)); 
Operation filter = new Filter("Station", "=", 2); 
Operation select = new Select("Titulo", "SO2"); 
Operation group = new Group("Titulo"); 
Operation average = new Average("SO2"); 
hybrid.add(filter); 
hybrid.add(select); 
hybrid.add(group); 
hybrid.add(average); 
hybrid.run(); 
Data updated_results = hybrid.getResults(); 
Lambdoop (HYBRID COMPUTATION MODEL)
Lambdoop (HYBRID COMPUTATION MODEL) 
Pros 
• High abstraction layer for all processing models 
• Covers all steps in the data processing pipeline 
• Same Java API for all programming paradigms 
• Extensible 
Cons 
• Ongoing project 
• Not open-source yet 
• Not tested in large clusters yet
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Open Issues 
• Interoperability between well-known techniques/technologies (SQL, R) and Big Data platforms (Hadoop, Spark) 
• European technologies (Stratosphere / Apache Flink) 
• Massive streaming machine learning 
• Real-time interactive visual analytics 
• Vertical (domain-driven) solutions
Conclusions 
Casado R., Younas M. Emerging trends and technologies in big data processing. Concurrency and Computation: Practice and Experience, 2014.
Conclusions 
• Big Data is not only Hadoop 
• Identify the processing requirements of your project 
• Analyze the alternatives for all steps in the data pipeline 
• The battle for real-time processing is open 
• Stay tuned for the hybrid computation model
Thanks for your attention! Questions? 
ruben.casado@treelogic.com 
ruben_casado

HUG France - Apache Drill
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R Workshop
 
Lecture1
Lecture1Lecture1
Lecture1
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
 
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data WarehouseReal-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 

More from Facultad de Informática UCM

More from Facultad de Informática UCM (20)

¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?
 
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
 
DRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersDRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation Computers
 
uElectronics ongoing activities at ESA
uElectronics ongoing activities at ESAuElectronics ongoing activities at ESA
uElectronics ongoing activities at ESA
 
Tendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmTendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura Arm
 
Formalizing Mathematics in Lean
Formalizing Mathematics in LeanFormalizing Mathematics in Lean
Formalizing Mathematics in Lean
 
Introduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingIntroduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented Computing
 
Computer Design Concepts for Machine Learning
Computer Design Concepts for Machine LearningComputer Design Concepts for Machine Learning
Computer Design Concepts for Machine Learning
 
Inteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroInteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuro
 
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 Design Automation Approaches for Real-Time Edge Computing for Science Applic... Design Automation Approaches for Real-Time Edge Computing for Science Applic...
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
 
Fault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error Correction
 
Cómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoCómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intento
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPC
 
Type and proof structures for concurrency
Type and proof structures for concurrencyType and proof structures for concurrency
Type and proof structures for concurrency
 
Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...
 
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
 
Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?
 
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
 
Challenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windChallenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore wind
 

Recently uploaded

Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
AnaAcapella
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Recently uploaded (20)

How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learning
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111
 
Simple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdfSimple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdf
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 

Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades

  • 1. Dr. Rubén Casado ruben.casado@treelogic.com ruben_casado Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades UNIVERSIDAD COMPLUTENSE DE MADRID 19 November 2014
  • 2. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 3.  PhD in Software Engineering  MSc in Computer Science  BSc in Computer Science Academics Work Experience
  • 4. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 5. A massive volume of both structured and unstructured data that is too large to process with traditional database and software techniques What is Big Data?
  • 6. Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization What is Big Data? -Gartner IT Glossary -
  • 7. 3 problems Volume Variety Velocity
  • 8. 3 solutions Batch processing NoSQL Streaming processing
  • 9. 3 solutions Batch processing NoSQL Streaming processing
  • 10. Volume Variety Velocity Science or Engineering?
  • 11. Science or Engineering? Volume Variety Value Velocity
  • 12. Science or Engineering? Volume Variety Value Velocity Software Engineering Data Science
  • 13.  Relational Databases  Schema based  ACID (Atomicity, Consistency, Isolation, Durability)  Performance penalty  Scalability issues  NoSQL  Not Only SQL  Families of solutions  Google BigTable, Amazon Dynamo  BASE = Basically Available, Soft state, Eventually consistent  CAP = Consistency, Availability, Partition tolerance NoSQL
  • 14.  Key-value  Key: ID  Value: associated data  Dictionary-like model  LinkedIn Voldemort  Riak, Redis  Memcache, Membase  Document  More complex than K-V  Documents are indexed by ID  Multiple indexes  MongoDB  CouchDB  Column  Tables with predefined families of fields  Fields within families are flexible  Vertical and horizontal partitioning  HBase  Cassandra  Graph  Nodes  Relationships  Neo4j  FlockDB  OrientDB CR7: 'Cristiano Ronaldo' CR7: {Name: 'Cristiano' Surname: 'Ronaldo' Age: 29} CR7: [Personal: {Name: 'Cristiano' Surname: 'Ronaldo' Age: 29} Job: {Team: 'R. Madrid' Salary: 20.000.000}] NoSQL [CR] is_named [Cristiano], [CR] plays_for [R.Madrid]
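The four data models above can be sketched with plain Python structures. This is an illustrative analogue only, not tied to any of the listed products; all names are taken from the slide's own example.

```python
# The same player record expressed in the four NoSQL data models.

# Key-value: an opaque value indexed by a single key
kv_store = {"CR7": "Cristiano Ronaldo"}

# Document: the value is a structured, indexable document
doc_store = {"CR7": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29}}

# Column-family: predefined families of fields, flexible fields inside each family
column_store = {
    "CR7": {
        "Personal": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29},
        "Job": {"Team": "R. Madrid", "Salary": 20_000_000},
    }
}

# Graph: nodes plus typed relationships
nodes = {"CR", "Cristiano", "R.Madrid"}
edges = [("CR", "is_named", "Cristiano"), ("CR", "plays_for", "R.Madrid")]

print(doc_store["CR7"]["Age"])             # 29
print(column_store["CR7"]["Job"]["Team"])  # R. Madrid
```

The point of the comparison: each model trades query flexibility against structure, from opaque values (key-value) up to explicit relationships (graph).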
  • 15. • Scalable • Large amount of static data • Distributed • Parallel • Fault tolerant • High latency Batch processing Volume
  • 16. • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant Streaming processing Velocity
  • 17. • Low latency: real-time • Massive data-at-rest + data-in-motion • Scalable • Combine batch and streaming results Hybrid computation model Volume Velocity
  • 18. All data New data Batch processing Streaming processing Batch results Stream results Combination Final results Hybrid computation model
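The data flow on this slide can be sketched in a few lines of plain Python: a batch view recomputed over all historical data, an incremental stream view over new data, and a query that merges both. Function names here are illustrative, not a real framework API.

```python
def batch_view(all_data):
    """Recomputed periodically over the full (static) dataset."""
    return {"count": len(all_data), "total": sum(all_data)}

def stream_view(view, new_value):
    """Updated incrementally for each event arriving after the last batch run."""
    view["count"] += 1
    view["total"] += new_value
    return view

def query(batch, stream):
    """Combination step: merge both views into the final result (a global average)."""
    count = batch["count"] + stream["count"]
    total = batch["total"] + stream["total"]
    return total / count if count else 0.0

historical = [7, 7, 7, 6, 6]          # data-at-rest, handled by the batch layer
batch = batch_view(historical)
stream = {"count": 0, "total": 0}
for event in [6, 9]:                  # data-in-motion, handled by the streaming layer
    stream = stream_view(stream, event)
print(query(batch, stream))           # 6.857142857142857
```

Each batch run would reset the stream view, which is exactly the Lambda Architecture idea developed later in the deck.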
  • 19.  Batch processing  Large amount of static data  Scalable solution  Volume  Streaming processing  Computing streaming data  Low latency  Velocity  Hybrid computation  Lambda Architecture  Volume + Velocity 2006 2010 2014 1st Generation 2nd Generation 3rd Generation Inception 2003 Processing Paradigms
  • 20. Batch +10 years of Big Data processing technologies 2003 2004 2005 2013 2011 2010 2008 The Google File System MapReduce: Simplified Data Processing on Large Clusters Doug Cutting starts developing Hadoop 2006 Yahoo! starts working on Hadoop Apache Hadoop is in production Nathan Marz creates Storm Yahoo! creates S4 2009 Facebook creates Hive Yahoo! creates Pig MillWheel: Fault-Tolerant Stream Processing at Internet Scale LinkedIn presents Samza LinkedIn presents Kafka Cloudera presents Flume 2012 Nathan Marz defines the Lambda Architecture Streaming Hybrid 2014 Spark stack is open sourced Lambdoop & Summingbird first steps Stratosphere becomes Apache Flink
  • 21. Processing Pipeline DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS
  • 22.  Static stations and mobile sensors in Asturias sending streaming data  Historical data of > 10 years  Monitoring, trends identification, predictions Air Quality case study
  • 23. 1. Big Data processing overview 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 24. Batch processing technologies DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS o HDFS commands o Sqoop o Flume o Scribe o HDFS o HBase o MapReduce o Hive o Pig o Cascading o Spark o SparkSQL (Shark)
  • 25. • Import to HDFS hadoop dfs -copyFromLocal <path-to-local> <path-to-remote> hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/ HDFS commands DATA ACQUISITION BATCH
  • 26. • Tool designed for transferring data between HDFS/HBase and structured datastores • Based on MapReduce • Includes connectors for multiple databases o MySQL, o PostgreSQL, o Oracle, o SQL Server and o DB2 o Generic JDBC connector • Java API Sqoop DATA ACQUISITION BATCH
  • 27. sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 1) Import data from database to HDFS sqoop export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 3) Export results to database 2) Analyze data (Hadoop) Sqoop DATA ACQUISITION BATCH
  • 28. • Service for collecting, aggregating, and moving large amounts of log data • Simple and flexible architecture based on streaming data flows • Reliability, scalability, extensibility, manageability • Supported log stream types o Avro o Syslog o Netcat Flume DATA ACQUISITION BATCH
  • 29. Sources Channels Sinks Avro Memory HDFS Thrift JDBC Logger Exec File Avro JMS Thrift NetCat IRC Syslog TCP/UDP File Roll HTTP Null HBase Custom Custom • Architecture o Source • Waiting for events. o Sink • Sends the information towards another agent or system. o Channel • Stores the information until it is consumed by the sink. Flume DATA ACQUISITION BATCH
  • 30. Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis  Air quality syslogs Flume DATA ACQUISITION BATCH Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  • 31. • Server for aggregating log data streamed in real time from a large number of servers • There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. • The central scribe server(s) can write the messages to the files that are their final destination Scribe DATA ACQUISITION BATCH
  • 32. category = 'mobile'; // '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …' message = sensor_log.readLine(); log_entry = scribe.LogEntry(category, message) // Create a Scribe client client = scribe.Client(iprot=protocol, oprot=protocol) transport.open() result = client.Log(messages=[log_entry]) transport.close() • Sending a sensor message to a Scribe Server Scribe DATA ACQUISITION BATCH
  • 33. • Distributed File System for Hadoop • Master-Slave architecture (NameNode - DataNodes) o NameNode: manages the directory tree and regulates access to files by clients o DataNodes: store the data • Files are split into blocks of the same size and these blocks are stored and replicated in a set of DataNodes HDFS DATA STORAGE BATCH
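The block/replica idea above can be sketched in plain Python. This is a simplified illustration: the block size is arbitrary and the placement is round-robin, whereas real HDFS uses much larger blocks and rack-aware placement.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file into fixed-size blocks (the last one may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` DataNodes, round-robin style."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

data = b"x" * 300
blocks = split_into_blocks(data, block_size=128)
print(len(blocks))   # 3 blocks: 128 + 128 + 44 bytes
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

The NameNode's role in the slide corresponds to keeping the `placement` map; the DataNodes hold the actual `blocks`.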
  • 34. • Open-source non-relational distributed column-oriented database modeled after Google's BigTable. • Random, realtime read/write access to the data. • Not a relational database. o Very light «schema» • Rows are stored in sorted order. DATA STORAGE BATCH HBase
  • 35. • Framework for processing large amounts of data in parallel across a distributed cluster • Slightly inspired by the classic Divide and Conquer (D&C) strategy • The developer has to implement Map and Reduce functions: o Map: takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes, parsed to the format <K, V> o Reduce: collects the <K, List(V)> pairs and generates the results MapReduce DATA ANALYTICS BATCH
  • 36. • Design Patterns o Joins o Reduce side Join o Replicated join o Semi join o Sorting: o Secondary sort o Total Order Sort o Filtering MapReduce o Statistics o AVG o VAR o Count o … o Top-K o Binning o … DATA ANALYTICS BATCH
  • 37. • Obtain the SO2 average of each station MapReduce Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23"; DATA ANALYTICS BATCH
  • 38. [Figure: mappers read the input records and emit <Station_ID, SO2_value> pairs, which the shuffle phase groups by station] • Maps get records and produce the SO2 value as <Station_Id, SO2_value> MapReduce DATA ANALYTICS BATCH
  • 39. [Figure: each reducer sums and divides the grouped values, producing Station_ID, AVG_SO2 pairs: 1, 2.013; 2, 2.695; 3, 3.562] • The Reducer receives <Station_Id, List<SO2_value>> and computes the average for the station MapReduce DATA ANALYTICS BATCH
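The map/shuffle/reduce flow of the two slides above can be simulated in plain Python. The records are shortened versions of the slide's sample data (station id in field 0, SO2 in field 5); this is a teaching analogue, not Hadoop code.

```python
from collections import defaultdict

records = [
    "1;Estacion Avenida Constitucion;43.52;-5.67;2001-01-01;7",
    "1;Estacion Avenida Constitucion;43.52;-5.67;2001-01-01;6",
    "2;Estacion Otra;43.53;-5.66;2001-01-01;9",
]

def map_fn(line):
    """Map: parse one record and emit a <Station_ID, SO2_value> pair."""
    fields = line.split(";")
    yield fields[0], int(fields[5])

# Shuffle: group all values emitted for the same key
grouped = defaultdict(list)
for line in records:
    for station, so2 in map_fn(line):
        grouped[station].append(so2)

def reduce_fn(station, values):
    """Reduce: receive <Station_ID, List<SO2_value>> and compute the average."""
    return station, sum(values) / len(values)

averages = dict(reduce_fn(k, v) for k, v in grouped.items())
print(averages)   # {'1': 6.5, '2': 9.0}
```

In a real job the shuffle is performed by the framework between the Map and Reduce phases; here it is the explicit `grouped` dictionary.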
  • 40. Hive • Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets • Abstraction layer on top of MapReduce • SQL-like language called HiveQL. • Metastore: Central repository of Hive metadata. DATA ANALYTICS BATCH
  • 41. CREATE TABLE air_quality (Estacion int, Titulo string, latitud double, longitud double, Fecha string, SO2 int, NO int, CO float, …) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n' STORED AS TEXTFILE; LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality; Hive • Obtain the SO2 average of each station SELECT Estacion, avg(SO2) FROM air_quality GROUP BY Estacion DATA ANALYTICS BATCH
  • 42. • Platform for analyzing large data sets • High-level language for expressing data analysis programs. Pig Latin. Data flow programming language. • Abstraction layer on top of MapReduce • Procedural language Pig DATA ANALYTICS BATCH
  • 43. Pig DATA ANALYTICS BATCH • Obtain the SO2 average of each station air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';') AS (estacion:chararray, titulo:chararray, latitud:chararray, longitud:chararray, fecha:chararray, so2:chararray, no:chararray, co:chararray, pm10:chararray, o3:chararray, dd:chararray, vv:chararray, tmp:chararray, hr:chararray, prb:chararray, rs:chararray, ll:chararray, ben:chararray, tol:chararray, mxil:chararray, pm25:chararray); grouped = GROUP air_quality BY estacion; avg = FOREACH grouped GENERATE group, AVG((double)air_quality.so2); DUMP avg;
  • 44. • Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows • Makes development of complex Hadoop MapReduce workflows easy • In the same way as Pig DATA ANALYTICS BATCH Cascading
  • 45. // define source and sink Taps. Tap source = new Hfs( sourceScheme, inputPath ); Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) ); Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE ); Pipe assembly = new Pipe( "avgSO2" ); assembly = new GroupBy( assembly, new Fields( "Estacion" ) ); // For every Tuple group Aggregator avg = new Average( new Fields( "SO2" ) ); assembly = new Every( assembly, avg ); // Tell Hadoop which jar file to use Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly ); // execute the flow, block until complete flow.complete(); DATA ANALYTICS BATCH • Obtain the SO2 average of each station Cascading
  • 46. Spark • Cluster computing system for faster data analytics • Not a modified version of Hadoop • Compatible with HDFS • In-memory data storage for very fast iterative processing • MapReduce-like engine • API in Scala, Java and Python DATA ANALYTICS BATCH
  • 47. Spark DATA ANALYTICS BATCH • Hadoop is slow due to replication, serialization and IO tasks
  • 48. Spark DATA ANALYTICS BATCH • 10x-100x faster
  • 49. Spark SQL • Large-scale data warehouse system for Spark • SQL on top of Spark (aka Shark) • Actually HiveQL over Spark • Up to 100x faster than Hive DATA ANALYTICS BATCH
  • 50. Pros • Faster than the Hadoop ecosystem • Easier to develop new applications o (Scala, Java and Python APIs) Cons • Not tested in extremely large clusters yet • Problems when the Reducer's data does not fit in memory DATA ANALYTICS BATCH Spark
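The reason Spark's in-memory model beats chained MapReduce jobs (the "10x-100x" slides above) can be sketched with a tiny plain-Python analogue of a lazy, recomputable dataset versus a cached one. This is not the Spark API; the `Dataset` class and its methods are made up for illustration.

```python
from functools import reduce as _reduce

class Dataset:
    """Toy analogue of an RDD: lazy by default, optionally cached in memory."""
    def __init__(self, compute):
        self._compute = compute      # lazy: recomputed on every action
        self._cache = None

    def map(self, f):
        return Dataset(lambda: [f(x) for x in self._collect()])

    def cache(self):
        self._cache = self._collect()  # materialize once, keep in memory
        return self

    def _collect(self):
        return self._cache if self._cache is not None else self._compute()

    def reduce(self, f):
        return _reduce(f, self._collect())

base = Dataset(lambda: list(range(1, 5)))
squared = base.map(lambda x: x * x).cache()   # kept in memory for reuse
print(squared.reduce(lambda a, b: a + b))     # 30
print(max(squared._collect()))                # 16, served from the cache
```

Without `cache()`, every action would rerun the whole `map` chain, which is the plain-Python equivalent of MapReduce re-reading intermediate results from disk between jobs.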
  • 51. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 52. Real-time processing technologies DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS o Flume o Kafka o Kestrel o Flume o Storm o Trident o S4 o Spark Streaming
  • 54. • Kafka is a distributed, partitioned, replicated commit log service o Producer/Consumer model o Kafka maintains feeds of messages in categories called topics o Kafka is run as a cluster Kafka DATA STORAGE STREAM
  • 55. Insert the AirQuality sensor log file into a Kafka cluster and consume the info. // new Producer Producer<String, String> producer = new Producer<String, String>(config); // Open sensor log file BufferedReader br … String line; while (true) { line = br.readLine(); if (line == null) … // wait else producer.send(new KeyedMessage<String, String>(topic, line)); } Kafka DATA STORAGE STREAM
  • 56. AirQuality Consumer ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config); Map<String, Integer> topicCountMap = new HashMap<String, Integer>(); topicCountMap.put(topic, new Integer(1)); Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap); KafkaMessageStream stream = consumerMap.get(topic).get(0); ConsumerIterator it = stream.iterator(); while (it.hasNext()) { // consume it.next() Kafka DATA STORAGE STREAM
  • 57. • Simple distributed message queue • A single Kestrel server has a set of queues (strictly-ordered FIFO) • On a cluster of Kestrel servers, they don't know about each other and don't do any cross communication • Kestrel vs Kafka o Kafka consumers are cheaper (basically just the bandwidth usage) o Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation o Kafka has significantly better throughput o Kestrel does not support ordered consumption Kestrel DATA STORAGE STREAM
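The Kestrel model described above is easy to sketch: each server holds independent strictly-ordered FIFO queues, and servers do not coordinate, so per-server order is preserved but cluster-wide order is not. The class below is an illustrative toy, not the Kestrel protocol.

```python
from collections import deque

class KestrelServer:
    """Toy model: a set of named, strictly-ordered FIFO queues on one server."""
    def __init__(self):
        self.queues = {}

    def put(self, queue, item):
        self.queues.setdefault(queue, deque()).append(item)

    def get(self, queue):
        q = self.queues.get(queue)
        return q.popleft() if q else None

# Two servers in a "cluster": no cross-communication between them
s1, s2 = KestrelServer(), KestrelServer()
s1.put("air", "event-1")
s2.put("air", "event-2")
s1.put("air", "event-3")

# Per-server FIFO order is preserved...
print(s1.get("air"), s1.get("air"))   # event-1 event-3
# ...but a consumer draining both servers gets no global ordering guarantee
```

This is the trade-off the slide names: operational simplicity (no coordination service) in exchange for no ordered consumption across the cluster.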
  • 58. Interceptor • Interface org.apache.flume.interceptor.Interceptor • Can modify or even drop events based on any criteria • Flume supports chaining of interceptors • Types: o Timestamp interceptor o Host interceptor o Static interceptor o UUID interceptor o Morphline interceptor o Regex Filtering interceptor o Regex Extractor interceptor DATA ANALYTICS STREAM Flume
  • 59. • The sensors' information must be filtered by "Station 2" o An interceptor will filter information between Source and Channel. Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35";"981"; "23"; "3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62";"983"; "23"; Flume DATA ANALYTICS STREAM
  • 60. # Write format can be text or writable … # Defining channel - Memory type … # Defining source - Syslog … # Defining sink - HDFS … # Defining interceptor agent.sources.source.interceptors = i1 agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter class StationFilter implements Interceptor … if (!"2".equals(stationId)) discard event; else keep event; Flume DATA ANALYTICS STREAM
  • 61. Hadoop Storm JobTracker Nimbus TaskTracker Supervisor Job Topology • Distributed and scalable realtime computation system • Doing for real-time processing what Hadoop did for batch processing • Topology: processing graph. Each node contains processing logic (spouts and bolts). Links between nodes are streams of data o Spout: Source of streams. Reads a data source and emits the data into the topology as a stream o Bolts: Processing units. Read data from several streams, do some processing and possibly emit new streams o Stream: Unbounded sequence of tuples. Tuples can contain any serializable object Storm DATA ANALYTICS STREAM
  • 62. CAReader LineProcessor AvgValues • AirQuality average values o Step 1: build the topology Storm Spout Bolt Bolt DATA ANALYTICS STREAM
• 63. • AirQuality average values o Step 1: build the topology TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("ca-reader", new CAReader(), 1); // shuffleGrouping -> even distribution builder.setBolt("ca-line-processor", new LineProcessor(), 3) .shuffleGrouping("ca-reader"); // fieldsGrouping -> tuples with the same field value go to the same task builder.setBolt("ca-avg-values", new AvgValues(), 2) .fieldsGrouping("ca-line-processor", new Fields("id")); Storm DATA ANALYTICS STREAM
• 64. public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { // Initialize file BufferedReader br = new … … } public void nextTuple() { String line = br.readLine(); if (line == null) { return; } else { collector.emit(new Values(line)); } } Storm • AirQuality average values o Step 2: CAReader implementation (IRichSpout interface) DATA ANALYTICS STREAM
• 65. public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("id", "stationName", "lat", … } public void execute(Tuple input, BasicOutputCollector collector) { collector.emit(new Values(input.getString(0).split(";"))); } Storm • AirQuality average values o Step 3: LineProcessor implementation (IBasicBolt interface) DATA ANALYTICS STREAM
• 66. public void execute(Tuple input, BasicOutputCollector collector) { // totals and counts are hashmaps with each station's accumulated values if (totals.containsKey(id)) { item = totals.get(id); count = counts.get(id); } else { // Create new item } // update values item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2"))); item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no"))); … } Storm • AirQuality average values o Step 4: AvgValues implementation (IBasicBolt interface) DATA ANALYTICS STREAM
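The accumulation elided above can be shown on its own: the bolt keeps a running total and count per station, and the average so far is total divided by count. This is a minimal sketch with the Storm Tuple plumbing removed; the field names (station id, SO2) follow the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Per-station running average, the core of what AvgValues accumulates.
public class RunningAvg {
    private final Map<String, Double> totals = new HashMap<>();
    private final Map<String, Integer> counts = new HashMap<>();

    // Called once per tuple: accumulate this station's SO2 reading.
    public void update(String stationId, int so2) {
        totals.merge(stationId, (double) so2, Double::sum);
        counts.merge(stationId, 1, Integer::sum);
    }

    // Average so far = accumulated total / number of readings.
    public double average(String stationId) {
        return totals.get(stationId) / counts.get(stationId);
    }
}
```

With the sample rows above, station "1" reports SO2 values 7 and 6, so its running average after two tuples is 6.5.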
  • 67. • High level abstraction on top of Storm o Provides high level operations (joins, filters, projections, aggregations, functions…) Pros o Easy, powerful and flexible o Incremental topology development o Exactly-once semantics Cons o Very few built-in functions o Lower performance and higher latency than Storm Trident DATA ANALYTICS STREAM
  • 68.  Simple Scalable Streaming System  Distributed, Scalable, Fault-tolerant platform for processing continuous unbounded streams of data  Inspired by MapReduce and Actor models of computation o Data processing is based on Processing Elements (PE) o Messages are transmitted between PEs in the form of events (Key, Attributes) o Processing Nodes are the logical hosts to PEs S4 DATA ANALYTICS STREAM
• 69. … <bean id="split" class="SplitPE"> <property name="dispatcher" ref="dispatcher"/> <property name="keys"> <!-- Listen for log lines --> <list> <value>LogLines *</value> </list> </property> </bean> <bean id="average" class="AveragePE"> <property name="keys"> <list> <value>CAItem stationId</value> </list> </property> </bean> … • AirQuality average values S4 DATA ANALYTICS STREAM
  • 70. Spark Streaming • Spark for real-time processing • Streaming computation as a series of very short batch jobs (windows) • Keep state in memory • API similar to Spark DATA ANALYTICS STREAM
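The micro-batch idea behind Spark Streaming can be illustrated without Spark itself: cut the unbounded stream into short fixed-length windows and run one small batch computation (here, a sum) per window. This is a toy model of the concept only, not the Spark Streaming API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy micro-batching: each event is a {timestampMillis, value} pair.
public class MicroBatch {
    // Assign every event to its window, then produce one summed
    // "batch result" per window index.
    public static Map<Long, Integer> process(List<long[]> events, long windowMillis) {
        Map<Long, Integer> perWindow = new TreeMap<>();
        for (long[] e : events) {
            long window = e[0] / windowMillis;   // window index for this event
            perWindow.merge(window, (int) e[1], Integer::sum);
        }
        return perWindow;
    }
}
```

With a 100 ms window, events at t=0 and t=50 land in window 0 and are processed together as one small batch, while an event at t=150 is processed in the next batch.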
  • 71. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 72. • We are in the beginning of this generation • Short-term Big Data processing goal • Abstraction layer over the Lambda Architecture • Promising technologies o SummingBird o Lambdoop Hybrid Computation Model
• 73. SummingBird • Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model • Scala syntax • Same logic can be executed in batch, real-time and hybrid batch/real-time mode HYBRID COMPUTATION MODEL
• 75. Pros • Hybrid computation model • Same programming model for all processing paradigms • Extensible Cons • MapReduce-like programming • Scala • Not as abstract as some users would like SummingBird HYBRID COMPUTATION MODEL
• 76.  Software abstraction layer over Open Source technologies o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident  Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process  Same single API for the three processing paradigms o Batch processing similar to Pig / Cascading o Real-time processing using built-in functions, easier than Trident o Hybrid computation model transparent for the developer Lambdoop HYBRID COMPUTATION MODEL
• 77. Lambdoop Data Operation Data Workflow Streaming data Static data HYBRID COMPUTATION MODEL
• 78. DataInput db_historical = new StaticCSVInput(URI_db); Data historical = new Data(db_historical); Workflow batch = new Workflow(historical); Operation filter = new Filter("Station", "=", 2); Operation select = new Select("Titulo", "SO2"); Operation group = new Group("Titulo"); Operation average = new Average("SO2"); batch.add(filter); batch.add(select); batch.add(group); batch.add(average); batch.run(); Data results = batch.getResults(); … Lambdoop HYBRID COMPUTATION MODEL
• 79. DataInput stream_sensor = new StreamXMLInput(URI_sensor); Data sensor = new Data(stream_sensor); Workflow streaming = new Workflow(sensor, new WindowsTime(100)); Operation filter = new Filter("Station", "=", 2); Operation select = new Select("Titulo", "SO2"); Operation group = new Group("Titulo"); Operation average = new Average("SO2"); streaming.add(filter); streaming.add(select); streaming.add(group); streaming.add(average); streaming.run(); while (true) { Data live_results = streaming.getResults(); … } Lambdoop HYBRID COMPUTATION MODEL
• 80. DataInput historical = new StaticCSVInput(URI_folder); DataInput stream_sensor = new StreamXMLInput(URI_sensor); Data all_info = new Data(historical, stream_sensor); Workflow hybrid = new Workflow(all_info, new WindowsTime(1000)); Operation filter = new Filter("Station", "=", 2); Operation select = new Select("Titulo", "SO2"); Operation group = new Group("Titulo"); Operation average = new Average("SO2"); hybrid.add(filter); hybrid.add(select); hybrid.add(group); hybrid.add(average); hybrid.run(); Data updated_results = hybrid.getResults(); Lambdoop HYBRID COMPUTATION MODEL
• 81. Pros • High abstraction layer for all processing models • All steps in the data processing pipeline • Same Java API for all programming paradigms • Extensible Cons • Ongoing project • Not open-source yet • Not tested in large clusters yet Lambdoop HYBRID COMPUTATION MODEL
  • 82. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
• 83. Open Issues • Interoperability between well-known techniques / technologies (SQL, R) and Big Data platforms (Hadoop, Spark) • European technologies (Stratosphere / Apache Flink) • Massive Streaming Machine Learning • Real-time Interactive Visual Analytics • Vertical (domain-driven) solutions
• 84. Conclusions Casado R., Younas M. Emerging trends and technologies in big data processing. Concurrency Computat.: Pract. Exper. 2014
• 85. Conclusions • Big Data is not only Hadoop • Identify the processing requirements of your project • Analyze the alternatives for all steps in the data pipeline • The battle for real-time processing is open • Stay tuned for the hybrid computation model
  • 86. Thanks for your attention! Questions? ruben.casado@treelogic.com ruben_casado