The three generations of Big Data processing


Big Data is often characterized by the three “Vs”: variety, volume and velocity. While variety refers to the nature of the information (multiple sources, schema-less data, etc.), volume and velocity refer to processing issues that have to be addressed by different processing paradigms.

Assuming that the volumes of data are larger than those that conventional relational database infrastructures can cope with, the processing solutions break down broadly into massively parallel processing, i.e. batch processing. Batch processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time: the data is collected, entered and processed, and then the batch results are produced.

Several applications require real-time processing of data streams from heterogeneous sources, in contrast with the batch processing approach. Real-time processing involves a continual input, processing and output of data, which must be handled within a small time period (in or near real time). Application domains include smart cities, entertainment or disaster management. Low latency is the main goal of this processing paradigm.

Batch processing provides strong results since it can use more data and, for example, train better predictive models, but it is not feasible for domains where a low response time is critical. Real-time processing solves this issue, but the analyzed information is limited in order to achieve low latency. Many domains need the benefits of both the batch and the real-time approaches, so a new processing paradigm is needed: the hybrid model. To obtain a complete result, the batch and real-time results must be queried and merged together. Synchronization, result composition and other non-trivial issues have to be addressed at this stage, which can be considered a key element of the hybrid model.

This talk will give an overview of the time evolution of Big Data processing techniques, identify the main milestones (both technologies and scientific publications) and give an introduction to the key technologies needed to understand the complex Big Data processing domain.


Transcript

  • 1. The three generations of Big Data processing Rubén Casado ruben.casado@treelogic.com
  • 2. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  • 3. About me :-)
  • 4. Academics: PhD in Software Engineering, MSc in Computer Science, BSc in Computer Science. Work experience.
  • 5. About Treelogic
  • 6. Treelogic is an R&D intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life
  • 7. TREELOGIC – Distributor and Sales
  • 8. Research lines: Computer Vision, Security & Safety, Big Data, Data Science, Social Media Analysis, Semantics, Terahertz technology, R&D Management System. Projects: international, national, regional and internal R&D projects. Solutions: Justice, Health, Transport, Financial services, ICT tailored solutions.
  • 9. 7 ongoing FP7 projects ICT, SEC, OCEAN Coordinating 5 of them 3 ongoing Eurostars projects Coordinating all of them
  • 10. 7 years’ experience in R&D projects Research & INNOVATION
  • 11. www.datadopter.com
  • 12. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  • 13. What is Big Data? A massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques
  • 14. How is Big Data? Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization - Gartner IT Glossary -
  • 15. 3 problems: Volume, Variety, Velocity
  • 16. 3 solutions: Batch processing, Real-time processing, NoSQL
  • 17. 3 solutions: Batch processing, Real-time processing, NoSQL
  • 18. Batch processing (Volume) • Scalable • Large amount of static data • Distributed • Parallel • Fault tolerant • High latency
  • 19. Real-time processing (Velocity) • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant
  • 20. Hybrid computation model (Volume + Velocity) • Low latency • Massive data + Streaming data • Scalable • Combine batch and real-time results
  • 21. Hybrid computation model: All data → Batch processing → Batch results; New data → Real-time processing → Stream results; Batch results + Stream results → Combination → Final results
  • 22. Processing Paradigms: Inception (2003). 1st generation – Batch processing (2006): large amounts of static data, scalable solutions, Volume. 2nd generation – Real-time processing (2010): computing streaming data, low latency, Velocity. 3rd generation – Hybrid computation (2014): Lambda Architecture, Volume + Velocity.
  • 23. 10 years of Big Data processing technologies. Batch: The Google File System (2003); MapReduce: Simplified Data Processing on Large Clusters (2004); Doug Cutting starts developing Hadoop (2005); Yahoo! starts working on Hadoop (2006); Yahoo! creates Pig; Apache Hadoop is in production (2008); Facebook creates Hive (2009). Real-time: Yahoo! creates S4 (2010); Cloudera presents Flume; Nathan Marz creates Storm (2011); LinkedIn presents Kafka; LinkedIn presents Samza (2013); Google publishes MillWheel: Fault-Tolerant Stream Processing at Internet Scale. Hybrid: Nathan Marz defines the Lambda Architecture (2012).
  • 24. Processing Pipeline: Data acquisition → Data storage → Data analysis → Results
  • 25. Air Quality case study  Static stations and mobile sensors in Asturias sending streaming data  Historical data of > 10 years  Monitoring, trends identification, predictions
  • 26. Agenda 1. Big Data processing overview 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  • 27. Batch processing technologies. Data acquisition: HDFS commands, Sqoop, Flume, Scribe. Data storage: HDFS, HBase. Data analysis: MapReduce, Hive, Pig, Cascading, Spark, Shark.
  • 28. HDFS commands (Batch – Data acquisition) • Import to HDFS: hadoop dfs -copyFromLocal <path-to-local> <path-to-remote>, e.g. hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/
  • 29. Sqoop (Batch – Data acquisition) • Tool designed for transferring data between HDFS/HBase and structured datastores • Based on MapReduce • Includes connectors for multiple databases: MySQL, PostgreSQL, Oracle, SQL Server, DB2, plus a generic JDBC connector • Java API
  • 30. Sqoop (Batch – Data acquisition) 1) Import data from database to HDFS: sqoop import --all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 2) Analyze data (Hadoop) 3) Export results to database: sqoop export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1
  • 31. Flume (Batch – Data acquisition) • Service for collecting, aggregating and moving large amounts of log data • Simple and flexible architecture based on streaming data flows • Reliability, scalability, extensibility, manageability • Supported log stream types: Avro, Syslog, Netcat
  • 32. Flume (Batch – Data acquisition) • Architecture: Source – waits for events; Channel – stores the information until it is consumed by the sink; Sink – sends the information towards another agent or system. Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom. Channels: Memory, JDBC, File. Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom.
  • 33. Flume (Batch – Data acquisition) • Stations send the information to the servers; Flume collects this information and moves it into HDFS for further analysis. Air quality syslogs:
    Station; Tittle; latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  • 34. Scribe (Batch – Data acquisition) • Server for aggregating log data streamed in real time from a large number of servers • There is a Scribe server running on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups • The central Scribe server(s) can write the messages to the files that are their final destination
  • 35. Scribe (Batch – Data acquisition) • Sending a sensor message to a Scribe server:
    category = 'mobile'
    # e.g. '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …'
    message = sensor_log.readLine()
    log_entry = scribe.LogEntry(category, message)
    # Create a Scribe client
    client = scribe.Client(iprot=protocol, oprot=protocol)
    transport.open()
    result = client.Log(messages=[log_entry])
    transport.close()
  • 36. HDFS (Batch – Data storage) • Distributed file system for Hadoop • Master-slave architecture (NameNode – DataNodes): the NameNode manages the directory tree and regulates access to files by clients; the DataNodes store the data • Files are split into blocks of the same size and these blocks are stored and replicated in a set of DataNodes
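A minimal sketch, not part of the original deck, of how the same import could be done through the HDFS Java client API instead of the shell command shown earlier; the class name and paths are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsImport {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS from core-site.xml; the NameNode resolves the
            // target path and the file blocks are written to the DataNodes
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path("/home/hduser/AirQuality/airquality.csv"),
                                 new Path("/hdfs/AirQuality/airquality.csv"));
            fs.close();
        }
    }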
  • 37. HBase (Batch – Data storage) • Open-source non-relational distributed column-oriented database modeled after Google’s BigTable • Random, realtime read/write access to the data • Not a relational database: very light «schema» • Rows are stored in sorted order
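For illustration only (not from the slides), a put and a random read against a hypothetical air_quality table, written with the pre-1.0 HBase client API that was current at the time; the table name, column family and row-key layout are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AirQualityHBase {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "air_quality");

            // Row key = station id + date, so all rows of one station sort together
            Put put = new Put(Bytes.toBytes("1-2001-01-01"));
            put.add(Bytes.toBytes("measures"), Bytes.toBytes("SO2"), Bytes.toBytes("7"));
            table.put(put);

            // Random, realtime read of the same row
            Get get = new Get(Bytes.toBytes("1-2001-01-01"));
            Result result = table.get(get);
            byte[] so2 = result.getValue(Bytes.toBytes("measures"), Bytes.toBytes("SO2"));
            System.out.println(Bytes.toString(so2));
            table.close();
        }
    }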
  • 38. MapReduce (Batch – Data analysis) • Framework for processing large amounts of data in parallel across a distributed cluster • Loosely inspired by the classic Divide and Conquer (D&C) strategy • The developer has to implement Map and Reduce functions: Map takes the input, partitions it into smaller sub-problems and distributes them to worker nodes, parsed into the format <K, V>; Reduce collects the <K, List(V)> and generates the results
  • 39. MapReduce (Batch – Data analysis) • Design patterns: Joins (replicated join, reduce-side join, semi join, …), Statistics (AVG, VAR, count, …), Sorting (secondary sort, total order sort, …), Filtering (Top-K, binning, …)
  • 40. MapReduce (Batch – Data analysis) • Obtain the SO2 average of each station:
    Station; Tittle; latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  • 41. MapReduce (Batch – Data analysis) • Mappers read the input records and emit the SO2 value as <Station_ID, SO2_value> pairs, e.g. <1, 6> <1, 2> <3, 1> <1, 9> <3, 9> <2, 6> <2, 6> <1, 6> <2, 0> <2, 8> <1, 2> <3, 9> …, which are then shuffled
  • 42. MapReduce (Batch – Data analysis) • Reducers receive <Station_ID, List<SO2_value>> pairs and compute the average for each station (sum the values, then divide by their count), e.g. <1, 2.695>, <2, 2.013>, <3, 3.562>
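The deck later expresses this same per-station SO2 average in Hive, Pig and Cascading but not as a raw MapReduce job; the following is a minimal sketch of what the Map and Reduce functions could look like (class names and the omitted driver are assumptions; field positions follow the sample records above):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class So2Average {

        // Map: emit <Station_ID, SO2_value> for every input record
        public static class So2Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(";");
                if (!fields[0].startsWith("\"")) return;   // skip the header line
                String station = fields[0].replace("\"", "").trim();
                int so2 = Integer.parseInt(fields[5].replace("\"", "").trim());
                ctx.write(new Text(station), new IntWritable(so2));
            }
        }

        // Reduce: receive <Station_ID, list of SO2 values> and emit the average
        public static class So2Reducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
            protected void reduce(Text station, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0;
                long count = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                    count++;
                }
                ctx.write(station, new DoubleWritable((double) sum / count));
            }
        }
    }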
  • 43. Hive (Batch – Data analysis) • Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries and the analysis of large datasets • Abstraction layer on top of MapReduce • SQL-like language called HiveQL • Metastore: central repository of Hive metadata
  • 44. Hive (Batch – Data analysis) • Obtain the SO2 average of each station:
    CREATE TABLE air_quality(Estacion int, Titulo string, latitud double, longitud double, Fecha string, SO2 int, NO int, CO float, …) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality;
    SELECT Titulo, avg(SO2) FROM air_quality GROUP BY Titulo;
  • 45. Pig (Batch – Data analysis) • Platform for analyzing large data sets • High-level language for expressing data analysis programs: Pig Latin, a data flow programming language • Abstraction layer on top of MapReduce • Procedural language
  • 46. Pig (Batch – Data analysis) • Obtain the SO2 average of each station:
    calidad_aire = load '/CalidadAire_Gijon' using PigStorage(';') AS (estacion:chararray, titulo:chararray, latitud:chararray, longitud:chararray, fecha:chararray, so2:int, no:chararray, co:chararray, pm10:chararray, o3:chararray, dd:chararray, vv:chararray, tmp:chararray, hr:chararray, prb:chararray, rs:chararray, ll:chararray, ben:chararray, tol:chararray, mxil:chararray, pm25:chararray);
    grouped = GROUP calidad_aire BY estacion;
    avg = FOREACH grouped GENERATE group, AVG(calidad_aire.so2);
    dump avg;
  • 47. Cascading (Batch – Data analysis) • Cascading is a data processing API and processing query planner used for defining, sharing and executing data-processing workflows • Makes development of complex Hadoop MapReduce workflows easy, in the same way as Pig
  • 48. Cascading (Batch – Data analysis) • Obtain the SO2 average of each station:
    // define source and sink Taps
    Tap source = new Hfs( sourceScheme, inputPath );
    Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
    Pipe assembly = new Pipe( "avgSO2" );
    assembly = new GroupBy( assembly, new Fields( "Estacion" ) );
    // for every Tuple group, compute the SO2 average
    Aggregator avg = new Average( new Fields( "SO2" ) );
    assembly = new Every( assembly, avg );
    // tell Hadoop which jar file to use and connect the flow
    Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly );
    // execute the flow, block until complete
    flow.complete();
  • 49. Spark (Batch – Data analysis) • Cluster computing system for faster data analytics • Not a modified version of Hadoop • Compatible with HDFS • In-memory data storage for very fast iterative processing • MapReduce-like engine • API in Scala, Java and Python
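As a rough comparison point (not in the original deck), the same per-station SO2 average sketched with the Spark Java API; the application name, input path and field positions are assumptions, and Java 8 lambdas are used for brevity:

    import java.util.Map;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class So2AverageSpark {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("so2-avg"));
            JavaRDD<String> lines = sc.textFile("hdfs:///AirQuality/");

            // (station, (so2, 1)) pairs, reduced to (sum, count) per station
            JavaPairRDD<String, Tuple2<Integer, Integer>> sums = lines
                    .filter(line -> line.startsWith("\""))   // skip the header line
                    .mapToPair(line -> {
                        String[] f = line.split(";");
                        int so2 = Integer.parseInt(f[5].replace("\"", "").trim());
                        return new Tuple2<>(f[0], new Tuple2<>(so2, 1));
                    })
                    .reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()));

            // Average per station, collected to the driver as the final result
            Map<String, Double> averages = sums
                    .mapValues(t -> (double) t._1() / t._2())
                    .collectAsMap();
            averages.forEach((station, avg) -> System.out.println(station + " -> " + avg));
            sc.stop();
        }
    }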
  • 50. Spark (Batch – Data analysis) • Hadoop is slow due to replication, serialization and IO tasks
  • 51. Spark (Batch – Data analysis) • 10x-100x faster
  • 52. Shark (Batch – Data analysis) • Large-scale data warehouse system for Spark • SQL on top of Spark • Actually HiveQL over Spark • Up to 100x faster than Hive
  • 53. Spark / Shark (Batch – Data analysis) • Pros: faster than the Hadoop ecosystem; easier to develop new applications (Scala, Java and Python API) • Cons: not tested in extremely large clusters yet; problems when a reducer’s data does not fit in memory
  • 54. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  • 55. Real-time processing technologies. Data acquisition: Flume. Data storage: Kafka, Kestrel. Data analysis: Flume (interceptors), Storm, Trident, S4, Spark Streaming.
  • 56. Flume (Real-time – Data acquisition)
  • 57. Kafka (Real-time – Data storage) • Kafka is a distributed, partitioned, replicated commit log service • Producer/Consumer model • Kafka maintains feeds of messages in categories called topics • Kafka is run as a cluster
  • 58. Kafka (Real-time – Data storage) • Insert the AirQuality sensor log file into a Kafka cluster and consume the info:
    // new Producer
    Producer<String, String> producer = new Producer<String, String>(config);
    // open sensor log file
    BufferedReader br = …
    String line;
    while (true) {
        line = br.readLine();
        if (line == null) … // wait
        else producer.send(new KeyedMessage<String, String>(topic, line));
    }
  • 59. Kafka (Real-time – Data storage) • AirQuality consumer:
    ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put(topic, new Integer(1));
    Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap);
    KafkaMessageStream stream = consumerMap.get(topic).get(0);
    ConsumerIterator it = stream.iterator();
    while (it.hasNext()) {
        // consume it.next()
    }
  • 60. Kestrel (Real-time – Data storage) • Simple distributed message queue • A single Kestrel server has a set of queues (strictly-ordered FIFO) • On a cluster of Kestrel servers, they don’t know about each other and don’t do any cross communication • Kestrel vs Kafka: Kafka consumers are cheaper (basically just the bandwidth usage); Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation; Kafka has significantly better throughput; Kestrel does not support ordered consumption
  • 61. Flume Interceptor (Real-time – Data analysis) • Interface org.apache.flume.interceptor.Interceptor • Can modify or even drop events based on any criteria • Flume supports chaining of interceptors • Types: Timestamp interceptor, Host interceptor, Static interceptor, UUID interceptor, Morphline interceptor, Regex Filtering interceptor, Regex Extractor interceptor
  • 62. Flume (Real-time – Data analysis) • The sensors’ information must be filtered by "Station 2": an interceptor will filter information between Source and Channel.
    Station; Tittle; latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";
    "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";
    "3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";
    "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";
    "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  • 63. Flume (Real-time – Data analysis)
    class StationFilter implements Interceptor …
        if (!station.equals("2")) discard data; else save data;
    # Defining source – Syslog
    …
    # Defining channel – Memory type
    …
    # Defining sink – HDFS (write format can be text or writable)
    …
    # Defining interceptor
    agent.sources.source.interceptors = i1
    agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter
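To make the pseudocode above concrete, here is a hedged sketch of a custom interceptor implementing the org.apache.flume.interceptor.Interceptor interface; the class name StationFilter is taken from the slide, while the field layout and the drop-by-returning-null behaviour follow the standard Flume NG interceptor contract:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    public class StationFilter implements Interceptor {

        public void initialize() { }

        // Keep only the events whose first CSV field (the station id) is "2"
        public Event intercept(Event event) {
            String line = new String(event.getBody());
            String station = line.split(";")[0].replace("\"", "").trim();
            return "2".equals(station) ? event : null;   // returning null drops the event
        }

        public List<Event> intercept(List<Event> events) {
            List<Event> kept = new ArrayList<Event>();
            for (Event event : events) {
                Event out = intercept(event);
                if (out != null) {
                    kept.add(out);
                }
            }
            return kept;
        }

        public void close() { }

        // Flume instantiates interceptors through a nested Builder class
        public static class Builder implements Interceptor.Builder {
            public Interceptor build() { return new StationFilter(); }
            public void configure(Context context) { }
        }
    }

In the agent configuration, the interceptor type property would then typically point at the builder class (for example StationFilter$Builder) rather than at the interceptor itself.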
  • 64. Storm (Real-time – Data analysis) • Distributed and scalable realtime computation system • Doing for real-time processing what Hadoop did for batch processing • Topology: processing graph; each node contains processing logic (spouts and bolts) and links between nodes are streams of data. Spout: source of streams; reads a data source and emits the data into the topology as a stream. Bolt: processing unit; reads data from several streams, does some processing and possibly emits new streams. Stream: unbounded sequence of tuples; tuples can contain any serializable object • Hadoop vs Storm: JobTracker ↔ Nimbus, TaskTracker ↔ Supervisor, Job ↔ Topology
  • 65. Storm (Real-time – Data analysis) • AirQuality average values, Step 1: build the topology – CAReader spout → LineProcessor bolt → AvgValues bolt
  • 66. Storm (Real-time – Data analysis) • AirQuality average values, Step 1: build the topology:
    TopologyBuilder AirAVG = new TopologyBuilder();
    AirAVG.setSpout("ca-reader", new CAReader(), 1);
    // shuffleGrouping -> even distribution
    AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3)
          .shuffleGrouping("ca-reader");
    // fieldsGrouping -> tuples with the same field value go to the same task
    AirAVG.setBolt("ca-avg-values", new AvgValues(), 2)
          .fieldsGrouping("ca-line-processor", new Fields("id"));
  • 67. Storm (Real-time – Data analysis) • AirQuality average values, Step 2: CAReader implementation (IRichSpout interface):
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // initialize the file reader (br is a member field)
        BufferedReader br = new …
        …
    }
    public void nextTuple() {
        String line = br.readLine();
        if (line == null) { return; }
        else collector.emit(new Values(line));
    }
  • 68. Storm (Real-time – Data analysis) • AirQuality average values, Step 3: LineProcessor implementation (IBasicBolt interface):
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "stationName", "lat", …
    }
    public void execute(Tuple input, BasicOutputCollector collector) {
        collector.emit(new Values(input.getString(0).split(";")));
    }
  • 69. Storm (Real-time – Data analysis) • AirQuality average values, Step 4: AvgValues implementation (IBasicBolt interface):
    public void execute(Tuple input, BasicOutputCollector collector) {
        // totals and counts are hashmaps with each station's accumulated values
        if (totals.containsKey(id)) {
            item = totals.get(id);
            count = counts.get(id);
        } else {
            // create a new item
        }
        // update values
        item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2")));
        item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no")));
        …
    }
  • 70. Trident (Real-time – Data analysis) • High-level abstraction on top of Storm • Provides high-level operations (joins, filters, projections, aggregations, functions…) • Pros: easy, powerful and flexible; incremental topology development; exactly-once semantics • Cons: very few built-in functions; lower performance and higher latency than Storm
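Purely as an illustration (not from the deck), a Trident topology that keeps a running SO2 sum per station might be sketched as follows, assuming a spout that emits one raw CSV line per tuple in a field called "line"; the parsing function and field names are hypothetical:

    import backtype.storm.topology.IRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Sum;
    import storm.trident.testing.MemoryMapState;
    import storm.trident.tuple.TridentTuple;

    public class So2Trident {

        // Turns a raw CSV line into (station, so2) fields
        public static class ParseLine extends BaseFunction {
            public void execute(TridentTuple tuple, TridentCollector collector) {
                String[] f = tuple.getString(0).split(";");
                int so2 = Integer.parseInt(f[5].replace("\"", "").trim());
                collector.emit(new Values(f[0], so2));
            }
        }

        public static TridentTopology build(IRichSpout lineSpout) {
            TridentTopology topology = new TridentTopology();
            topology.newStream("air-quality", lineSpout)
                    .each(new Fields("line"), new ParseLine(), new Fields("station", "so2"))
                    .groupBy(new Fields("station"))
                    // incrementally maintained state: one running SO2 sum per station
                    .persistentAggregate(new MemoryMapState.Factory(), new Fields("so2"),
                                         new Sum(), new Fields("so2_sum"));
            return topology;
        }
    }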
  • 71. S4 (Real-time – Data analysis) • Simple Scalable Streaming System • Distributed, scalable, fault-tolerant platform for processing continuous unbounded streams of data • Inspired by MapReduce and the Actor model of computation: data processing is based on Processing Elements (PEs); messages are transmitted between PEs in the form of events (Key, Attributes); Processing Nodes are the logical hosts of PEs
  • 72. S4 (Real-time – Data analysis) • AirQuality average values:
    …
    <bean id="split" class="SplitPE">
      <property name="dispatcher" ref="dispatcher"/>
      <property name="keys">
        <!-- listen for raw log lines -->
        <list>
          <value>LogLines *</value>
        </list>
      </property>
    </bean>
    <bean id="average" class="AveragePE">
      <property name="keys">
        <list>
          <value>CAItem stationId</value>
        </list>
      </property>
    </bean>
    …
  • 73. Spark Streaming (Real-time – Data analysis) • Spark for real-time processing • Streaming computation as a series of very short batch jobs (windows) • Keeps state in memory • API similar to Spark
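A minimal Spark Streaming sketch with the Java API, processing the sensor stream as a series of 5-second micro-batches; the socket source, host name and field positions are all assumptions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class So2Streaming {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("so2-streaming");
            // Each micro-batch is a very short batch job over 5 seconds of data
            JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(5000));

            JavaReceiverInputDStream<String> lines =
                    ssc.socketTextStream("sensor-gateway", 9999);

            // Sum of SO2 readings per station within each micro-batch
            JavaPairDStream<String, Integer> so2PerStation = lines
                    .mapToPair(line -> {
                        String[] f = line.split(";");
                        int so2 = Integer.parseInt(f[5].replace("\"", "").trim());
                        return new Tuple2<>(f[0], so2);
                    })
                    .reduceByKey(Integer::sum);

            so2PerStation.print();
            ssc.start();
            ssc.awaitTermination();
        }
    }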
  • 74. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  • 75. Hybrid Computation Model • We are at the beginning of this generation • Short-term Big Data processing goal • Abstraction layer over the Lambda Architecture • Promising technologies: SummingBird, Lambdoop
  • 76. SummingBird (Hybrid computation model) • Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model • Scala syntax • The same logic can be executed in batch, real-time and hybrid batch/real-time mode
  • 77. SummingBird (Hybrid computation model)
  • 78. SummingBird (Hybrid computation model) • Pros: hybrid computation model; same programming model for all processing paradigms; extensible • Cons: MapReduce-like programming; Scala; not as abstract as some users would like
  • 79. Lambdoop (Hybrid computation model) • Software abstraction layer over Open Source technologies (Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident) • Common patterns and operations (aggregation, filtering, statistics…) already implemented; no MapReduce-like processes • Same single API for the three processing paradigms: batch processing similar to Pig / Cascading; real-time processing using built-in functions, easier than Trident; hybrid computation model transparent for the developer
  • 80. Lambdoop (Hybrid computation model) • Streaming data and static data are wrapped as Data inputs that flow through a Workflow of Operations, producing new Data
  • 81. Lambdoop (Hybrid computation model) • Batch processing:
    DataInput db_historical = new StaticCSVInput(URI_db);
    Data historical = new Data(db_historical);
    Workflow batch = new Workflow(historical);
    Operation filter = new Filter("Station", "=", 2);
    Operation select = new Select("Titulo", "SO2");
    Operation group = new Group("Titulo");
    Operation average = new Average("SO2");
    batch.add(filter);
    batch.add(select);
    batch.add(group);
    batch.add(average);
    batch.run();
    Data results = batch.getResults();
    …
  • 82. Lambdoop (Hybrid computation model) • Real-time processing:
    DataInput stream_sensor = new StreamXMLInput(URI_sensor);
    Data sensor = new Data(stream_sensor);
    Workflow streaming = new Workflow(sensor, new WindowsTime(100));
    Operation filter = new Filter("Station", "=", 2);
    Operation select = new Select("Titulo", "SO2");
    Operation group = new Group("Titulo");
    Operation average = new Average("SO2");
    streaming.add(filter);
    streaming.add(select);
    streaming.add(group);
    streaming.add(average);
    streaming.run();
    while (true) {
        Data live_results = streaming.getResults();
        …
    }
  • 83. Lambdoop (Hybrid computation model) • Hybrid computation:
    DataInput historical = new StaticCSVInput(URI_folder);
    DataInput stream_sensor = new StreamXMLInput(URI_sensor);
    Data all_info = new Data(historical, stream_sensor);
    Workflow hybrid = new Workflow(all_info, new WindowsTime(1000));
    Operation filter = new Filter("Station", "=", 2);
    Operation select = new Select("Titulo", "SO2");
    Operation group = new Group("Titulo");
    Operation average = new Average("SO2");
    hybrid.add(filter);
    hybrid.add(select);
    hybrid.add(group);
    hybrid.add(average);
    hybrid.run();
    Data updated_results = hybrid.getResults();
  • 84. Lambdoop (Hybrid computation model) • Pros: high abstraction layer for all processing models; covers all steps in the data processing pipeline; same Java API for all programming paradigms; extensible • Cons: ongoing project; not open-source yet; not tested in large clusters yet
  • 85. Agenda 1. Big Data processing 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Conclusions
  • 86. Conclusions • Big Data is not only Hadoop • Identify the processing requirements of your project • Analyze the alternatives for all steps in the data pipeline • The battle for real-time processing is open • Stay tuned for the hybrid computation model
  • 87. Thanks for your attention! Contact us: ruben.casado@treelogic.com info@datadopter.com www.datadopter.com www.treelogic.com MADRID Avda. de Manoteras, 38 Oficina D507 28050 Madrid · España ASTURIAS Parque Tecnológico de Asturias Parcela 30 33428 Llanera - Asturias · España 902 286 386