Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades

Invited talk by Dr. Rubén Casado, given on 19 November 2014 at the Facultad de Informática (Universidad Complutense de Madrid).

Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades

  1. 1. Dr. Rubén Casado ruben.casado@treelogic.com ruben_casado Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades UNIVERSIDAD COMPLUTENSE DE MADRID 19 November 2014
  2. 2. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  3. 3.  PhD in Software Engineering  MSc in Computer Science  BSc in Computer Science Academics Work Experience
  4. 4. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  5. 5. A massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques What is Big Data?
  6. 6. Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization How is Big Data? - Gartner IT Glossary -
  7. 7. 3 problems Volume Variety Velocity
  8. 8. 3 solutions Batch processing NoSQL Streaming processing
  9. 9. 3 solutions Batch processing NoSQL Streaming processing
  10. 10. Volume Variety Velocity Science or Engineering?
  11. 11. Science or Engineering? Volume Variety Value Velocity
  12. 12. Science or Engineering? Volume Variety Value Velocity Software Engineering Data Science
  13. 13.  Relational Databases  Schema based  ACID (Atomicity, Consistency, Isolation, Durability)  Performance penalty  Scalability issues  NoSQL  Not Only SQL  Families of solutions  Google BigTable, Amazon Dynamo  BASE = Basically Available, Soft state, Eventually consistent  CAP = Consistency, Availability, Partition tolerance NoSQL
  14. 14.  Key-value  Key: ID  Value: associated data  Dictionary  LinkedIn Voldemort  Riak, Redis  Memcache, Membase  Document  More complex than K-V  Documents are indexed by ID  Multiple indexes  MongoDB  CouchDB  Column  Tables with predefined families of fields  Fields within families are flexible  Vertical and horizontal partitioning  HBase  Cassandra  Graph  Nodes  Relationships  Neo4j  FlockDB  OrientDB. Examples: key-value CR7: 'Cristiano Ronaldo'; document CR7: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}; column CR7: [Personal: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}, Job: {Team: 'R. Madrid', Salary: 20,000,000}]; graph [CR] is_named [Cristiano], [CR] plays_for [R.Madrid] NoSQL
  15. 15. • Scalable • Large amount of static data • Distributed • Parallel • Fault tolerant • High latency Batch processing Volume
  16. 16. • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant Streaming processing Velocity
  17. 17. • Low latency: real-time • Massive data-at-rest + data-in-motion • Scalable • Combine batch and streaming results Hybrid computation model Volume Velocity
  18. 18. All data New data Batch processing Streaming processing Batch results Stream results Combination Final results Hybrid computation model
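To make the combination step concrete, the following minimal Java sketch (illustrative only; the PartialAverage class and combine method are hypothetical and not part of any framework) shows how the final result for the air-quality example could be produced by merging the batch results with the stream results:

// Illustrative sketch of the "Combination" step: merge batch results with stream results.
// All names here are hypothetical.
public class PartialAverage {
    public final double sum;   // accumulated SO2 values
    public final long count;   // number of measurements

    public PartialAverage(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Merge the batch view (historical data) with the stream view (recent data)
    public static double combine(PartialAverage batchView, PartialAverage streamView) {
        double totalSum = batchView.sum + streamView.sum;
        long totalCount = batchView.count + streamView.count;
        return totalCount == 0 ? 0.0 : totalSum / totalCount;
    }
}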
  19. 19.  Batch processing  Large amount of static data  Scalable solution  Volume  Streaming processing  Computing streaming data  Low latency  Velocity  Hybrid computation  Lambda Architecture  Volume + Velocity Inception (2003), 1st Generation (2006), 2nd Generation (2010), 3rd Generation (2014) Processing Paradigms
  20. 20. Batch +10 years of Big Data processing technologies 2003 2004 2005 2013 2011 2010 2008 The Google File System MapReduce: Simplified Data Processing on Large Clusters Doug Cutting starts developing Hadoop 2006 Yahoo! starts working on Hadoop Apache Hadoop is in production Nathan Marz creates Storm Yahoo! creates S4 2009 Facebook creates Hive Yahoo! creates Pig MillWheel: Fault-Tolerant Stream Processing at Internet Scale LinkedIn presents Samza LinkedIn presents Kafka Cloudera presents Flume 2012 Nathan Marz defines the Lambda Architecture Streaming Hybrid 2014 Spark stack is open sourced Lambdoop & Summingbird first steps Stratosphere becomes Apache Flink
  21. 21. Processing Pipeline DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS
  22. 22.  Static stations and mobile sensors in Asturias sending streaming data  Historical data of > 10 years  Monitoring, trends identification, predictions Air Quality case study
  23. 23. 1. Big Data processing overview 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  24. 24. Batch processing technologies DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS o HDFS commands o Sqoop o Flume o Scribe o HDFS o HBase o MapReduce o Hive o Pig o Cascading o Spark o SparkSQL (Shark)
  25. 25. • Import to HDFS hadoop dfs -copyFromLocal <path-to-local> <path-to-remote> hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/ HDFS commands DATA ACQUISITION BATCH
  26. 26. • Tool designed for transferring data between HDFS/HBase and structured datastores • Based on MapReduce • Includes connectors for multiple databases o MySQL, o PostgreSQL, o Oracle, o SQL Server and o DB2 o Generic JDBC connector • Java API Sqoop DATA ACQUISITION BATCH
  27. 27. 1) Import data from database to HDFS: sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 2) Analyze data (HADOOP) 3) Export results to database: sqoop export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 Sqoop DATA ACQUISITION BATCH
  28. 28. • Service for collecting, aggregating, and moving large amounts of log data • Simple and flexible architecture based on streaming data flows • Reliability, scalability, extensibility, manageability • Supported log stream types o Avro o Syslog o Netcat Flume DATA ACQUISITION BATCH
  29. 29. • Architecture o Source • Waiting for events. o Sink • Sends the information towards another agent or system. o Channel • Stores the information until it is consumed by the sink. Built-in components: Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom. Channels: Memory, JDBC, File, Custom. Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase. Flume DATA ACQUISITION BATCH
  30. 30. Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis  Air quality syslogs Flume DATA ACQUISITION BATCH Station; Tittle;latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
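As an illustration of such a pipeline, a minimal Flume agent configuration for this scenario could look like the sketch below. The agent name, port and HDFS path are assumptions, not taken from the slides:

# Hypothetical agent "a1": syslog source -> memory channel -> HDFS sink
a1.sources = sensors
a1.channels = mem
a1.sinks = hdfs-sink

a1.sources.sensors.type = syslogtcp
a1.sources.sensors.host = 0.0.0.0
a1.sources.sensors.port = 5140
a1.sources.sensors.channels = mem

a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000

a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.channel = mem
a1.sinks.hdfs-sink.hdfs.path = /hdfs/AirQuality
a1.sinks.hdfs-sink.hdfs.fileType = DataStream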
  31. 31. • Server for aggregating log data streamed in real time from a large number of servers • There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups • The central scribe server(s) can write the messages to the files that are their final destination Scribe DATA ACQUISITION BATCH
  32. 32. • Sending a sensor message to a Scribe server Scribe DATA ACQUISITION BATCH
category = 'mobile'  // '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …'
message = sensor_log.readLine()
log_entry = scribe.LogEntry(category, message)
// Create a Scribe client
client = scribe.Client(iprot=protocol, oprot=protocol)
transport.open()
result = client.Log(messages=[log_entry])
transport.close()
  33. 33. • Distributed FileSystem for Hadoop • Master-Slaves Architecture (NameNode–DataNodes) o NameNode: Manage the directory tree and regulates access to files by clients o DataNodes: Store the data • Files are split into blocks of the same size and these blocks are stored and replicated in a set of DataNodes HDFS DATA STORAGE BATCH
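For illustration, data can also be written into HDFS programmatically through the FileSystem Java API; a minimal sketch (the paths reuse the earlier illustrative ones):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Same effect as "hadoop dfs -copyFromLocal"
        fs.copyFromLocalFile(new Path("/home/hduser/AirQuality/"), new Path("/hdfs/AirQuality/"));
        fs.close();
    }
}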
  34. 34. • Open-source non-relational distributed column-oriented database modeled after Google's BigTable • Random, realtime read/write access to the data • Not a relational database o Very light «schema» • Rows are stored in sorted order DATA STORAGE BATCH HBase
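A minimal sketch of writing and reading one measurement with the classic (pre-1.0) HBase Java client; the table name, column family and row-key design are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AirQualityHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "air_quality");        // hypothetical table
        // Row key: station id + date
        Put put = new Put(Bytes.toBytes("1_2001-01-01"));
        // Column family "m" (measurements), qualifier "SO2"
        put.add(Bytes.toBytes("m"), Bytes.toBytes("SO2"), Bytes.toBytes("7"));
        table.put(put);
        Result row = table.get(new Get(Bytes.toBytes("1_2001-01-01")));
        System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("m"), Bytes.toBytes("SO2"))));
        table.close();
    }
}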
  35. 35. • Framework for processing large amounts of data in parallel across a distributed cluster • Loosely inspired by the classic Divide and Conquer (D&C) strategy • The developer has to implement Map and Reduce functions: o Map: takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes, parsed into the format <K, V> o Reduce: collects the <K, List(V)> pairs and generates the results MapReduce DATA ANALYTICS BATCH
  36. 36. • Design Patterns o Joins: reduce-side join, replicated join, semi join o Sorting: secondary sort, total order sort o Filtering o Statistics: AVG, VAR, Count, … o Top-K o Binning o … MapReduce DATA ANALYTICS BATCH
  37. 37. • Obtain the SO2 average of each station MapReduce Station; Tittle;latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23"; DATA ANALYTICS BATCH
  38. 38. • Mappers read the input records and emit <Station_ID, SO2_value> pairs (e.g. <1, 6>, <1, 2>, <3, 1>, <1, 9>, <3, 9>, <2, 6>, <2, 0>, <2, 8>), which the shuffling phase then groups by Station_ID MapReduce DATA ANALYTICS BATCH
  39. 39. • Each Reducer receives <Station_ID, [SO2_1, SO2_2, …, SO2_n]> (grouped by the shuffling phase), sums the values and divides by the count to compute the average per station, producing e.g. (Station_ID, AVG_SO2): (1, 2.013), (2, 2.695), (3, 3.562) MapReduce DATA ANALYTICS BATCH
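A minimal Hadoop MapReduce implementation of this job could look like the sketch below. Field positions follow the semicolon-separated sample above; the class names and the quote/header handling are illustrative assumptions, and the two classes would normally live in separate files or as static nested classes of a driver:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit one <Station_ID, SO2_value> pair per record
class SO2Mapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(";");
        if (!fields[0].trim().startsWith("\"")) return;   // skip the header line
        String station = fields[0].replace("\"", "").trim();
        double so2 = Double.parseDouble(fields[5].replace("\"", "").trim());
        context.write(new Text(station), new DoubleWritable(so2));
    }
}

// Reducer: receive <Station_ID, list of SO2 values> and compute the average
class SO2AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text station, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable v : values) { sum += v.get(); count++; }
        context.write(station, new DoubleWritable(sum / count));
    }
}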
  40. 40. Hive • Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets • Abstraction layer on top of MapReduce • SQL-like language called HiveQL • Metastore: central repository of Hive metadata DATA ANALYTICS BATCH
  41. 41. Hive • Obtain the SO2 average of each station DATA ANALYTICS BATCH
CREATE TABLE air_quality (Estacion INT, Titulo STRING, latitud DOUBLE, longitud DOUBLE, Fecha STRING, SO2 INT, NO INT, CO FLOAT, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality;
SELECT Titulo, avg(SO2) FROM air_quality GROUP BY Titulo;
  42. 42. • Platform for analyzing large data sets • High-level language for expressing data analysis programs: Pig Latin, a data flow programming language • Abstraction layer on top of MapReduce • Procedural language Pig DATA ANALYTICS BATCH
  43. 43. Pig DATA ANALYTICS BATCH • Obtain the SO2 average of each station
air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';') AS (estacion:chararray, titulo:chararray, latitud:chararray, longitud:chararray, fecha:chararray, so2:double, no:chararray, co:chararray, pm10:chararray, o3:chararray, dd:chararray, vv:chararray, tmp:chararray, hr:chararray, prb:chararray, rs:chararray, ll:chararray, ben:chararray, tol:chararray, mxil:chararray, pm25:chararray);
grouped = GROUP air_quality BY estacion;
avg = FOREACH grouped GENERATE group, AVG(air_quality.so2);
DUMP avg;
  44. 44. • Cascading is a data processing API and query planner used for defining, sharing, and executing data-processing workflows • Makes the development of complex Hadoop MapReduce workflows easy • In the same way as Pig DATA ANALYTICS BATCH Cascading
  45. 45. Cascading DATA ANALYTICS BATCH • Obtain the SO2 average of each station
// define source and sink Taps
Tap source = new Hfs(sourceScheme, inputPath);
Scheme sinkScheme = new TextLine(new Fields("Estacion", "SO2"));
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);
Pipe assembly = new Pipe("avgSO2");
assembly = new GroupBy(assembly, new Fields("Estacion"));
// For every Tuple group
Aggregator avg = new Average(new Fields("SO2"));
assembly = new Every(assembly, avg);
// Tell Hadoop which jar file to use
Flow flow = flowConnector.connect("avg-SO2", source, sink, assembly);
// execute the flow, block until complete
flow.complete();
  46. 46. Spark • Cluster computing system for faster data analytics • Not a modified version of Hadoop • Compatible with HDFS • In-memory data storage for very fast iterative processing • MapReduce-like engine • API in Scala, Java and Python DATA ANALYTICS BATCH
  47. 47. Spark DATA ANALYTICS BATCH • Hadoop is slow due to replication, serialization and IO tasks
  48. 48. Spark DATA ANALYTICS BATCH • 10x-100x faster
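As an illustration, the same SO2-average job written against the Spark Java API (1.x era) might look as follows; the input path and field handling are assumptions mirroring the earlier examples:

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkSO2Average {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SO2Average");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("/hdfs/AirQuality/");
        JavaRDD<String> records = lines.filter(l -> l.startsWith("\""));   // skip header lines

        // Emit <Station_ID, (SO2, 1)> pairs, then sum values and counts per station
        JavaPairRDD<String, Tuple2<Double, Long>> pairs = records.mapToPair(l -> {
            String[] f = l.split(";");
            String station = f[0].replace("\"", "").trim();
            double so2 = Double.parseDouble(f[5].replace("\"", "").trim());
            return new Tuple2<>(station, new Tuple2<>(so2, 1L));
        });
        JavaPairRDD<String, Tuple2<Double, Long>> sums =
            pairs.reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()));
        JavaPairRDD<String, Double> averages = sums.mapValues(p -> p._1() / p._2());

        List<Tuple2<String, Double>> result = averages.collect();
        for (Tuple2<String, Double> t : result) {
            System.out.println(t._1() + " -> " + t._2());
        }
        sc.stop();
    }
}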
  49. 49. Spark SQL • Large-scale data warehouse system for Spark • SQL on top of Spark (aka Shark) • Actually HiveQL over Spark • Up to 100x faster than Hive DATA ANALYTICS BATCH
  50. 50. Pros • Faster than the Hadoop ecosystem • Easier to develop new applications o (Scala, Java and Python API) Cons • Not tested in extremely large clusters yet • Problems when the Reducer's data does not fit in memory DATA ANALYTICS BATCH Spark
  51. 51. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  52. 52. Real-time processing technologies DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS o Flume o Kafka o Kestrel o Flume o Storm o Trident o S4 o Spark Streaming
  53. 53. Flume DATA ACQUISITION STREAM
  54. 54. • Kafka is a distributed, partitioned, replicated commit log service o Producer/Consumer model o Kafka maintains feeds of messages in categories called topics o Kafka is run as a cluster Kafka DATA STORAGE STREAM
  55. 55. Insert the AirQuality sensor log file into a Kafka cluster and consume the info. Kafka DATA STORAGE STREAM
// new Producer
Producer<String, String> producer = new Producer<String, String>(config);
// Open sensor log file
BufferedReader br = …
String line;
while (true) {
  line = br.readLine();
  if (line == null) … // wait
  else producer.send(new KeyedMessage<String, String>(topic, line));
}
  56. 56. AirQuality consumer Kafka DATA STORAGE STREAM
ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(1));
Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap);
KafkaMessageStream stream = consumerMap.get(topic).get(0);
ConsumerIterator it = stream.iterator();
while (it.hasNext()) {
  // consume it.next()
}
  57. 57. • Simple distributed message queue • A single Kestrel server has a set of queues (strictly-ordered FIFO) • On a cluster of Kestrel servers, they don't know about each other and don't do any cross communication • Kestrel vs Kafka o Kafka consumers are cheaper (basically just the bandwidth usage) o Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation o Kafka has significantly better throughput o Kestrel does not support ordered consumption Kestrel DATA STORAGE STREAM
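Since Kestrel speaks the memcache text protocol, any memcached client can act as producer and consumer: a SET enqueues an item and a GET dequeues the next one. A minimal sketch using the spymemcached Java client (the client choice, port and queue name are assumptions):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class KestrelExample {
    public static void main(String[] args) throws Exception {
        // 22133 is Kestrel's default memcache port
        MemcachedClient queue = new MemcachedClient(new InetSocketAddress("localhost", 22133));
        // Enqueue one sensor reading on the "air_quality" queue (sample payload)
        queue.set("air_quality", 0, "1;43.5298;-5.6734;2014-11-19;7");
        // Dequeue the next item from the queue
        Object next = queue.get("air_quality");
        System.out.println(next);
        queue.shutdown();
    }
}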
  58. 58. Interceptor • Interface org.apache.flume.interceptor.Interceptor • Can modify or even drop events based on any criteria • Flume supports chaining of interceptors • Types: o Timestamp interceptor o Host interceptor o Static interceptor o UUID interceptor o Morphline interceptor o Regex Filtering interceptor o Regex Extractor interceptor DATA ANALYTICS STREAM Flume
  59. 59. • The sensors' information must be filtered to keep only "Station 2" o An interceptor will filter the information between the Source and the Channel. Flume DATA ANALYTICS STREAM Station; Tittle;latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; "3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; "2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; "1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  60. 60. Flume DATA ANALYTICS STREAM
# Write format can be text or writable …
# Defining channel – Memory type …
# Defining source – Syslog …
# Defining sink – HDFS …
# Defining interceptor
agent.sources.source.interceptors = i1
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter
// Pseudocode of the interceptor logic:
class StationFilter implements Interceptor …
  if the event's station field is not "2", discard the event; otherwise pass it on
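Filled out, such a custom interceptor could look like the following sketch. The class name and field parsing are assumptions; note that in a real agent configuration the interceptor type must point to the nested Builder (e.g. StationFilterInterceptor$Builder):

import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical interceptor: keep only events whose first field is station "2"
public class StationFilterInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String line = new String(event.getBody());
        String station = line.split(";")[0].replace("\"", "").trim();
        return "2".equals(station) ? event : null;   // returning null drops the event
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<Event>();
        for (Event e : events) {
            Event out = intercept(e);
            if (out != null) kept.add(out);
        }
        return kept;
    }

    @Override
    public void close() { }

    // Flume instantiates interceptors through a Builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() { return new StationFilterInterceptor(); }
        @Override
        public void configure(Context context) { }
    }
}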
  61. 61. Storm DATA ANALYTICS STREAM • Distributed and scalable realtime computation system • Doing for real-time processing what Hadoop did for batch processing • Hadoop vs Storm equivalences: JobTracker → Nimbus, TaskTracker → Supervisor, Job → Topology • Topology: processing graph. Each node contains processing logic (spouts and bolts). Links between nodes are streams of data o Spout: source of streams. Reads a data source and emits the data into the topology as a stream o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams o Stream: unbounded sequence of tuples. Tuples can contain any serializable object
  62. 62. • AirQuality average values o Step 1: build the topology: CAReader (Spout) → LineProcessor (Bolt) → AvgValues (Bolt) Storm DATA ANALYTICS STREAM
  63. 63. • AirQuality average values o Step 1: build the topology Storm DATA ANALYTICS STREAM
TopologyBuilder AirAVG = new TopologyBuilder();
AirAVG.setSpout("ca-reader", new CAReader(), 1);
// shuffleGrouping -> even distribution
AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3).shuffleGrouping("ca-reader");
// fieldsGrouping -> fields with the same value go to the same task
AirAVG.setBolt("ca-avg-values", new AvgValues(), 2).fieldsGrouping("ca-line-processor", new Fields("id"));
  64. 64. Storm • AirQuality average values o Step 2: CAReader implementation (IRichSpout interface) DATA ANALYTICS STREAM
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
  // Initialize file
  BufferedReader br = new …
  …
}
public void nextTuple() {
  String line = br.readLine();
  if (line == null) {
    return;
  } else
    collector.emit(new Values(line));
}
  65. 65. Storm • AirQuality average values o Step 3: LineProcessor implementation (IBasicBolt interface) DATA ANALYTICS STREAM
public void declareOutputFields(OutputFieldsDeclarer declarer) {
  declarer.declare(new Fields("id", "stationName", "lat", …
}
public void execute(Tuple input, BasicOutputCollector collector) {
  collector.emit(new Values((Object[]) input.getString(0).split(";")));
}
  66. 66. Storm • AirQuality average values o Step 4: AvgValues implementation (IBasicBolt interface) DATA ANALYTICS STREAM
public void execute(Tuple input, BasicOutputCollector collector) {
  // totals and counts are hashmaps with each station's accumulated values
  if (totals.containsKey(id)) {
    item = totals.get(id);
    count = counts.get(id);
  } else {
    // Create new item
  }
  // update values
  item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2")));
  item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no")));
  …
}
  67. 67. • High level abstraction on top of Storm o Provides high level operations (joins, filters, projections, aggregations, functions…) Pros o Easy, powerful and flexible o Incremental topology development o Exactly-once semantics Cons o Very few built-in functions o Lower performance and higher latency than Storm Trident DATA ANALYTICS STREAM
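A minimal Trident sketch for the air-quality stream is shown below; the spout and the ExtractStation function are hypothetical, and the aggregation uses the built-in Count since built-in functions are limited:

import backtype.storm.topology.IRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

// Hypothetical Function that extracts the station id from each CSV line
class ExtractStation extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
        collector.emit(new Values(tuple.getString(0).split(";")[0]));
    }
}

// Build the topology: group measurements by station and keep a running count per station
public class StationCountTopology {
    public static TridentTopology build(IRichSpout caSpout) {   // caSpout: hypothetical spout emitting a "line" field
        TridentTopology topology = new TridentTopology();
        topology.newStream("ca-reader", caSpout)
                .each(new Fields("line"), new ExtractStation(), new Fields("station"))
                .groupBy(new Fields("station"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology;
    }
}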
  68. 68.  Simple Scalable Streaming System  Distributed, Scalable, Fault-tolerant platform for processing continuous unbounded streams of data  Inspired by MapReduce and Actor models of computation o Data processing is based on Processing Elements (PE) o Messages are transmitted between PEs in the form of events (Key, Attributes) o Processing Nodes are the logical hosts to PEs S4 DATA ANALYTICS STREAM
  69. 69. • AirQuality average values S4 DATA ANALYTICS STREAM
…
<bean id="split" class="SplitPE">
  <property name="dispatcher" ref="dispatcher"/>
  <property name="keys">
    <!-- Listen for incoming log lines -->
    <list>
      <value>LogLines *</value>
    </list>
  </property>
</bean>
<bean id="average" class="AveragePE">
  <property name="keys">
    <list>
      <value>CAItem stationId</value>
    </list>
  </property>
</bean>
…
  70. 70. Spark Streaming • Spark for real-time processing • Streaming computation as a series of very short batch jobs (windows) • Keep state in memory • API similar to Spark DATA ANALYTICS STREAM
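A minimal Spark Streaming sketch in Java (1.x API); the socket source, host/port and the station filter are illustrative assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class AirQualityStreaming {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("AirQualityStreaming");
        // Process the stream as micro-batches of 5 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));

        // Sensor lines arriving on a TCP socket
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        // Keep only the readings of station "2" and print each micro-batch
        JavaDStream<String> station2 = lines.filter(l -> l.startsWith("\"2\""));
        station2.print();

        jssc.start();
        jssc.awaitTermination();
    }
}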
  71. 71. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  72. 72. • We are at the beginning of this generation • Short-term Big Data processing goal • Abstraction layer over the Lambda Architecture • Promising technologies o SummingBird o Lambdoop Hybrid Computation Model
  73. 73. SummingBird • Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model • Scala syntax • The same logic can be executed in batch, real-time and hybrid batch/real-time mode HYBRID COMPUTATION MODEL
  74. 74. SummingBird HYBRID COMPUTATION MODEL
  75. 75. Pros • Hybrid computation model • Same programming model for all processing paradigms • Extensible Cons • MapReduce-like programming • Scala • Not as abstract as some users would like SummingBird HYBRID COMPUTATION MODEL
  76. 76.  Software abstraction layer over Open Source technologies o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident  Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process  Same single API for the three processing paradigms o Batch processing similar to Pig / Cascading o Real-time processing using built-in functions, easier than Trident o Hybrid computation model transparent for the developer Lambdoop HYBRID COMPUTATION MODEL
  77. 77. Lambdoop Data Operation Data Workflow Streaming data Static data HYBRID COMPUTATION MODEL
  78. 78. Lambdoop HYBRID COMPUTATION MODEL
DataInput db_historical = new StaticCSVInput(URI_db);
Data historical = new Data(db_historical);
Workflow batch = new Workflow(historical);
Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");
batch.add(filter);
batch.add(select);
batch.add(group);
batch.add(average);
batch.run();
Data results = batch.getResults();
…
  79. 79. Lambdoop HYBRID COMPUTATION MODEL
DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data sensor = new Data(stream_sensor);
Workflow streaming = new Workflow(sensor, new WindowsTime(100));
Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");
streaming.add(filter);
streaming.add(select);
streaming.add(group);
streaming.add(average);
streaming.run();
while (true) {
  Data live_results = streaming.getResults();
  …
}
  80. 80. Lambdoop HYBRID COMPUTATION MODEL
DataInput historical = new StaticCSVInput(URI_folder);
DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data all_info = new Data(historical, stream_sensor);
Workflow hybrid = new Workflow(all_info, new WindowsTime(1000));
Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");
hybrid.add(filter);
hybrid.add(select);
hybrid.add(group);
hybrid.add(average);
hybrid.run();
Data updated_results = hybrid.getResults();
  81. 81. Pros • High abstraction layer for all processing models • Covers all steps in the data processing pipeline • Same Java API for all programming paradigms • Extensible Cons • Ongoing project • Not open-source yet • Not tested in large clusters yet Lambdoop HYBRID COMPUTATION MODEL
  82. 82. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  83. 83. Open Issues • Interoperability between well-known techniques / technologies (SQL, R) and Big Data platforms (Hadoop, Spark) • European technologies (Stratosphere / Apache Flink) • Massive Streaming Machine Learning • Real-time Interactive Visual Analytics • Vertical (domain-driven) solutions
  84. 84. Conclusions Casado R., Younas M. Emerging trends and technologies in big data processing. Concurrency Computat.: Pract. Exper. 2014
  85. 85. Conclusions • Big Data is not only Hadoop • Identify the processing requirements of your project • Analyze the alternatives for all steps in the data pipeline • The battle for real-time processing is open • Stay tuned for the hybrid computation model
  86. 86. Thanks for your attention! Questions? ruben.casado@treelogic.com ruben_casado
