Lambdoop, a framework for easy development of big data applications

Most existing Big Data technologies focus on managing large amounts of static data (e.g., Hadoop, Hive, Pig). More recent approaches deal with real-time processing of dynamic data (e.g., Storm, S4). Batch processing of massive static data produces strong results because it can take more information into account and, for example, train better predictive models. But batch processing takes time and is not feasible in domains where response time is critical. Real-time processing solves the latency issue, but it is a weaker approach because the analyzed information is limited in order to achieve low latency. Many domains require the benefits of both batch and real-time processing, and building a software architecture that tailors suitable technologies, software layers, data sources, data storage solutions, smart algorithms and so on into a good scalable solution is not easy. That is where Lambdoop comes in. Lambdoop is a software framework that eases the development of Big Data applications by combining batch and real-time processing approaches. It implements a Lambda-based architecture that provides an abstraction layer to developers, who do not have to deal with different technologies, configurations, data formats and so on: they use the Lambdoop framework as the only API they need. Lambdoop also includes extra tools such as input/output drivers, visualization tools, cluster management tools and widely accepted AI algorithms. To evaluate the effectiveness of Lambdoop we have applied the framework to different real scenarios: 1) analysis and prediction of air quality data; 2) identification of emergent situations from social networks; and 3) quantum chemistry molecular dynamics simulations. The conclusions of these evaluations provide good feedback for improving the framework.


Lambdoop, a framework for easy development of big data applications

  1. 1. A framework for easy development of Big Data applications Rubén Casado ruben.casado@treelogic.com @ruben_casado
  2. 2. Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
  3. 3. About me :-)
  4. 4. Academics: PhD in Software Engineering; MSc in Computer Science; BSc in Computer Science. Work experience.
  5. 5. About Treelogic
  6. 6. Treelogic is an R&D-intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life.
  7. 7. TREELOGIC – Distributor and Sales
  8. 8. Research lines: Computer Vision, Data Science, Social Media Analysis, Semantics, Terahertz technology. Solutions: Security & Safety, Big Data, Justice, Health, Transport, Financial services, R&D Management System, ICT tailored solutions. Projects: International, National, Regional and Internal R&D projects.
  9. 9. 7 ongoing FP7 projects (ICT, SEC, OCEAN), coordinating 5 of them. 3 ongoing Eurostars projects, coordinating all of them.
  10. 10. 7 years' experience in R&D projects. Research & Innovation.
  11. 11. www.datadopter.com
  12. 12. Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
  13. 13. What is Big Data? A massive volume of both structured and unstructured data that is too large to process with traditional database and software techniques.
  14. 14. How is Big Data? Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization - Gartner IT Glossary -
  15. 15. 3 problems Volume Variety Velocity
  16. 16. 3 solutions: Batch processing, Real-time processing, NoSQL
  17. 17. 3 solutions: Batch processing, Real-time processing, NoSQL
  18. 18. Batch processing (Volume) • Scalable • Large amounts of static data • Distributed • Parallel • Fault-tolerant • High latency
  19. 19. Real-time processing (Velocity) • Low latency • Continuous, unbounded streams of data • Distributed • Parallel • Fault-tolerant
  20. 20. Hybrid computation model (Volume + Velocity) • Low latency • Massive data + streaming data • Scalable • Combines batch and real-time results
  21. 21. Hybrid computation model: all data → batch processing → batch results; new data → real-time processing → stream results; batch results + stream results → combination → final results. (A minimal combination sketch follows.)
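The deck does not show the combination logic itself; as a minimal sketch, assuming a simple count metric, the final result is the batch view plus the delta accumulated by the stream view since the last batch run:

// Minimal sketch of the hybrid combination step for a count metric;
// an illustration only, not Lambdoop's actual implementation.
public class HybridCount {
    private long batchCount = 0;   // recomputed over all data by each batch run
    private long streamDelta = 0;  // counts only records seen since that run

    // real-time path: update the stream view for every new record
    public synchronized void onNewRecord() { streamDelta++; }

    // batch path: replace the batch view and reset the stream delta
    public synchronized void onBatchFinished(long recount) {
        batchCount = recount;
        streamDelta = 0;
    }

    // combination: final result = batch result + stream result
    public synchronized long currentCount() { return batchCount + streamDelta; }
}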
  22. 22. Processing paradigms. 1st generation, batch processing (inception 2003, mainstream 2006): large amounts of static data, scalable solutions (Volume). 2nd generation, real-time processing (2010): computing streaming data with low latency (Velocity). 3rd generation, hybrid computation (2014): the Lambda Architecture, Volume + Velocity.
  23. 23. Processing pipeline: DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS
  24. 24. Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
  25. 25. What is Lambdoop?  Open source framework  Software abstraction layer over Open Source technologies o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis  Common patterns and operations (aggregation, filtering, statistics…) already implemented; no need to write MapReduce-style code  Same single API for the three processing paradigms o Batch processing similar to Pig / Cascading o Real-time processing using built-in functions, easier than Trident o Hybrid computation model transparent for the developer
  26. 26. Why Lambdoop? • Building a batch processing application requires: o MapReduce development o Other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog…) o Storage systems (HBase, MongoDB, HDFS, Cassandra…) • Building a real-time processing application requires: o Streaming computing (S4, Storm, Samza) o Unbounded input (Flume, Scribe) o Temporal data stores (in-memory, Kafka, Kestrel)
  27. 27. Why Lambdoop? • Building a hybrid computation system (Lambda Architecture) requires: o Application logic defined in two different systems using different frameworks o Data serialized consistently and kept in sync between the two systems o The developer being responsible for reading, writing and managing two data storage systems, performing a final combination and serving the updated results
  28. 28. Why Lambdoop? “One of the most interesting areas of future work is high level abstractions that map to a batch processing component and a real-time processing component. There's no reason why you shouldn't have the conciseness of a declarative language with the robustness of the batch/real-time architecture.” – Nathan Marz. “Lambda Architecture is an implementation challenge. In many real-world situations a stumbling block for switching to a Lambda Architecture lies with a scalable batch processing layer. Technologies like Hadoop (…) are there but there is a shortage of people with the expertise to leverage them.” – Rajat Jain
  29. 29. Lambdoop building blocks: streaming data and static data enter as Data objects, Operations process Data, and Workflows connect Operations to produce new Data.
  30. 30. Lambdoop processing modes: Batch, Real-Time, Hybrid.
  31. 31. Data Input  Information is represented as Data objects o Types: StaticData and StreamingData o Every Data object has a Schema describing its fields (types, nullables, keys…) o A Data object is composed of Datasets.
  32. 32. Data Input  Dataset o A Data object is formed by one or more Datasets o All Datasets of a Data object share the same Schema o Datasets are formed by Register objects o A Register is composed of RegisterFields (see the sketch below).
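To make the containment hierarchy concrete, here is a stand-in sketch of those concepts in Java; the real Lambdoop classes are not shown in the deck, so the fields below are illustrative assumptions:

import java.util.List;

// Illustrative stand-ins for the data model described above; field and
// type choices are assumptions, not the real Lambdoop API.
class Schema { List<String> fieldNames; List<String> fieldTypes; } // fields, types, nullables, keys…
class RegisterField { String name; Object value; }                 // one field of one record
class Register { List<RegisterField> fields; }                     // one record
class Dataset { List<Register> registers; }                        // registers sharing the Data's Schema
class DataObject { Schema schema; List<Dataset> datasets; }        // StaticData / StreamingData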
  33. 33. Data Input  Schema o Very similar to Avro definition schemas o Allows defining the input data's structure, fields, types, nullables… o JSON format. (The slide shows sample air-quality records with fields Station, Title, Lat., Lon., Date, SO2, NO, CO, PM10, O3, dd, vv, TMP, HR and PRB, followed by the schema:)
{
  "type": "csv",
  "name": "AirQuality records",
  "fieldSeparator": ";",
  "PK": "",
  "header": "true",
  "fields": [
    {"name": "Station", "type": "string", "index": 0},
    {"name": "Title", "type": "string", "index": 1, "nullable": "true"},
    {"name": "Lat.", "type": "double", "index": 2, "nullable": "true"},
    {"name": "Long.", "type": "double", "index": 3, "nullable": "true"},
    …
    {"name": "PRB", "type": "double", "index": 20, "nullable": "true"}
  ]
}
  34. 34. Data Input  Importing data into Lambdoop o Loaders: import information from multiple sources and store it in HDFS as Data objects o Producers: get streaming data and represent it as Data objects o Heterogeneous sources o Serialize information into Avro format
  35. 35. Data Input • Static Data example: importing an air quality dataset from local logs to HDFS o Loader o Schema's path is files/csv/Air_quality_schema
// Read schema from a file
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
  36. 36. Data Input • Streaming Data example: reading streaming sensor data from a TCP port o Producer o Weather stations emit messages to port 8080 o Schema's path is files/csv/Air_quality_schema
int port = 8080;
// Read schema
String schema = readSchemaFile(schema_file);
Producer producer = new TCPProducer("AirQualityListener", refresh, port, schema);
// Create Data object
Data data = new StreamingData(producer);
  37. 37. Data Input  Extensibility o Users can implement their own data loaders/producers: 1) extend the Loader/Producer interface; 2) read data from the original source; 3) get and serialize the information (Avro format) according to the Schemas. A sketch follows.
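A hedged sketch of those three steps for a custom loader; the actual Loader interface is not shown on the slide, so the stand-in below invents a single load() hook and skips the real Avro serialization:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

// Stand-in for the Lambdoop Loader interface (step 1: extend/implement it).
interface Loader { List<String[]> load() throws Exception; }

// Example: a custom loader for semicolon-separated air-quality logs.
class AirQualityLogLoader implements Loader {
    private final String path;
    AirQualityLogLoader(String path) { this.path = path; }

    @Override
    public List<String[]> load() throws Exception {
        List<String[]> registers = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {   // step 2: read the original source
                registers.add(line.split(";"));        // step 3: split fields per the Schema;
            }                                          // a real loader would emit Avro records
        }
        return registers;
    }
}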
  38. 38. Operations  Unitary actions to process data  An Operation takes Data as input, processes it and produces another Data as output  Types of operations:  Aggregation: produces a single value per Dataset  Filter: output data has the same schema as input data  Group: produces several Datasets, grouping registers together  Projection: changes the Data schema but preserves the records and their values  Join: combines different Data objects
  39. 39. Operations
Aggregation (1): Count, Average, Sum, MinValue, MaxValue, Max, Min
Aggregation (2): Skewness, Z-Test, Stderror, Variance, Covariance, Mode
Filter: Filter, Limit, TopN, BottomN
Group: Group, RollUp, Cube, N-Til
Projection: Select, Frecuency, Variation
Join: Inner Join, Left Join, Right Join, Outer Join
  40. 40. Operations  Extensibility (User Defined Operations): new operations can be defined by implementing a set of interfaces:  OperationFactory: factory used by the framework to get the batch, streaming and hybrid operation implementations when needed  BatchOperation: provides the MapReduce logic to process the input Data  StreamingOperation: provides Storm/Trident-based functions to process streaming registers  HybridOperation: provides the merging logic between streaming and batch results
  41. 41. Operations  User Defined Operation interfaces
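Since these interfaces appear only as an image in the deck, the signatures below are guesses sketched for illustration, not the real API:

import java.util.List;

// Guessed shapes for the four extension points named on slide 40.
interface BatchOperation {
    // MapReduce-style logic over the whole static input
    List<double[]> runBatch(List<double[]> registers);
}
interface StreamingOperation {
    // Storm/Trident-style function applied to each streaming register
    double[] runStreaming(double[] register);
}
interface HybridOperation {
    // merging logic between the batch view and the streaming view
    double[] merge(double[] batchResult, double[] streamResult);
}
interface OperationFactory {
    // the framework asks the factory for the right implementation when needed
    BatchOperation batch();
    StreamingOperation streaming();
    HybridOperation hybrid();
}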
  42. 42. Workflows  A sequence of connected Operations. A Workflow manages tasks and resources (check-points) to produce an output from input data and a set of Operations o BatchWorkflow: runs a set of operations on a StaticData input and produces a new StaticData as output o StreamingWorkflow: operates on a StreamingData to produce another StreamingData o HybridWorkflow: combines static and streaming data to produce complete, up-to-date results (StreamingData)  Workflow connections: the output Data of one Workflow can be the input of the next (Data → Workflow → Data → Workflow → …), as sketched below.
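A short chaining sketch, under the assumption that getResults() returns a Data object accepted by the next Workflow constructor; the deck implies this connection pattern but never demonstrates it, and all class names are taken from the slides that follow:

// Hypothetical chaining of two batch workflows.
Workflow first = new BatchWorkflow(input);
first.addOperation(new Filter(new RegisterField("Title"),
                              ConditionType.EQUAL, new StaticValue("street 45")));
first.run();
Data filtered = first.getResults();            // output of the first workflow...

Workflow second = new BatchWorkflow(filtered); // ...is the input of the next
second.addOperation(new Avg(new RegisterField("SO2")));
second.run();
Data result = second.getResults();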
  43. 43. Workflows
// Batch processing example
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 45"));
// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
Data output = wf.getResults();
  44. 44. Workflows
// Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));
// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
    …
}
  45. 45. Workflows
// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);
Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);
// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
}
  46. 46. Results exploitation: workflow results (Filter, RollUp, StdError, Avg, Select, Cube, Variance, Join, …) can be exported as CSV, JSON, …
  47. 47. Results exploitation  Visualization
/* Produce from Twitter */
TwitterProducer producer = new TwitterProducer(…);
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to workflow */
wf.addOperation(new Count());
…
/* Get results from workflow */
Data results = wf.getResults();
/* Show results. Set dashboard refresh */
Dashboard d = new Dashboard(config);
d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "Tweets count"));
  48. 48. Results exploitation  Visualization
  49. 49. Results exploitation  Visualization
  50. 50. Results exploitation  Export (CSV, JSON, …)
Data data = new StaticData(loader);
Workflow wf = new BatchWorkflow(data);
/* Add operations to workflow */
wf.addOperation(new Count());
…
/* Get results from workflow */
Data results = wf.getResults();
/* Export results */
Exporter.asCSV(results, file);
MongoExport(results, conf);    // conf: Map<String, String>
PostgresExport(results, conf); // conf: Map<String, String>
  51. 51. Results exploitation  Alarms
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to workflow */
wf.addOperation(new Count());
…
/* Get results from workflow */
Data results = wf.getResults();
/* Set alarm. condition: true/false (e.g. time or a certain value);
   action: what to execute (e.g. show results, send an email) */
AlarmFactory.setAlert(results, condition, action);
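The slide leaves condition and action abstract; one plausible reading, sketched with plain Java functional interfaces rather than the real Lambdoop types:

import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical stand-in for AlarmFactory.setAlert(...); the real types
// behind condition and action are not shown in the deck.
class AlarmSketch {
    static <T> void setAlert(T results, Predicate<T> condition, Consumer<T> action) {
        if (condition.test(results)) {  // condition: evaluates to true/false
            action.accept(results);     // action: what to execute when it fires
        }
    }

    public static void main(String[] args) {
        Double so2Average = 14.2;       // pretend this came from wf.getResults()
        setAlert(so2Average,
                 avg -> avg > 10.0,                                    // e.g. threshold exceeded
                 avg -> System.out.println("Alert! SO2 avg: " + avg)); // e.g. notify someone
    }
}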
  52. 52. Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
  53. 53.  Change configurations and easily manage the cluster  Friendly tools for monitoring the health of the cluster  Wizard-driven installation of new Lambdoop nodes
  54. 54.  Visual editor for defining workflows and scheduling tasks o Plugin for Eclipse o Visual elements for: input sources, loaders, operations, operation parameters (RegisterFields, static values) and visualization elements o Generates workflow code o XML import/export o Scheduling of workflows
  55. 55. • Tool for working with messy big data, cleaning it and transforming it • Import data in different formats • Explore datasets • Apply advanced cell transformations • Refine inconsistencies • Filter and partition your big data
  56. 56. Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
  57. 57. Social Awareness Based Emergency Situation Solver  Objective: to create event-assessment and decision-making support tools that improve speed and efficiency when facing emergency situations  Exploit the information available in social networks to complement data about emergency situations  Real-time processing
  58. 58. Alert detection: locations, information, “attached” resources (photos, videos, links, …)
  59. 59. Air quality monitoring  Static stations and mobile sensors in Asturias sending streaming data  Historical data of more than 10 years  Monitoring, trend identification, predictions  Batch processing + Real-time processing + Hybrid computation
  60. 60.  Quantum mechanics molecular dynamics  Computer simulation of the physical movements of microscopic elements  Large amounts of data streamed at each time-step  Real-time interaction (queries, visual exploration) during the simulation  Data analytics on the whole dataset  Real-time processing + Batch processing + Hybrid computation
  61. 61. Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
  62. 62. Conclusions • Big Data is not only batch processing • Implementing a Lambda Architecture is not trivial • Lambdoop: Big Data made easy • High abstraction layer for all processing models • Covers all steps in the data processing pipeline • Same Java API for all processing paradigms • Extensible
  63. 63. Conclusions • Roadmap – Now: release an early version of the Lambdoop framework as open source; get feedback from the community; increase the set of built-in functions – Next: stable versions of the Lambdoop ecosystem; move all components to YARN; models (Mahout, Jubatus, Samoa, R) – Beyond: configurable processing engines (Spark, S4, Samza…); configurable data stores (Cassandra, MongoDB, ElephantDB, VoltDB…)
  64. 64. If you want to stay tuned about Lambdoop, register at www.lambdoop.com ruben.casado@treelogic.com info@datadopter.com www.lambdoop.com www.datadopter.com @ruben_casado @datadopter www.treelogic.com @treelogic
