Implementing the Lambda Architecture with Spring XD

Recorded at SpringOne2GX 2014.
Speaker: Carlos Queiroz
Big Data Track
The lambda architecture has been proposed as a general-purpose data system that aims to solve the problem of computing arbitrary functions on an arbitrary dataset in (near) real time. In this talk we introduce the lambda architecture and show how it can be implemented with Spring XD, GemFire XD and Hadoop (HDFS and MapReduce) as its foundation. To validate the architecture, we introduce a CDR (Call Detail Record) mining application as a use case. We finish with a demo of the CDR mining application.

Published in: Software

  1. 1. Implementing the Lambda Architecture with Spring XD By Carlos Queiroz © 2014 SpringOne 2GX. All rights reserved. Do not distribute without permission.
  2. 2. Me: Carlos Queiroz, Advanced Field Engineer
  3. 3. Agenda • Introduction to the Lambda Architecture • Applying the Lambda Architecture to a business case • Implementation details
  4. 4. What is Lambda Architecture? • An implementation of a set of desired properties for general-purpose big data systems. • A generic architecture addressing common requirements in big data applications. • A set of design patterns for dealing with historical and operational data.
  5. 5. Approach is new, ideas not so… [Figure: The Big Data Ecosystem. Streaming data (stream computing, analytics on data in motion, non-traditional / non-relational data feeds); data at rest (Hadoop and MapReduce-like models on internet-scale data sets from non-traditional / non-relational data sources); data warehouse (traditional warehouse, analytics on structured data from traditional / relational data sources).]
  6. 6. Why Lambda Architecture? To build a general-purpose data system: one that can answer questions by running functions that take the entire dataset as input.
  7. 7. Desired properties of such systems • Fault-tolerant • Generic • Scales linearly, horizontally • Extensible • Able to achieve low-latency updates when necessary • Able to “ask” the system arbitrary questions
  8. 8. How to support those properties? Decompose the problem into pieces: • Speed Layer • Batch Layer • Serving Layer
  9. 9. Batch Layer [Diagram: Master dataset → Batch Layer → Batch views] Pseudocode: runBatchLayer() { while (true) recomputeFunctions() }
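The batch-layer loop on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the record fields (`house_id`, `value`), the "average load per house" view, and the `iterations` parameter (added so the example terminates) are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from statistics import mean


def recompute_batch_view(master_dataset):
    """Recompute one batch view from scratch: average load per house (hypothetical view)."""
    loads = defaultdict(list)
    for record in master_dataset:  # read-only pass over all raw records
        loads[record["house_id"]].append(record["value"])
    return {house: mean(values) for house, values in loads.items()}


def run_batch_layer(read_master_dataset, publish_view, iterations=None):
    """Mirror of runBatchLayer(): recompute the views again and again.

    iterations=None loops forever, like while(true) on the slide;
    a number is accepted here only so the example can stop.
    """
    done = 0
    while iterations is None or done < iterations:
        publish_view(recompute_batch_view(read_master_dataset()))
        done += 1


# Hypothetical master dataset: append-only list of raw measurements.
master = [
    {"house_id": 1, "value": 10.0},
    {"house_id": 1, "value": 20.0},
    {"house_id": 2, "value": 5.0},
]
published = []
run_batch_layer(lambda: master, published.append, iterations=1)
print(published[-1])
```

Because the view is always rebuilt from the raw data, a bug in the view logic can be fixed by redeploying the function and letting the next iteration overwrite the old view.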
  10. 10. The Master dataset • Source of truth • Recompute • Protect • Write once, read many
  11. 11. Master dataset requirements
       Operation | Requisite | Comments
       Writes | Efficient appends of new data | Basically add new pieces of data. It must be easy to append new data.
       Writes | Scalable storage | Need to handle possibly petabytes of data.
       Reads | Support for parallel processing | Functions usually work on the entire dataset. Need to support handling large amounts of data.
       Reads | Vertically partitioned data | Not necessary to look at all the data all the time. Some functions may need to look at only the relevant data (e.g. 1 week of calls).
       Writes/Reads | Costs for processing; flexible storage | Storage costs money (a lot). Need flexibility in how to store and compress data.
  12. 12. The Master dataset: HDFS (Hadoop Distributed File System)
  13. 13. Serving Layer • Batch updates and random reads • Makes the batch views “queryable” [Diagram: batch views held in a distributed database]
  14. 14. Desired properties of the Serving layer • Low latency • High throughput
  15. 15. Serving layer database requirements • Batch writable • Scalable • Random reads • Fault-tolerant
  16. 16. Batch and serving layers • Batch Layer: master dataset, analytical functions, immutable storage, batch views • Serving Layer: pre-computed analytical functions (views) • Only property missing: low-latency updates
  17. 17. Speed layer: allows arbitrary functions to be computed on arbitrary data in (near) real time. [Diagram: multiple data feeds entering the speed layer]
  18. 18. Speed layer • Narrow but more up-to-date view • Depending on the complexity of the functions, an incremental computation approach is recommended [Diagram: RealTime Functions producing RealTime Views]
  19. 19. Incremental computation: realtime view = function(new data, realtime view)
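As an illustration of realtime view = function(new data, realtime view), here is a minimal Python sketch that maintains a running average per house without revisiting old events. The event fields and the choice of a running average are assumptions made for the example, not details from the talk.

```python
def update_realtime_view(new_event, realtime_view):
    """Return a new view that folds one event into the previous view.

    The view maps house_id -> (count, running_average); both the layout
    and the field names are hypothetical.
    """
    house = new_event["house_id"]
    count, avg = realtime_view.get(house, (0, 0.0))
    count += 1
    avg += (new_event["value"] - avg) / count  # incremental mean update
    updated = dict(realtime_view)              # leave the old view untouched
    updated[house] = (count, avg)
    return updated


view = {}
events = [
    {"house_id": 1, "value": 10.0},
    {"house_id": 1, "value": 20.0},
    {"house_id": 2, "value": 4.0},
]
for event in events:
    view = update_realtime_view(event, view)
print(view)
```

Each update costs constant time per event, which is what makes this style suitable for the speed layer, in contrast to the batch layer's full recomputation.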
  20. 20. Eventual Accuracy • Some computations are harder to compute exactly. • For such cases approximations are used: results are close to, but not exactly, the correct answer. • Sophisticated queries such as realtime machine learning are usually done with eventual accuracy, because they are too complex to be computed exactly.
  21. 21. Approximation & Randomisation • Approximation: find an answer that is correct within some factor (e.g. an answer within 10% of the correct result) • Randomisation: allow a small probability of failure (e.g. 1 in 10,000 answers is wrong)
  22. 22. Synopses structures • Sampling • Sketches • Histograms • Wavelets. Library implementations • algebird (https://github.com/twitter/algebird) • samoa (https://github.com/yahoo/samoa/wiki). Reference: http://charuaggarwal.net/synopsis.pdf
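To make the "sketches" item concrete, the following is a small Count-Min sketch in plain Python. It is a generic textbook structure, not the algebird or samoa API, and the default `width` and `depth` are arbitrary illustration values. It approximates per-item counts in fixed memory and can overestimate, but never underestimate, a count.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency synopsis: estimates are always >= the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting SHA-256 with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


sketch = CountMinSketch()
for plug in ["plug-1", "plug-2", "plug-1", "plug-3", "plug-1"]:
    sketch.add(plug)
print(sketch.estimate("plug-1"))
```

A wider table lowers the expected overestimate, and more rows lower the probability of a bad estimate, which matches the approximation and randomisation trade-offs described on the previous slide.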
  23. 23. Stream processing (real-time function): run the realtime functions to update the realtime views. [Diagram: Data Ingestion → Stream Processing (realtime function)]
  24. 24. One-at-a-time (queues and workers paradigm): divide your processing into worker processes, and put queues between the worker processes.
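The queues-and-workers idea can be sketched with Python's standard library. Threads stand in for the worker processes here, and the two stages (strip whitespace, then parse a number) are hypothetical; a real deployment would use a message broker rather than in-process queues.

```python
import queue
import threading


def worker(inbox, outbox, transform):
    """Process items one at a time; a None sentinel shuts the stage down."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)  # propagate shutdown to the next stage
            break
        outbox.put(transform(item))


raw, parsed, results = queue.Queue(), queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=worker, args=(raw, parsed, str.strip)),
    threading.Thread(target=worker, args=(parsed, results, float)),
]
for stage in stages:
    stage.start()

for line in [" 1.5\n", "2.0 \n"]:
    raw.put(line)
raw.put(None)

for stage in stages:
    stage.join()

output = []
while True:
    item = results.get()
    if item is None:
        break
    output.append(item)
print(output)
```

With one worker per stage, ordering is preserved; adding more workers to a stage increases throughput but gives up that ordering guarantee.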
  25. 25. Generalised one-at-a-time approach • Works on a higher level • Stream computation defined as a graph (usually) • Storm, InfoSphere Streams models • Filters and pipes • Spring XD • At-least-once processing in case of failures • Pipes example: cut -d" " -f1 < access.log | sort | uniq -c | sort -rn | less
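The shell pipeline on this slide counts how often each first field occurs in access.log and lists the most frequent first. A rough Python equivalent of the same filters-and-pipes composition, using in-memory sample lines that are made up for the example, might look like this:

```python
from collections import Counter


def first_field(lines):
    """Equivalent of: cut -d" " -f1"""
    for line in lines:
        yield line.split(" ")[0]


def count_descending(values):
    """Equivalent of: sort | uniq -c | sort -rn"""
    return Counter(values).most_common()


# Hypothetical log lines; the slide reads them from access.log.
log_lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /contact.html",
]
ranking = count_descending(first_field(log_lines))
print(ranking)
```

Each function plays the role of one filter, and passing a generator from one to the next plays the role of the pipe.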
  26. 26. Micro-batching • Small batches of events, processed one batch at a time • Exactly-once processing • Implementations: Spark Streaming [Diagram: Data Ingestion → batch #1, batch #2]
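The grouping step of micro-batching can be sketched as below. This is not Spark Streaming code, and it shows only how a stream is cut into small batches; the exactly-once guarantee mentioned on the slide depends on tracking batch identifiers and state in the engine, which this sketch does not attempt. The batch size and the `process_batch` function are hypothetical.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Cut a (possibly unbounded) event stream into lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Hypothetical per-batch computation: total of the values in the batch."""
    return sum(batch)


totals = [process_batch(batch) for batch in micro_batches(range(1, 8), 3)]
print(totals)
```

Real engines typically cut batches by time interval rather than by event count; a count is used here only to keep the example deterministic.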
  27. 27. Lambda Architecture [Diagram: Data Ingestion sends raw data to both the Speed Layer (analytical functions) and the Batch Layer (analytical functions, immutable storage); the Serving Layer holds the pre-computed analytical functions and answers ad-hoc queries]
  28. 28. Principles of Lambda Architecture • Store data in its rawest form • Immutability and perpetuity • Re-computation • Query = function(all data)
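In practice, query = function(all data) is answered by combining what the batch layer has precomputed with what the speed layer has seen since the last batch run. The sketch below assumes an additive metric (total load per house) so that the two views can simply be summed; the metric, the field names, and the split point are all hypothetical.

```python
def total_load(records):
    """function(data): total load per house over the given records."""
    totals = {}
    for record in records:
        totals[record["house_id"]] = totals.get(record["house_id"], 0.0) + record["value"]
    return totals


def query_total_load(house_id, batch_view, realtime_view):
    """Merge the (stale but complete) batch view with the (recent) realtime view."""
    return batch_view.get(house_id, 0.0) + realtime_view.get(house_id, 0.0)


all_data = [
    {"house_id": 1, "value": 10.0},
    {"house_id": 2, "value": 3.0},
    {"house_id": 1, "value": 5.0},   # arrived after the last batch run
    {"house_id": 2, "value": 1.0},   # arrived after the last batch run
]
batch_view = total_load(all_data[:2])      # what the batch layer has covered
realtime_view = total_load(all_data[2:])   # what only the speed layer has seen

merged = query_total_load(1, batch_view, realtime_view)
print(merged, total_load(all_data)[1])
```

For metrics that are not additive, such as medians, the merge step is harder, which is one reason the speed layer often settles for approximations.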
  29. 29. Implementing the Lambda Architecture: the ACM DEBS 2014 Grand Challenge (http://www.cse.iitb.ac.in/debs2014/). Goal: to demonstrate the applicability of event-based systems to provide scalable, real-time analytics over high-volume sensor data.
  30. 30. ACM DEBS 2014 Challenge: application scenario. Analysis of energy consumption measurements. • Short-term load forecasting: makes load forecasts based on current load and what was learned over historical data • Load statistics for real-time demand management: finds outliers based on the energy consumption [Diagram labels: house_id, dev_id, plug_id, plug]
  31. 31. Data model
       Data source
       Field | Comments
       ID | Unique identifier
       TIMESTAMP | Timestamp of measurement
       VALUE | Measurement
       PROPERTY | Type: 0 - work, 1 - load
       PLUG_ID | Unique identifier of a plug
       HOUSEHOLD_ID | Unique identifier of a household
       HOUSE_ID | Unique identifier of a house
       PL - House
       Field | Comments
       TS | Timestamp of starting time prediction
       HOUSE_ID | House ID
       PREDICTED_LOAD | Predicted load
       PL - Plug
       Field | Comments
       TS | Timestamp of starting time prediction
       HOUSE_ID | House ID
       HOUSEHOLD_ID |
       PLUG_ID |
       PREDICTED_LOAD | Predicted load
       Outliers
       Field | Comments
       TS_START | Timestamp of start time window
       TS_STOP | Timestamp of stop time window
       HOUSE_ID | House ID
       PERCENTAGE | % of plugs with load higher than normal
  32. 32. Analytical models • Load prediction (per house, per plug): based on the average load of the current window and the median of the average loads of the windows covering the same time on all past days: $L(s_{i+2}) = \left(\mathrm{avgLoad}(s_i) + \mathrm{median}(\{\mathrm{avgLoad}(s_j)\})\right)/2$, where $s_j = s_{(i+2-n \cdot k)}$ • Outliers: for each house, calculate the percentage of plugs whose median load during the last hour is greater than the median load of all plugs (in all households of all houses) during the last hour
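Both models on this slide are simple enough to sketch directly. The code below is an illustration, not the implementation from the talk: the function names, the input layouts, the fallback when no history exists, and the reading of "median load of all plugs" as the median over every load value in the window are all assumptions.

```python
from statistics import median


def predict_load(current_avg_load, past_avg_loads):
    """L(s_{i+2}) = (avgLoad(s_i) + median({avgLoad(s_j)})) / 2.

    past_avg_loads holds avgLoad(s_j) for the same time slice on past days.
    With no history, fall back to the current average (an assumption).
    """
    if not past_avg_loads:
        return current_avg_load
    return (current_avg_load + median(past_avg_loads)) / 2


def outlier_percentages(loads_by_plug):
    """Percentage of plugs per house whose median load exceeds the global median.

    loads_by_plug maps (house_id, household_id, plug_id) -> loads in the window.
    """
    all_loads = [value for loads in loads_by_plug.values() for value in loads]
    global_median = median(all_loads)
    plugs_per_house, plugs_above = {}, {}
    for (house_id, _household_id, _plug_id), loads in loads_by_plug.items():
        plugs_per_house[house_id] = plugs_per_house.get(house_id, 0) + 1
        if median(loads) > global_median:
            plugs_above[house_id] = plugs_above.get(house_id, 0) + 1
    return {
        house: 100.0 * plugs_above.get(house, 0) / count
        for house, count in plugs_per_house.items()
    }


forecast = predict_load(10.0, [8.0, 12.0, 9.0])
window = {
    (1, 0, 0): [10.0, 12.0],
    (1, 0, 1): [1.0, 1.0],
    (2, 0, 0): [2.0, 2.0],
}
percentages = outlier_percentages(window)
print(forecast, percentages)
```

The load prediction only needs per-slice averages, so it updates incrementally in the speed layer, while the historical medians are a natural fit for the batch layer.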
  33. 33. How others did it: Storm-based implementation
  34. 34. How others did it: Oracle DB + Oracle Event Processing (OEP)
  35. 35. How others did it: SEEP, Real-Time Stateful Big Data Processing (http://lsds.doc.ic.ac.uk/projects/seep/debs14-gc)
  36. 36. Implementation details • Forked from https://github.com/pivotalsoftware/gfxd-demo • Thanks to Jens Deppe and William Markito
  37. 37. Our approach: ACM DEBS 2014 system architecture [Diagram components: sensor events, RabbitMQ, Spring XD (speed layer), Hadoop (batch layer), GemFire XD (serving layer), Web App (Spring Boot)]
  38. 38. Data ingestion [Diagram: sensor events are published to RabbitMQ; the Spring XD stream app subscribes to them]
  39. 39. Speed layer: application flow [Diagram: • Main flow: Remove duplicates (Filter) → Fix missing values (transform) → Store to GFXD (Sink) • Load Forecast Tap, one for house (H) and one for plug (P) → Load pred. model (Processor) → Store to GFXD (Sink) → RT views • Find outliers Tap → Outliers Model (Processor) → Store to GFXD (Sink) → RT views • One element of the diagram is annotated “not a flow”]
  40. 40. Stream actions • Duplicates filtering • Enrichment (compute time slice) • Compute model • Produce RT view
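The first two stream actions can be sketched as chained Python generators. This is not Spring XD module code: the event fields, the use of the event `id` for duplicate detection, and the 300-second slice length are assumptions for illustration.

```python
def deduplicate(events):
    """Drop events whose id has already been seen.

    The seen-set grows without bound here; a real stream would expire old ids.
    """
    seen = set()
    for event in events:
        if event["id"] in seen:
            continue
        seen.add(event["id"])
        yield event


def enrich_with_time_slice(events, slice_seconds=300):
    """Attach the index of the time slice that each event's timestamp falls into."""
    for event in events:
        yield {**event, "time_slice": event["timestamp"] // slice_seconds}


incoming = [
    {"id": 1, "timestamp": 0, "value": 4.0},
    {"id": 1, "timestamp": 0, "value": 4.0},    # duplicate delivery
    {"id": 2, "timestamp": 301, "value": 6.0},
]
enriched = list(enrich_with_time_slice(deduplicate(incoming)))
print(enriched)
```

The enriched events are what a model step would then aggregate per time slice before producing the RT view.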
  41. 41. Spring XD streams (5 streams) • pumpin: sends all data to a master queue • sensoreventenricher: consumes the data from the queue, then filters and transforms it before storing it on HDFS • findoutliers: taps the master_ds stream to compute the outlier model • loadpred{h,p}: tap the master_ds stream to compute the load prediction for house and plug (2 streams)
  42. 42. Batch Layer • Hadoop-based system • Batch job that is started from Spring XD; uses cron to run every X hours/minutes/days(?) • Stores an immutable and constantly growing master dataset (of all datasets) • Computes arbitrary functions (models) on the existing datasets • Essentially, runs the MR models [Diagram: Load pred. model Job → Hadoop → Batch views]
  43. 43. Spring XD jobs (1 job) • A single job runs the MR model to compute the historical aspect of the models • The MR job runs every “day/hour/min” • The job is completely independent of the other streams • Job steps: load raw_sensor table → compute model → update avg and outliers tables
  44. 44. Serving layer (RT and Batch views) • GemFire XD holds both batch and real-time views • Gets updates from the MR and SP models • Tables: house load prediction (TS, HOUSE_ID, PREDICTED_LOAD); outliers (TS_START, TS_STOP, HOUSE_ID, PERCENTAGE); plug load prediction (TS, HOUSE_ID, HOUSEHOLD_ID, PLUG_ID, PREDICTED_LOAD)
  45. 45. Web UI [Diagram: Web App (Spring Boot) displaying the views]
  46. 46. Why GemFire XD? ‣ Distributed database ‣ Tightly integrated with Pivotal Hadoop ‣ SQL support ‣ Fault-tolerant ‣ In-memory (fast access) [Diagram: GemFire XD tables alongside Hadoop tables]
  47. 47. Why Pivotal HD (PHD)? ‣ Hadoop based ‣ Tightly integrated with GemFire XD [Diagram: GemFire XD tables alongside Hadoop tables]
  48. 48. Why Spring XD? ‣ Data Ingestion ‣ Real-time Analytics ‣ Workflow Orchestration ‣ Integration
  49. 49. Fully open-source implementation: HBase??, Hadoop, Spring XD
  50. 50. Is Lambda Architecture perfect? • Suitable for specific use cases • Doesn't contemplate reference data access • Not always possible to use the same model for real-time and batch • Other ideas to improve the lambda architecture: multiple batch layers, incremental batch layers
  51. 51. Source code: https://bitbucket.org/caxqueiroz/debs2014-challenge
  52. 52. Thank you!!
