Big Data and Fast Data - big and fast combined, is it possible?

Big Data (volume) and real-time information processing (velocity) are two important aspects of Big Data systems. At first sight, these two aspects seem incompatible. Are traditional software architectures still the right choice? Do we need new, revolutionary architectures to meet the requirements of Big Data? This presentation discusses the so-called lambda architecture for Big Data, which splits data processing in two: in a batch phase, a large, temporally bounded dataset is processed through either traditional ETL or MapReduce. In parallel, real-time online processing continuously computes values for the new data arriving during the batch phase. Combining the two results, batch and online, gives a constantly up-to-date view. This talk presents how such an architecture can be implemented using products such as Oracle NoSQL, Hadoop and Oracle Event Processing.

Transcript

  • 1. 2013 © Trivadis. BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
WELCOME. Big Data and Fast Data: big and fast combined, is it possible? Guido Schmutz and Albert Blarer, 24 April 2013.
  • 2. Guido Schmutz. Working for Trivadis for more than 16 years; Oracle ACE Director for Fusion Middleware and SOA; co-author of several books; consultant, trainer and software architect for Java, Oracle, SOA and EDA; member of the Trivadis Architecture Board; Technology Manager @ Trivadis; more than 25 years of software development experience. Contact: guido.schmutz@trivadis.com. Blog: http://guidoschmutz.wordpress.com. Twitter: gschmutz
  • 4. Trivadis: the company. With more than 600 IT and business experts on site with you. 11 Trivadis branch offices with more than 600 employees; 200 service level agreements; more than 4000 training participants; research and development budget: CHF 5.0 / EUR 4 million; financially independent and sustainably profitable; experience from more than 1900 projects per year with over 800 customers. As of 12/2012. Locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Wien, Basel, Zürich, Bern, Lausanne, Stuttgart.
  • 5. Credits. Nathan Marz: author of "Big Data: Principles and best practices of scalable realtime data systems" (Manning Press); formerly at BackType and Twitter; creator of Storm, Cascalog and ElephantDB.
  • 6. Agenda. 1. Big Data, what is it? 2. Motivation 3. The Lambda Architecture 4. Implementing the Lambda Architecture 5. Summary
  • 7. Big Data Definition (Gartner et al.). Tera-, peta-, exa-, zetta-, yottabytes and constantly growing. "Traditional" computing in an RDBMS is not scalable enough; we search for "linear scalability". "Only … structured information is not enough"; "95% of produced data is unstructured". Characteristics of Big Data: its Volume, Velocity and Variety in combination, plus Veracity (IBM), i.e. information uncertainty, plus time to action? Big Data + Event Processing = Fast Data.
  • 8. Big Data Emerging Technologies. MapReduce (e.g. Apache Hadoop); event stream processing and CEP (e.g. Storm or Esper); new messaging systems (e.g. Apache Kafka); integration tools (e.g. Spring or Camus); new database paradigms (e.g. NoSQL or NewSQL); data mining tools (e.g. Apache Mahout); data extraction and detection tools (e.g. Apache Tika).
  • 10. Volume Development. [Chart: global data volume in exabytes and aggregate uncertainty in percent, by year, 2005 to 2015. Data sources shown: sensors ("internet of things"); social media (video, audio, text); VoIP (Skype, MSN, ICQ, ...); enterprise data (data dictionary, ERD, ...).]
  • 11. Velocity. Velocity requirement examples: recommendation engine; predictive analytics; marketing campaign analysis; customer retention and churn analysis; social graph analysis; capital markets analysis; risk management; rogue trading; fraud detection; retail banking; network monitoring; research and development.
  • 13. What is a data system? A system that manages the storage and querying of data with a lifetime measured in years, encompassing every version of the application ever to exist, every hardware failure and every human mistake ever made. A data system answers questions based on information that was acquired in the past. Not all bits of information are equal: some information is derived from other information.
  • 14. Desired Properties of a (Big) Data System. Robust and fault-tolerant; low-latency reads and updates; scalable; general; extensible; allows ad hoc queries; minimal maintenance; debuggable.
  • 15. Lack of Human Fault Tolerance: typical problem in today's architectures/systems. Bugs will be deployed to production over the lifetime of a data system. Operational mistakes will be made. Humans are part of the overall system, just like hard disks, CPUs, memory and software; design for human error like you design for any other fault. Examples of human error: deploying a bug that increments counters by two instead of by one; accidentally deleting data from the database; an accidental DoS on an important internal service. The two worst consequences: data loss or data corruption. As long as an error doesn't lose or corrupt good data, you can fix what went wrong.
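The counter bug mentioned on this slide can be sketched in a few lines of Python. This is only an illustration: the click events and both function names are hypothetical. It shows that a bad derived view can be thrown away and recomputed as long as the raw events themselves were never lost or corrupted.

```python
# Hypothetical raw events; they are never modified by the views below.
RAW_CLICKS = ["home", "cart", "home", "home"]

def buggy_view(events):
    """Deployed by mistake: increments each counter by two instead of one."""
    counts = {}
    for page in events:
        counts[page] = counts.get(page, 0) + 2
    return counts

def fixed_view(events):
    """Corrected logic, re-run over the same untouched raw events."""
    counts = {}
    for page in events:
        counts[page] = counts.get(page, 0) + 1
    return counts

wrong = buggy_view(RAW_CLICKS)   # corrupted derived information
right = fixed_view(RAW_CLICKS)   # recovered by recomputation
print(wrong, right)
```

Had the buggy code updated stored counters in place, the correct values could not have been recovered this way.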
  • 16. Lack of Human Fault Tolerance: Mutability (capturing change traditionally). The U and D in CRUD. A mutable system updates the current state of the world. Mutable systems inherently lack human fault-tolerance: it is easy to corrupt or lose data. Example table (Name, City) before the update: Guido, Berne; Albert, Zurich. After the update: Guido, Basel; Albert, Zurich.
  • 17. Lack of Human Fault Tolerance: Immutability (capturing change by storing events). An immutable system captures historical records of events. Each event happens at a particular time and is always true. Example table (Name, City, Timestamp) before the change: Guido, Berne, 1.8.1999; Albert, Zurich, 10.5.1988. After the change, a record has been appended: Guido, Berne, 1.8.1999; Albert, Zurich, 10.5.1988; Guido, Basel, 1.4.2013.
  • 18. Lack of Human Fault Tolerance: Immutability. Immutability greatly restricts the range of errors that can cause data loss or data corruption. Vastly more human fault-tolerant. Much easier to reason about systems based on immutability. Conclusion: your source of truth should always be immutable.
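The Name/City/Timestamp example from the preceding slides can be written as an append-only log. This is a minimal Python sketch, not code from the talk: the `CityFact` type and the helper names are made up for illustration, and the dates are read as day.month.year.

```python
from datetime import date
from typing import NamedTuple

class CityFact(NamedTuple):
    """One immutable fact: a person lives in a city as of a given date."""
    name: str
    city: str
    since: date

# Source of truth: facts are only ever appended, never updated or deleted.
facts = [
    CityFact("Guido", "Berne", date(1999, 8, 1)),
    CityFact("Albert", "Zurich", date(1988, 5, 10)),
]

def record_move(log, name, city, since):
    """Capture a change by appending a new fact (no U or D from CRUD)."""
    return log + [CityFact(name, city, since)]

def current_cities(log):
    """Derive the 'current state' view; later facts override earlier ones."""
    view = {}
    for fact in sorted(log, key=lambda f: f.since):
        view[fact.name] = fact.city
    return view

facts = record_move(facts, "Guido", "Basel", date(2013, 4, 1))
print(current_cities(facts))
```

The old fact (Guido, Berne) is still in the log, so earlier states can be reconstructed at any time.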
  • 19. What about traditional/today's architectures? The source of truth is mutable! Rather than build systems like this …. [Diagram: applications (mobile, web, RIA, rich client) query a mutable database (RDBMS, NoSQL, NewSQL), which is the source of truth.]
  • 20. A different kind of architecture with an immutable source of truth … why not build them like this? [Diagram: immutable data (HDFS, NoSQL, NewSQL, RDBMS) is the source of truth; views on the data are derived from it; applications (mobile, web, RIA, rich client) query the views.]
  • 21. How to create the views on the immutable data? On the fly? Materialized, i.e. pre-computed? [Diagram: a query against a view computed on the fly from the immutable data, versus a query against pre-computed views.]
  • 22. Data = the most raw information. Data is information which is not derived from anywhere else: it is the rawest form of information, and the special information from which everything else is derived. Questions on data can be answered by running functions that take data as input. The most general-purpose data system can answer questions by running functions that take the entire dataset as input: query = function(all data). The lambda architecture provides a general-purpose approach for implementing arbitrary functions on arbitrary datasets.
  • 23. Data = the most raw information. Raw information => data. Favorite Product List Changes: 1.2.13 Add iPAD 64GB; 10.3.13 Add Sony RX-100; 11.3.13 Add Canon GX-10; 11.3.13 Remove Sony RX-100; 12.3.13 Add Nikon S-100; 14.4.13 Add BoseQC-15; 15.4.13 Add MacBook Pro 15; 20.4.13 Remove Canon GX-10. Information => derived. Current Favorite Product List: iPAD 64GB, Nikon S-100, BoseQC-15, MacBook Pro 15. Current Product Count: 4.
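The derivation on this slide is an instance of query = function(all data). A minimal Python sketch follows; the tuple layout and function names are assumptions made for illustration, the events are assumed to be in chronological order, and the product spelling "Canon GX-10" is used for both the add and the remove event.

```python
# Raw, immutable data: the favorite product list changes from the slide.
EVENTS = [
    ("1.2.13", "Add", "iPAD 64GB"),
    ("10.3.13", "Add", "Sony RX-100"),
    ("11.3.13", "Add", "Canon GX-10"),
    ("11.3.13", "Remove", "Sony RX-100"),
    ("12.3.13", "Add", "Nikon S-100"),
    ("14.4.13", "Add", "BoseQC-15"),
    ("15.4.13", "Add", "MacBook Pro 15"),
    ("20.4.13", "Remove", "Canon GX-10"),
]

def current_favorites(events):
    """Derived information: replay all changes to get the current list."""
    favorites = []
    for _day, action, product in events:
        if action == "Add" and product not in favorites:
            favorites.append(product)
        elif action == "Remove" and product in favorites:
            favorites.remove(product)
    return favorites

def product_count(events):
    """A second view derived from the same raw data."""
    return len(current_favorites(events))

print(current_favorites(EVENTS), product_count(EVENTS))
```

Both views are functions of the complete event list, so they can be discarded and recomputed at will.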
  • 24. Big Data and Batch Processing. How to compute the batch views? How to compute queries from the views? [Diagram: incoming data -> immutable data -> batch view -> query.]
  • 25. Big Data and Batch Processing. Using only batch processing always leaves you with a portion of non-processed data. [Timeline diagram: fully processed data up to the last full batch period, followed by the time for the batch job up to now; batch-processed data versus non-processed data.] Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20 But we are not done yet …
  • 26. Adding Real-Time Processing. How to compute real-time views? How to compute queries from the views? [Diagram: incoming data feeds both the immutable data (-> batch views) and a data stream (-> realtime views); the query reads from both kinds of views.]
  • 27. Adding Real-Time Processing. Immutable data (Favorite Product List Changes): 1.2.13 Add iPAD 64GB; 10.3.13 Add Sony RX-100; 11.3.13 Add Canon GX-10; 11.3.13 Remove Sony RX-100; 12.3.13 Add Nikon S-100; 14.4.13 Add BoseQC-15; 15.4.13 Add MacBook Pro 15; 20.4.13 Remove Canon GX-10. Data stream (stream of Favorite Product List Changes): Now Add Canon Scanner. Views computed from both and read by the query. Current Favorite Product List: iPAD 64GB, Nikon S-100, BoseQC-15, MacBook Pro 15, Canon Scanner (now). Current Product Count: 5.
  • 28. Big Data and Real Time Processing. [Timeline diagram: batch processing (e.g. Hadoop) worked fine for the fully processed data up to the last full batch period; real-time processing works for the time the batch job takes, up to now; the end user sees a blended view.] Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20
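The blended view on this timeline can be sketched as follows. This is an illustration under stated assumptions: the page-view events, the integer timestamps and the cutoff value are hypothetical, and a simple count is used because counts from disjoint time ranges can be added.

```python
from collections import Counter

# Hypothetical (timestamp, page) events; the batch job has covered
# everything strictly before BATCH_CUTOFF.
EVENTS = [(1, "home"), (2, "home"), (3, "cart"), (5, "home"), (6, "cart")]
BATCH_CUTOFF = 4

def batch_view(events, cutoff):
    """Counts over the fully processed data (before the cutoff)."""
    return Counter(page for ts, page in events if ts < cutoff)

def realtime_view(events, cutoff):
    """Counts over the data that arrived since the last batch run."""
    return Counter(page for ts, page in events if ts >= cutoff)

def blended_view(batch, realtime):
    """What the end user sees: both views combined."""
    return batch + realtime

blended = blended_view(batch_view(EVENTS, BATCH_CUTOFF),
                       realtime_view(EVENTS, BATCH_CUTOFF))
print(blended)
```

For queries that are not simple sums (for example distinct counts), the merge step needs more care than an addition.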
  • 30. Lambda Architecture. [Diagram: incoming data feeds the batch layer (immutable data -> batch view) and the speed layer (data stream -> realtime view); the serving layer answers the query. The callouts A to G are explained on the next slide.]
  • 31. Lambda Architecture. A. All data is sent to both the batch and the speed layer. B. The master dataset is an immutable, append-only set of data. C. The batch layer pre-computes query functions from scratch; the results are called batch views. The batch layer constantly re-computes the batch views. D. Batch views are indexed and stored in a scalable database so that particular values can be retrieved very quickly. New batch views are swapped in when they are available. E. The speed layer compensates for the high latency of updates to the batch views in the serving layer. F. It uses fast incremental algorithms and read/write databases to produce real-time views. G. Queries are resolved by getting results from both batch and real-time views.
  • 32. Layered Architecture. Batch Layer: stores the immutable, constantly growing dataset; computes arbitrary views from this dataset using BigData technologies (can take hours); can always be recreated. Serving Layer: responsible for indexing and exposing the pre-computed batch views so that they can be queried; exposes the incremented real-time views; merges the batch and the real-time views into a consistent result. Speed Layer: computes the views from the constant stream of data it receives; needed to compensate for the high latency of the batch layer; incremental model, and the views are transient.
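The interplay of the three layers can be sketched in a single small Python class. This is a toy model, not the architecture as implemented in the talk: the class and method names are invented, the query is a simple count per key, and the batch run is assumed to finish instantly (in a real system data keeps arriving while the batch job runs).

```python
from collections import Counter

class LambdaSketch:
    """Toy model of batch, speed and serving layer for a count-per-key query."""

    def __init__(self):
        self.master = []                # batch layer: append-only master dataset
        self.batch_view = Counter()     # serving layer: last pre-computed batch view
        self.realtime_view = Counter()  # speed layer: transient incremental view

    def ingest(self, key):
        """Every incoming record goes to both the batch and the speed layer."""
        self.master.append(key)
        self.realtime_view[key] += 1    # fast incremental update

    def run_batch(self):
        """Recompute the batch view from scratch and swap it in."""
        self.batch_view = Counter(self.master)
        # The speed layer only has to cover data the batch view lacks.
        self.realtime_view = Counter()

    def query(self, key):
        """Resolve a query from both the batch and the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]

system = LambdaSketch()
for key in ["a", "a", "b"]:
    system.ingest(key)
before_batch = system.query("a")
system.run_batch()
system.ingest("a")
after_batch = system.query("a")
print(before_batch, after_batch)
```

Because the real-time view is discarded after each batch run, an error in the incremental logic only survives until the next recomputation.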
  • 34. Lambda Architecture. [Diagram: incoming data flows to the batch layer (all data -> precompute views via batch recompute -> precomputed information) and to the speed layer (process stream via realtime increment -> incremented information); the serving layer holds batch views and real-time views, which are merged to answer the query.] Source: Marz, N. & Warren, J. (2013) Big Data. Manning.
  • 35. Lambda Architecture: one possible product/framework mapping. [Same diagram as on the previous slide.]
  • 36. Implementing Batch Layer. Immutable data: append only; normalized; stores the master copy of all data. Pre-computed information: a function that takes all data as input, query = function(all-data); high latency, batch processing; unrestrained computation; horizontally scalable. [Diagram: batch layer (all data, batch recompute, precompute views, precomputed information); immutable data -> compute -> batch views in the serving layer.]
  • 37. Apache Hadoop HDFS (Batch Layer). HDFS = the Hadoop Distributed File System. A distributed file storage system. Redundant storage. Designed to reliably store data using commodity hardware. Designed to expect hardware failures. Intended for large files. Designed for batch inserts.
  • 38. Apache Hadoop MapReduce (Batch Layer). Hadoop MapReduce is an open source implementation of the MapReduce framework. MapReduce is: a programming model, introduced by Google, for processing large data sets in a distributed environment; the de-facto standard for computing over huge amounts of data; an execution framework for organizing and performing such computations. [Diagram: problem data -> MAP (master node, worker nodes 1 to 3) -> REDUCE -> solution data.]
  • 39. Hadoop MapReduce Flow (Batch Layer). Source: Bill Graham, Twitter Inc.
  • 40. Hadoop MapReduce (Batch Layer).
  • 41. Cascading (Batch Layer). Application framework for Java developers to easily develop robust data analytics and data management applications on Apache Hadoop. Adds an abstraction layer over the Hadoop API. Core concepts of the Cascading API: Pipe: a series of processing steps (parsing, looping, filtering, etc.) defining the data processing to be done. Flow: the association of a pipe (or set of pipes) with a data source and a data sink.
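The pipe and flow concepts can be mimicked in a few lines of Python. To be clear, this is an analogy and not the Cascading API (which is a Java library): `pipe`, `flow`, `parse` and `only_city` are hypothetical helpers, and the in-memory lists stand in for a real data source and data sink.

```python
def pipe(*steps):
    """A pipe: a series of processing steps applied one after another."""
    def run(records):
        for step in steps:
            records = step(records)
        return records
    return run

def parse(lines):
    """Parsing step: turn 'name, city' text lines into records."""
    for line in lines:
        name, city = line.split(",")
        yield {"name": name.strip(), "city": city.strip()}

def only_city(city):
    """Filtering step: keep only the records for one city."""
    def step(records):
        return (record for record in records if record["city"] == city)
    return step

def flow(source, pipeline, sink):
    """A flow: binds a pipe to a data source and a data sink and runs it."""
    sink.extend(pipeline(source))
    return sink

result = flow(["Guido, Basel", "Albert, Zurich"],
              pipe(parse, only_city("Zurich")),
              [])
print(result)
```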
  • 42. Cascading.
  • 43. Apache Pig (Batch Layer). Apache Pig is a platform for analyzing large data sets. Key properties: ease of programming; optimization opportunities; extensibility.
  • 44. Implementing Serving Layer for Batch Views. Need a database that: is batch-writable; makes adding new information atomic; has fast random reads; is scalable; is highly available; can be optimized for storage; allows information to be de-normalized. But no random writes required! Can be a simple database. [Diagram: batch layer (immutable data, precomputed information) -> compute -> batch views in the serving layer.]
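The requirements "batch-writable", "atomic" and "no random writes" can be sketched with an in-memory store that only ever replaces its whole view. This is an illustrative Python model, not a real serving-layer database; the class and method names are assumptions.

```python
import threading

class BatchViewStore:
    """Read-only view that is replaced wholesale by each batch run."""

    def __init__(self):
        self._view = {}
        self._lock = threading.Lock()  # serializes concurrent swap_in calls

    def swap_in(self, new_view):
        """Batch write: publish a complete new view in one step."""
        frozen = dict(new_view)        # copy, so later caller changes don't leak in
        with self._lock:
            self._view = frozen        # readers see the old or the new view, never a mix

    def get(self, key, default=None):
        """Fast random read against the current view."""
        return self._view.get(key, default)

store = BatchViewStore()
first_run = {"home": 1000}
store.swap_in(first_run)
first_run["home"] = -1                 # does not affect the published view
value_after_first = store.get("home")
store.swap_in({"home": 1200, "cart": 40})
print(value_after_first, store.get("home"), store.get("cart"))
```

There is deliberately no method for updating a single key: new information only arrives through the next complete batch view.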
  • 45. SploutSQL (Serving Layer). Full SQL => unlike NoSQL. For BigData => unlike RDBMS. Web latency & throughput => unlike Apache Hive, Apache Drill. Why does it scale? Data is partitioned; partitions are distributed across nodes; adding more nodes increases capacity; generation does not impact serving. Source: Datasalt.
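The scaling argument (partitioned data, partitions spread over nodes) can be illustrated generically. The Python sketch below uses hash partitioning purely as an assumption for illustration; it does not describe how SploutSQL actually partitions or places data, and the node names are made up.

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (same result on every run)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def node_for(partition, nodes):
    """Spread partitions over the available nodes round-robin."""
    return nodes[partition % len(nodes)]

NODES = ["node-a", "node-b", "node-c"]   # hypothetical cluster
NUM_PARTITIONS = 12

placement = {
    key: node_for(partition_for(key, NUM_PARTITIONS), NODES)
    for key in ["customer-1", "customer-2", "customer-3"]
}
print(placement)
```

With a fixed number of partitions, adding a node only changes which node serves a partition, not which partition a key belongs to.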
  • 46. SploutSQL (Serving Layer).
  • 47. Implementing Speed Layer. Stream processing. Continuous computation. Transactional. Storing a limited window of data: compensating for the last few hours of data. All the complexity is isolated in the speed layer: if anything goes wrong, it's auto-corrected by the next batch run. [Diagram: speed layer (process stream, realtime increment, incremented information); data stream -> derive -> realtime views in the serving layer.]
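"Storing a limited window of data" can be sketched as an incremental view that keeps counts per hour and drops the hours the batch layer has already absorbed. This Python model is an assumption-laden illustration: the hourly bucketing, the class name and the method names are not from the talk.

```python
from collections import Counter, defaultdict

class SpeedLayerView:
    """Incremental counts per hour bucket, kept only for recent data."""

    def __init__(self):
        self._buckets = defaultdict(Counter)

    def on_event(self, hour, key):
        """Incremental update: touch only the bucket of the event's hour."""
        self._buckets[hour][key] += 1

    def expire_before(self, batch_cutoff_hour):
        """Drop buckets that the latest batch view now covers."""
        for hour in [h for h in self._buckets if h < batch_cutoff_hour]:
            del self._buckets[hour]

    def count(self, key):
        """Real-time count over the remaining window."""
        return sum(bucket[key] for bucket in self._buckets.values())

view = SpeedLayerView()
view.on_event(9, "home")
view.on_event(10, "home")
view.on_event(11, "cart")
count_before = view.count("home")
view.expire_before(10)          # the batch run has processed everything before hour 10
count_after = view.count("home")
print(count_before, count_after)
```

Any wrong value in an expired bucket disappears with it, which is the auto-correction by the next batch run that the slide refers to.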
  • 48. Apache Kafka. A high-throughput distributed messaging system. Originated at LinkedIn. Sequential disk access.
  • 49. Twitter Storm: the "real-time Hadoop" (Speed Layer, Serving Layer). Storm is a distributed and fault-tolerant real-time computing platform. Data flow model: data flows through a network of transformation entities. Key concepts: Tuple: ordered list of elements. Streams: unbounded sequence of tuples. Spouts: sources of streams. Bolts: process tuples and create new streams. Topologies: directed graph of spouts and bolts. Use cases: stream processing; continuous computation; distributed RPC. [Diagram: problem data from a data source -> SPOUT -> BOLT ("map") -> BOLT ("reduce") -> BOLT ("persist") -> solution data.]
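The spout/bolt data flow can be imitated with Python generators. This is a single-process analogy and not the Storm API: the function names are invented, the stream is finite so that the example terminates, and a plain dict stands in for the store behind the "persist" bolt.

```python
from collections import Counter

def sentence_spout(sentences):
    """Spout: source of a stream; emits one-element tuples."""
    for sentence in sentences:
        yield (sentence,)

def split_bolt(stream):
    """Bolt ('map'): consumes sentence tuples, emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt ('reduce'): keeps running counts, emits (word, count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

def persist_bolt(stream, store):
    """Bolt ('persist'): writes the latest count per word to a store."""
    for word, count in stream:
        store[word] = count
    return store

def run_topology(sentences):
    """Topology: spout -> split -> count -> persist, wired as a chain."""
    return persist_bolt(count_bolt(split_bolt(sentence_spout(sentences))), {})

realtime_counts = run_topology(["fast data", "big data"])
print(realtime_counts)
```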
  • 50. Twitter Storm (Speed Layer, Serving Layer).
  • 51. Twitter Trident (Speed Layer, Serving Layer). Higher-level abstraction over Storm. Trident State; Grouped Stream; Functions, Filters; Aggregators; Query. Similar to Pig and Cascading.
  • 52. Twitter Trident (Speed Layer, Serving Layer).
  • 53. Implementing Serving Layer for Real-Time Views. Incremental updates are made available as real-time views. Requires a database that supports random reads and random writes: relational, NoSQL or NewSQL (in-memory) databases can be used; here we are typically not in the BigData range. Results are only needed until the data has made it through the batch layer. Complexity isolation. [Diagram: data stream -> derive -> realtime views; the speed layer's incremented information is exposed as real-time views in the serving layer.]
  • 54. Cassandra (Serving Layer). Fully distributed, no single point of failure. Linearly scalable. Fault tolerant. Performant. Durable. Integrated caching. Tunable consistency.
  • 55. Implementing Serving Layer: Merge of Batch and Realtime Views. An interesting feature of Storm/Trident is the ability to execute distributed RPC (DRPC) calls in parallel. This can be used to implement the merge functionality when a query is executed. [Diagram: the query is answered in the serving layer by merging batch views and real-time views.]
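The idea of querying both views in parallel and merging the results can be sketched with Python's standard thread pool. This is not Storm DRPC: the two dictionaries are hypothetical stand-ins for the batch view and real-time view databases, and the merge is a simple addition of counts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two view stores.
BATCH_VIEW = {"home": 1000, "cart": 40}
REALTIME_VIEW = {"home": 7, "checkout": 2}

def query_batch(key):
    """Look up the pre-computed count in the batch view."""
    return BATCH_VIEW.get(key, 0)

def query_realtime(key):
    """Look up the incremental count in the real-time view."""
    return REALTIME_VIEW.get(key, 0)

def merged_query(key):
    """Issue both lookups in parallel and merge the two partial results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        batch_future = pool.submit(query_batch, key)
        realtime_future = pool.submit(query_realtime, key)
        return batch_future.result() + realtime_future.result()

print(merged_query("home"), merged_query("checkout"))
```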
  • 56. Storm / Trident DRPC (Serving Layer).
  • 58. Summary: The Lambda Architecture. Can discard batch views and real-time views and recreate everything from scratch. Mistakes are corrected via re-computation. The data storage layer is optimized independently from the query resolution layer. Still in a very early …. But a very interesting idea! Today a zoo of technologies is needed => operations won't like it. Different query languages for batch and real time. An abstraction over the batch and speed layers is needed: Cascading and Trident are already similar. Industry standards needed!
  • 59. THANK YOU. Trivadis AG, Guido Schmutz & Albert Blarer, Europa-Strasse 5, CH-8095 Glattbrugg. info@trivadis.com, www.trivadis.com