@CasertaConcepts
Real Time Big Data
Processing on AWS
Presented by:
About Caserta Concepts
• Consulting firm focused on Data Innovation and Modern Data Engineering to solve
highly complex business data challenges
• Award-winning company
• Internationally recognized workforce
• Mentoring, Training, Knowledge Transfer
• Strategy, Architecture, Implementation
• Innovation Partner
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Leader in architecting and implementing enterprise data solutions
• Data Warehousing
• Business Intelligence
• Big Data Analytics
• Data Science
• Data on the Cloud
• Data Interaction & Visualization
• Strategic Consulting
• Technical Design
• Build & Deploy Solutions
Client Portfolio
Retail/eCommerce
& Manufacturing
Digital Media/AdTech
Education & Services
Finance, Healthcare
& Insurance
Partners
Awards & Recognition
Come out and Play
CIL - Caserta
Innovations Lab
Experience
Big Data Warehousing Meetup
• Established in 2012 in NYC
• Meet monthly to share data best
practices, experiences
• 3,300+ Members
http://www.meetup.com/Big-Data-Warehousing/
Examples of Previous Topics
• Data Governance, Compliance &
Security in Hadoop w/Cloudera
• Real Time Trade Data Monitoring
with Storm & Cassandra
• Predictive Analytics
• Exploring Big Data Analytics
Techniques w/Datameer
• Using a Graph DB for MDM &
Relationship Mgmt
• Data Science w/Claudia
Perlich & Revolution Analytics
• Processing 1.4 Trillion Events
in Hadoop
• Building a Relevance Engine
using Hadoop, Mahout & Pig
• Big Data 2.0 – YARN Distributed
ETL & SQL w/Hadoop
• Intro to NoSQL w/10GEN
REALTIME Analytics
Presented by:
Elliott Cordo
Chief Architect, Caserta Concepts
What is real-time?
• Latency between data creation and analytics?
• Is it the speed with which we can retrieve the answer?
In most cases it’s both.
So, how real time?
How do we measure:
• 1 Hour?
• 5 Minutes?
• Seconds?
• Microseconds?
For all practical purposes:
• As fast as possible
• Fast enough to deliver the required insights
• “Near-Real-Time”
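The two questions above (how stale is the data, and how long does the answer take) can be added together into one number and compared with a target. This is a minimal sketch; the function names, the 45-second lag, and the target values are illustrative assumptions, not figures from the talk.

```python
from datetime import datetime, timedelta


def end_to_end_latency(created_at, available_at, query_seconds):
    """Total delay seen by the user: data latency plus query latency."""
    data_latency = (available_at - created_at).total_seconds()
    return data_latency + query_seconds


def meets_target(created_at, available_at, query_seconds, target_seconds):
    """True when the pipeline is fast enough for the required insight."""
    total = end_to_end_latency(created_at, available_at, query_seconds)
    return total <= target_seconds


# Hypothetical event: created at 10pm, queryable 45 seconds later,
# and the dashboard query itself takes 2 seconds.
created = datetime(2016, 1, 1, 22, 0, 0)
available = created + timedelta(seconds=45)
total = end_to_end_latency(created, available, query_seconds=2.0)
```

Whether 47 seconds counts as "real time" depends only on the target: it passes a one-minute requirement and fails a 30-second one.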
Real time
Two main methods:
• Micro-batch → “traditional” ETL, just faster
• Events based → events are “pushed” or “pulled”
through a pipeline
Microbatch
• Traditional batch ETL concepts
• Identify and accrue a batch of data that needs to be processed
• Batch Control → where did I last leave off
• CDC – Change Data Capture → what changed
• Process all accrued data in a single batch
Rinse and Repeat
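The micro-batch steps above can be sketched in a few lines. This is an illustration only: the `updated_at` column, the `state` dictionary used as the batch-control marker, and the in-memory `rows` list are assumptions standing in for a real source table and control table.

```python
def run_micro_batch(source_rows, state, process):
    """Process every row newer than the last high-water mark, then advance it.

    source_rows: list of dicts with an increasing 'updated_at' value (assumed).
    state: dict holding the batch-control marker under 'last_processed'.
    process: callable applied to the whole accrued batch.
    """
    last = state.get("last_processed", 0)
    # CDC step: select only what changed since the previous batch.
    batch = [row for row in source_rows if row["updated_at"] > last]
    if not batch:
        return 0
    # Process all accrued data in a single batch.
    process(batch)
    # Batch control step: record where we left off, only after success.
    state["last_processed"] = max(row["updated_at"] for row in batch)
    return len(batch)


rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
state = {}
loaded = []

first = run_micro_batch(rows, state, loaded.extend)   # picks up rows 1 and 2
rows.append({"id": 3, "updated_at": 30})
second = run_micro_batch(rows, state, loaded.extend)  # picks up only row 3
```

In practice the function would be called on a schedule (every few minutes, for example). Because the marker is advanced only after `process` succeeds, a failed run can simply be repeated.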
Pros and Cons to Microbatch
• Pros:
• Leverage existing batch ETL code
• Data can have a known cutoff window → “Sales as of 10pm”
• Wide array of technologies
• Easy to troubleshoot and debug
• Easy to recover from failures → replay the batch
• Cons:
• Results are not real time → a snapshot “as of” some time prior
• Can be difficult to support increasingly tight SLAs
Technologies for Microbatch
• All the usual suspects:
• Traditional ETL tools
• Hadoop Ecosystem → Pig and Hive
• Code → Python, SQL, Scala, etc.
• Apache Spark (batch, streaming*)
• New AWS Services → Kinesis Firehose
• Load Data to S3 and Redshift Directly from a Kinesis Stream
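As a rough illustration of the Firehose option, a producer might push events to a delivery stream and let the service handle the load into S3 or Redshift. The sketch below assumes a boto3 Firehose client is passed in (for example `boto3.client("firehose")`), that events are JSON-serializable dicts, and that newline-delimited JSON is the desired file format. The stream name is hypothetical, and the default chunk size of 500 should be checked against the current `PutRecordBatch` limits.

```python
import json


def send_to_firehose(client, stream_name, events, chunk_size=500):
    """Send events to a Firehose delivery stream in chunks.

    client: a boto3 Firehose client (assumed to be created by the caller).
    Returns the number of records the service reported as accepted.
    """
    sent = 0
    for start in range(0, len(events), chunk_size):
        chunk = events[start:start + chunk_size]
        # One newline-terminated JSON document per record.
        records = [
            {"Data": (json.dumps(event) + "\n").encode("utf-8")}
            for event in chunk
        ]
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=records
        )
        # FailedPutCount reports records the service did not accept.
        sent += len(records) - response.get("FailedPutCount", 0)
    return sent
```

A call would look like `send_to_firehose(boto3.client("firehose"), "sales-stream", events)`. A production producer would also retry the records counted in `FailedPutCount`, which this sketch does not do.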
Events based
• Data is processed as it is ingested → not accrued and processed as a
batch
• As close to real-time as you can get
• Typically the source is a message queue
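By contrast with a batch, an events-based consumer handles each message the moment it is pulled from the queue. The sketch below uses Python's in-process `queue.Queue` as a stand-in for a real message queue; the `sale` field and the `None` stop marker are assumptions made for the demo.

```python
import queue


def consume(events, handle, stop_marker=None):
    """Handle each event individually as soon as it is pulled from the queue."""
    processed = 0
    while True:
        event = events.get()       # blocks until an event is available
        if event is stop_marker:   # sentinel used only to end this demo
            break
        handle(event)              # no accrual: process immediately
        processed += 1
    return processed


q = queue.Queue()
for amount in (5, 7, 3):
    q.put({"sale": amount})
q.put(None)

running_total = {"sales": 0}


def add_sale(event):
    """Update the metric per event, so it is current after every message."""
    running_total["sales"] += event["sale"]


count = consume(q, add_sale)
```

The running total is up to date after every single event, rather than only after a batch completes.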
Events Based Pros and Cons
Pros:
• Near real time processing
Cons:
• Generally more difficult (development and administrative)
• Generally does not eliminate batch ETL
• Typically a different code base than existing batch ETL
• Can be difficult to recover from failure
Technologies for Event Based
• Apache Storm
• Apache Spark*
• CEP Engines
• New AWS Services → AWS Lambda
Lambda Architecture
Speed and Batch Layer
• Batch ETL and Real-time are used together
• Real-time insights from Speed
• Cleanup/correction and advanced calculations performed by Batch
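One way to picture the two layers working together: queries read the batch layer's totals plus whatever the speed layer has seen since the last batch run. The sketch below is a simplification under stated assumptions: both views are plain dictionaries of additive counts, the store names are made up, and each batch run is assumed to cover everything the speed layer has seen so far.

```python
def merged_view(batch_view, speed_view):
    """Combine the batch layer's totals with the speed layer's recent increments."""
    result = dict(batch_view)
    for key, value in speed_view.items():
        result[key] = result.get(key, 0) + value
    return result


def on_batch_complete(new_batch_view, speed_view):
    """After a batch run, corrected batch totals replace the speed-layer estimates."""
    speed_view.clear()  # assumes the batch covered everything seen so far
    return new_batch_view


batch = {"store_1": 100, "store_2": 40}   # as of the last batch run
speed = {"store_1": 5, "store_3": 2}      # events since that run

now = merged_view(batch, speed)

# The next batch run recomputes totals (here correcting store_1 to 104).
batch = on_batch_complete({"store_1": 104, "store_2": 40, "store_3": 2}, speed)
after_batch = merged_view(batch, speed)
```

The speed layer supplies the fresh numbers, and the batch layer periodically overwrites them with cleaned-up results.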
Data Stores
• Microbatch architecture → many options, based on data size
and usage patterns
• Events Based → NoSQL, In-Memory, Search:
• Write throughput requirements
• Fast reads
• Simplicity
• But we sacrifice query flexibility:
• Decisions about what metrics are “real-time”
• More ETL
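The trade-off above can be illustrated with a key-value style store that keeps only pre-aggregated counters. This is a sketch using an in-memory dictionary in place of a real NoSQL or in-memory database; the class, metric names, and key layout are assumptions.

```python
from collections import defaultdict


class RealTimeMetrics:
    """Key-value counters: fast writes and reads, but only for metrics chosen up front."""

    def __init__(self):
        self._counters = defaultdict(float)

    def record_sale(self, store, minute, amount):
        # Every metric we want in real time must be written explicitly.
        self._counters[("sales_by_store_minute", store, minute)] += amount
        self._counters[("sales_by_minute", minute)] += amount

    def sales_by_store(self, store, minute):
        return self._counters.get(("sales_by_store_minute", store, minute), 0.0)

    def sales_total(self, minute):
        return self._counters.get(("sales_by_minute", minute), 0.0)


metrics = RealTimeMetrics()
metrics.record_sale("nyc", "22:00", 10.0)
metrics.record_sale("bos", "22:00", 5.0)
metrics.record_sale("nyc", "22:01", 2.5)

total_2200 = metrics.sales_total("22:00")
nyc_2200 = metrics.sales_by_store("nyc", "22:00")
```

Reads are single key lookups, which keeps them fast and simple. A question such as "sales by product" cannot be answered, because no such key was ever written; supporting it means deciding on the new metric up front and adding more ETL.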
Thank You / Q&A
Elliott Cordo
Chief Architect, Caserta Concepts
1-855-755-2246
elliott@casertaconcepts.com


Editor's Notes

  • #6 Awards & recognition: a consequence of having built a strong, innovative business. Recognized in the market in 2013, 2014, and 2015, which demonstrates sustained recognition over the years and not just many years ago. Recent: 5th of IT in NYC.
  • #7 Developing the next set of best practices; talking to practitioners; understanding current trends in the marketplace; staying relevant and ahead of the curve; creating a sense of community by sharing best practices and past experiences.