Simon Ambridge
Data Pipelines With Spark & DSE
An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines
Certified Apache Cassandra and DataStax enthusiast who enjoys
explaining that the traditional approaches to data management just
don’t cut it anymore in the new always on, no single point of failure,
high volume, high velocity, real time distributed data management
world.
Previously 25 years implementing Oracle relational data management
solutions. Certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle
Linux and OBIEE
simon.ambridge@datastax.com
@stratman1958
Simon Ambridge
Pre-Sales Solution Engineer, DataStax UK
Big Data Pipelining: Why Analytics?
• To be able to react to customers faster and with more accuracy
• To reduce the business risk through more accurate understanding of the
market
• Optimise return on marketing investment via better targeted campaigns
• Faster time to market with the right products at the right time
• Improve efficiency – commerce, plant and people
A recent survey found that more than half of respondents wanted real-time analytics:
85% wanted analytics to handle ‘real-time’ data changing at <1s intervals
Big, Static Data
Fast, Streaming Data
Big Data Pipelining: Classification
Big Data Pipelines can mean different things to different people
Repeated analysis on a static but massive dataset
• Typically an element of research – e.g. genomics, clinical
trial, demographic data
• Something that is typically repetitive, iterative, shared
amongst data scientists for analysis
Real-time analytics on streaming data
• Typically an industrialised process – e.g. sensors, tick
data, bioinformatics, transactional data, real-time
personalisation
• Something that is happening in real-time that usually
cannot be dropped or lost
Static Datasets
All You Can Eat?
Really.
Static Analytics: Traditional Approach
Repeated iterations, at each stage
Run/debug cycle can be slow
Data → Sampling → Modeling → Interpret → Tuning → Reporting → Results
(with re-sampling loops back through the earlier stages)
Typical traditional ‘static’ data analysis model
Analytics: Traditional Scaling
Small datasets, small servers
Large datasets, large servers
Big datasets, big servers
Static Analytics: Scale Up Challenges
• Sampling and analysis often run on a single machine
• CPU and memory limitations – finite resources on a single machine
• Offers limited sampling of a large dataset because of data size limitations
• Multiple iterations over large datasets are frequently not an ideal approach
Static Analytics: Big Data Problems
• Data really is getting Big!
• Data is getting bigger!
• The number of data sources is exploding!
• More data is arriving faster!
Scaling up is becoming impractical – physical limits
• The analysis becomes obsolete faster
• Analysis too slow to get any real ROI from the data
Big Data Analytics: Big Data Needs
We need scalable infrastructure + distributed technologies
• Data volumes can be scaled – we can distribute the data
across multiple low-cost machines or cloud instances
• Faster processing – distributed smaller datasets
• More complex processing – distributed across multiple
machines
• No single point of failure
Big Data Analytics: DSE Delivers
Building a distributed data processing framework can be a complex task!
It needs to:
• Be scalable
• Have fast in-memory processing
• Handle real-time or streaming data feeds
• Handle high throughput and low latency
• Ideally handle ad-hoc queries
• Ideally be replicated across multiple data centres for resiliency
DataStax Enterprise: Standard Edition
• Certified Cassandra – delivers trusted, tested and
certified versions of Cassandra ready for
production environments.
• Expert Support – answers and assistance from
the Cassandra experts for all production needs.
• Enterprise Security – supplies full protection for
sensitive data.
• Automatic Management Services – automates key
maintenance functions to keep the database
running smoothly.
• OpsCenter – provides advanced management and
monitoring functionality for production
applications.
DataStax Enterprise: Max Edition
• Advanced Analytics – provides ability to run real-
time and batch analytic operations on Cassandra
data, as well as integrate DSE with external
Hadoop deployments.
• Enterprise Search – supplies built-in enterprise
and distributed search capabilities on Cassandra
data.
• In-Memory Option – delivers all the benefits of
Cassandra to in-memory computing.
• Workload Isolation – allows for analytics and
search functions to run separately from
transactional workloads, with no need to ETL
data to different systems.
Intro To Cassandra: THE Cloud Database
What is Apache Cassandra?
• Originally started at Facebook in 2008
• Top level Apache project since 2010
• Open source distributed database
• Clusters can handle large amounts of data (petabytes)
• Performant at high velocity
• Extremely resilient:
• Across multiple data centres
• No single point of failure
• Continuous Availability, disaster avoidance
• Enterprise Cassandra platform from DataStax
Intro To Spark: THE Analytics Engine
What is Apache Spark?
• Started at UC Berkeley in 2009
• Apache Project since 2010
• Distributed in-memory processing
• Rich Scala, Java and Python APIs
• Fast - 10x-100x faster than Hadoop MapReduce
• 2x-5x less code than R
• Batch and streaming analytics
• Interactive shell (REPL)
• Tightly integrated with DSE
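As a minimal sketch of the RDD API the deck refers to (assuming Spark 1.x and an illustrative local file named data.txt), a classic word count shows the in-memory, distributed processing style:

```scala
// Minimal sketch: Spark 1.x RDD word count (file name is illustrative).
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("data.txt").cache()   // keep the RDD in memory for re-use
    val counts = lines
      .flatMap(_.split("\\s+"))                   // split each line into words
      .map(word => (word, 1))                     // pair each word with a count of 1
      .reduceByKey(_ + _)                         // sum the counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```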
Spark: Daytona Gray Sort Contest
October 2014
Daytona Gray benchmark tests how fast a system can sort 100 TB of data
(1 trillion records)
• Previous world record held by Hadoop MapReduce cluster of 2100 nodes, in 72
minutes
• Spark completed the benchmark in 23 minutes on just 206 EC2 nodes. All the
sorting took place on disk (HDFS), without using Spark’s in-memory cache (3X
faster using 10X fewer machines)
• Spark also sorted 1 PB (10 trillion records) on 190 machines in under 4 hours.
This beats previous results based on Hadoop MapReduce: 16 hours on 3800
machines (4X faster using 20X fewer machines)
DataStax Enterprise: Analytics Integration
Loose integration: separate Cassandra cluster and Spark cluster, linked by ETL
• Data separate from processing
• Millisecond response times
Tight integration: Spark runs inside the DSE cluster
• Apache Cassandra for Distributed Persistent Storage
• Integrated Apache Spark for Distributed Real-Time Analytics
• Analytics nodes close to data - no ETL required
• Data locality
• Microsecond response times
“Latency when transferring data is unavoidable. The trick is to reduce the latency to as close to zero as possible…”
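A hedged sketch of what this tight integration looks like in code, using the DataStax spark-cassandra-connector (the keyspace and table names here are illustrative only):

```scala
// Minimal sketch: reading and writing Cassandra data from Spark via the
// spark-cassandra-connector. "demo", "readings" and "latest_values" are
// hypothetical names, not from the deck.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DSEAnalytics")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD - the connector processes data on the
// nodes that own it, so no separate ETL step is required.
val readings = sc.cassandraTable("demo", "readings")
println(readings.count())

// Write results straight back to Cassandra.
readings
  .map(row => (row.getString("sensor_id"), row.getDouble("value")))
  .saveToCassandra("demo", "latest_values", SomeColumns("sensor_id", "value"))
```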
Intro To Parquet: Quick History
What is Parquet?
• Started at Twitter and Cloudera in 2013
• Databases traditionally store information in rows and are optimized for
working with one record at a time
• Columnar storage systems optimised to store data by column
• A compressed, efficient columnar data representation
• Compression schemes can be specified on a per-column level
• Allows complex data to be encoded efficiently
• Netflix - 7 PB of warehoused data in Parquet format
• Not as compressed as ORC (Hortonworks) but faster read/analysis
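To illustrate the columnar format in practice, a small sketch of writing and reading Parquet from a Spark shell or notebook (Spark 1.x SQLContext assumed; the case class and column names are illustrative):

```scala
// Minimal sketch: write a DataFrame as Parquet, then read back only the
// columns a query needs. Names are illustrative, not from the deck.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)           // sc: an existing SparkContext
import sqlContext.implicits._

case class Variant(chromosome: String, position: Long, genotype: String)

val df = sc.parallelize(Seq(
  Variant("chr1", 10177L, "A/AC"),
  Variant("chr1", 10352L, "T/TA")
)).toDF()

// Columnar and compressed on disk; queries read only the columns they touch.
df.write.parquet("/tmp/variants.parquet")

val variants = sqlContext.read.parquet("/tmp/variants.parquet")
variants.select("chromosome", "position").show()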
Intro To Akka: Distributed Apps
What is Akka?
• Open source toolkit first released in 2009
• Simplifies the construction of highly concurrent and distributed Java apps
• Makes it easier to build concurrent, fault-tolerant and scalable applications
• Based on the ‘actor’ model
• Highly performant event-driven programming
• Hierarchical - each actor is created and supervised by its parent
• Process failures treated as events handled by an actor's supervisor
• Language bindings exist for both Java and Scala
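A minimal sketch of the actor model in Akka (classic Scala actors; the actor and message names are purely illustrative):

```scala
// Minimal sketch: one actor receiving asynchronous messages.
import akka.actor.{Actor, ActorSystem, Props}

case class Reading(sensorId: String, value: Double)

class SensorActor extends Actor {
  def receive = {
    case Reading(id, v) => println(s"sensor $id reported $v")
    case _              => println("unknown message")
  }
}

object AkkaExample extends App {
  // Actors are created (and supervised) through the system or a parent actor.
  val system = ActorSystem("sensors")
  val sensor = system.actorOf(Props[SensorActor], "sensor-1")

  sensor ! Reading("s-42", 21.5)   // fire-and-forget, asynchronous send

  Thread.sleep(500)
  system.terminate()
}
```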
Big Data Pipelining: Static Datasets
Valid data pipeline analysis methods must be:
• Auditable
• Reproducible – essential for any science, so too for Data Science
• Documented – important to understand the how and why
• Controlled
• Suitable for version control
• Collaborative
• Easily accessible
Intro To Notebooks: Features
What are Notebooks?
• Drive your data analysis from the browser
• Increasingly popular
• Highly interactive
• Tight integration with Apache Spark
• Handy tools for analysts:
• Reproducible visual analysis
• Code in Scala, CQL, SparkSQL
• Charting – pie, bar, line etc
• Extensible with custom libraries
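A sketch of what a typical notebook cell might look like with DSE's Spark integration and the spark-cassandra-connector (keyspace and table names are hypothetical):

```scala
// Minimal sketch: query Cassandra data with SparkSQL inside a notebook cell.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val readings = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "demo", "table" -> "readings"))
  .load()

readings.registerTempTable("readings")

// In a notebook the result renders as a table and can be switched to a
// pie / bar / line chart.
sqlContext.sql(
  """SELECT sensor_id, avg(value) AS avg_value
    |FROM readings
    |GROUP BY sensor_id
    |ORDER BY avg_value DESC
    |LIMIT 10""".stripMargin).show()
```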
Big Data Pipelining: Static Datasets
Example architecture & requirements
1. Optimised source data format
2. Distributed in-memory analytics
3. Interactive and flexible data analysis tool
4. Persistent data store
5. Visualisation tools
Big Data Pipelining: Pipeline Flow
ADAM → Notebook → Notebook → Datastore → Visualisation
Example: Genome research platform (SHAR3)
Big Data Pipelining: Pipeline Scalability
• Add more (physical or virtual) nodes as
required to add capacity
• Container tools can ease configuration
management
• Scale out quickly
Big Data Pipelining: Pipeline Process Flow
1. Source data
2. Interactive, flexible and reproducible analysis
3. Persistent data storage
4. Visualise and analyse
Analytics: Static Data Pipeline Process
• No longer an iterative process constrained by hardware limitations
• Now a more scalable, resilient, dynamic, interactive process, easily shareable
The new model for large-scale static data analysis: Load → Analyse → Share
Real-Time Datasets
If it’s Not “Now”, Then It’s Probably Already Too Late
Big Data Pipelining: Real-Time Analytics
• Capture, prepare, and process fast streaming data
• Needs different approach from traditional batch processing
• Has to be at the speed of now – cannot wait even for seconds
• Immediate insight & instant decisions - offers huge commercial
and engineering advantages
What problem are we trying to solve?
Big Data Analytics: Streams
Data streams? Data deluge? Data tidal waves!
Netflix:
• Ingests petabytes of data per day
• Over 1 TRILLION transactions per day (>10 million per second) into DSE
Big Data Pipelining: Real-Time Use Cases
Social media
• Commercial value - trending products, sentiment analysis
• Reaction time is critical as the value of data quickly diminishes over time
e.g. Twitter, Facebook
Sensor data (IoT)
• Critical safety and monitoring
• Missed data could have significant safety implications
• Utility billing, engineering management e.g. power plant, vehicles
Examples of use cases for streaming analytics…
Big Data Pipelining: Real-Time Use Cases
Transactional data
• Missed data could have huge financial implications e.g. market data
• Credit card transactions, fraud detection – if it’s not now, it’s too late
User Experience
• Personalising the user experience
• Commercial benefit to customise the user experience
• Netflix, Spotify, eBay, mobile apps etc.
Examples of use cases for streaming analytics…
Big Data Pipelining: Real-Time architecture
Real-time analytics at scale demands fast processing with low latencies
A common solution is an in-memory distributed architecture
Increasingly this means a technology stack comprising Kafka, Spark and Cassandra
• Scalable
• Distributed
• Resilient
Streaming analytics architecture - what do we need?
Intro To Kafka: Quick History
What is Apache Kafka?
• Originally developed by LinkedIn
• Open sourced since 2011
• Top level project since 2012
• Enterprise support from Confluent
• Fast? - a single Kafka broker handles hundreds of MB/s of reads/writes from thousands
of clients
• Scalable? - can be elastically and transparently expanded without downtime. Data
streams are distributed over a cluster of machines
• Durable? - messages are persisted on disk and replicated within the cluster to prevent data loss
• Powerful? - each broker can handle TBs of messages without performance impact
• Distributed? - modern cluster-centric design, strong durability and fault-tolerance
Intro To Kafka: Architecture
How Does Kafka Work?
Producers send messages to the Kafka cluster, which in
turn serves them up to consumers
• Kafka maintains feeds of messages in categories called topics
• Processes that publish messages to Kafka are called producers
• Processes that subscribe to topics and process the feed are called consumers
• A Kafka cluster comprises one or more servers called brokers
• Java API, other languages supported
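A minimal producer sketch in Scala, using the Kafka Java client API (the broker address and topic name are illustrative only):

```scala
// Minimal sketch: publish a message to a Kafka topic.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SensorProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)

  // Consumers subscribed to the "sensor-readings" topic will receive this
  // message from the brokers.
  producer.send(new ProducerRecord[String, String]("sensor-readings", "s-42", "21.5"))
  producer.close()
}
```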
Intro To Kafka: Streaming Flow
How Does Kafka Work With Spark?
• Publish-subscribe messaging system implemented as a replicated commit
log
• Messages are simply byte arrays so can store any object in any format
• Each topic partition stored as a log (an ordered set of messages)
• Each message in a partition is assigned a unique offset
• Consumers are responsible for tracking their position (offset) in each topic log
Spark consumes messages as a stream, in micro-batches, saved as RDDs
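A hedged sketch of that flow using the Spark 1.x direct-stream API and the spark-cassandra-connector (topic, keyspace and table names are hypothetical):

```scala
// Minimal sketch: consume a Kafka topic in 5-second micro-batches and
// persist each batch to Cassandra.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("sensor-readings"))

// Each micro-batch arrives as an RDD of (key, value) pairs.
stream
  .map { case (sensorId, value) => (sensorId, value.toDouble) }
  .saveToCassandra("demo", "readings_by_sensor", SomeColumns("sensor_id", "value"))

ssc.start()
ssc.awaitTermination()
```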
DataStax Enterprise: Streaming Schematic
Sensor Network → Signal Aggregation / Collection → Messaging Queue (brokers, sensor data queue management) → Data Processing & Storage
DataStax Enterprise: Streaming Analytics
Data Processing & Storage feeds two paths:
• Real-time Analytics - personalisation, actionable insight, monitoring
• Near real-time / batch Analytics - Analytics / BI
DataStax Enterprise: Multi-DC Uses
Real-time active-active geo-replication across physical datacentres: DC: USA ↔ DC: EUROPE
Workload separation via virtual datacentres:
• OLTP: Cassandra
• Analytics: Cassandra + Spark
• Replication between the virtual datacentres
Real-Time Analytics: DSE Multi-DC
Workload Management and Separation With DSE
Mixed-load OLTP and Analytics platform - 100% uptime, global scale
• OLTP datacentre: app, web, social media and IoT feeds; personalisation & persistence
• Analytics datacentre: real-time analytics, personalisation, actionable insight, monitoring; Analytics / BI via JDBC / ODBC
• Replication between the datacentres gives separation of OLTP from Analytics
Lambda & Big Data: DSE & Hadoop
• OLTP feeds (social media, IoT, web, app) into a high-velocity ingestion layer
• Real-time analytic / integration layer - scalable, fault-tolerant, fast
• OLTP layer - 100% uptime, global scale
• Batch analytics over active & legacy data stores (Oracle, IBM, SAP)
• Analytics / BI via JDBC / ODBC
Big Data Use Case: DSE & SAP
• OLTP feeds (social media, IoT, web, app) into a high-velocity ingestion layer
• Real-time analytic / integration layer - scalable, fault-tolerant, fast
• OLTP layer - 100% uptime, global scale
• Hot data storage / query alongside active & legacy data stores (Oracle, IBM, SAP)
• Analytics / BI via SAP/Hana Smart Data Access, JDBC / ODBC
Thank you!
