Low-Latency Analytics with NoSQL
Joe Caserta
June 18, 2014
Storm / Cassandra
Quick Intro - Joe Caserta
• 1986: Began consulting in database programming and data modeling
• 1996: Dedicated to Data Warehousing and Business Intelligence ever since
• 2001: Founded Caserta Concepts in NYC
• 2004: Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley); web log analytics solution published in Intelligent Enterprise
• 2009: Launched Big Data practice; formalized alliances/partnerships with system integrators
• 2010: Partnered with Big Data vendors: Cloudera, Hortonworks, Datameer, and more…
• 2012: Launched Big Data Warehousing Meetup in NYC (950+ members); established best practices for big data ecosystem implementation
• 2013: Launched training practice, teaching data concepts worldwide; listed as a Top 20 Data Analytics Consulting Company by CIO Review
• 25+ years of hands-on experience building database solutions, with a laser focus on extending data warehouses with Big Data solutions
Expertise & Offerings
• Strategic Roadmap / Assessment / Education / Implementation
• Data Warehousing / ETL / Data Integration
• BI / Visualization / Analytics
• Big Data Analytics
Client Portfolio
• Finance, Healthcare & Insurance
• Retail/eCommerce & Manufacturing
• Education & Services
Listed as one of the 20 Most Promising
Data Analytics Consulting Companies
CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones at the forefront of tackling real analytics challenges. A distinguished panel comprising CEOs, CIOs, VCs, industry analysts, and the editorial board of CIOReview selected the final 20.
Caserta Concepts
Why is Data Analytics so important?
[Architecture diagram: source systems (Sales, Marketing, Finance) feed, via ETL, both a traditional Enterprise Data Warehouse (ad-hoc query, canned reporting, traditional BI) and a horizontally scalable Big Data cluster optimized for analytics: HDFS across nodes N1-N5 running Mahout, MapReduce, and Pig/Hive, alongside NoSQL databases and others, supporting Big Data Analytics and Data Science.]
Challenges With Big Data
Velocity
• Data is coming in so fast; how do we monitor it?
• Real real-time analytics
• Relevance engines, financial fraud sensors, early warning sensors
Veracity
• Dealing with sparse, incomplete, volatile, and highly manufactured data
• Agile enough to adapt quickly to a changing business
Variety
• A wider breadth of datasets and sources in scope requires larger data repositories
• Most of the world's data is unstructured, semi-structured, or multi-structured
Volume
• Data volume is growing, so processes must rely more on programmatic administration
• Less dependence on people and manual process
What’s Important Today (according to Joe)
• Hadoop distributions: Cloudera, MapR, Hortonworks, Pivotal-HD
• Tools:
  • Mahout: machine learning
  • Hive: map data to structures and use SQL-like queries
  • Pig: data transformation language for big data, from Yahoo
  • Storm: real-time ETL
• NoSQL:
  • Document: MongoDB, CouchDB
  • Graph: Neo4j, Titan
  • Key-value: Riak, Redis
  • Columnar: Cassandra, HBase
• Languages: SQL, Python, SciPy, Java
• Predictive modeling: R, SAS, SPSS
Why talk about Storm & Cassandra?
[Architecture diagram: source systems (ERP, Finance, Legacy) feed, via ETL, a traditional EDW (ad-hoc/canned reporting, traditional BI) and a horizontally scalable Big Data cluster optimized for analytics (HDFS on nodes N1-N5, Mahout, MapReduce, Pig/Hive, a NoSQL database) for Big Data BI and Data Science, with Storm added as the real-time ingestion path.]
High Volume Ingestion Project Overview
• The equity trading arm of a large US bank needed to scale its infrastructure to parse and process trade data in real time and to calculate aggregations and statistics:
~1 million messages/second, ~12 billion messages/day, ~240 billion messages/month
• The solution needed to map the raw data to a data model in memory or at low latency (for real time), while persisting the mapped data to disk (for end of day).
• The proposed solution also needed to handle ad-hoc data requests for analytics.
The Data
• Primarily FIX messages: Financial Information eXchange
• Established in the early '90s as a standard for trade data communication; widely used throughout the industry
• Essentially a delimited list of variable attribute-value pairs
• Looks something like this:
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 |
11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 |
44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 |
10=128 |
• A single trade can comprise thousands of such messages, although a typical trade has about a dozen
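Parsing these messages is straightforward. Below is a minimal, illustrative parser for the pipe-delimited form shown above (real FIX uses the SOH character, \x01, as its field delimiter, and the tag-name mapping here covers only a few well-known tags from the public FIX dictionary):

```python
# Minimal parser for the pipe-delimited FIX sample shown above.
# Real FIX messages use SOH (\x01) as the delimiter, not '|'.

# A few well-known tag numbers from the FIX dictionary
TAG_NAMES = {"8": "BeginString", "35": "MsgType", "52": "SendingTime",
             "55": "Symbol", "38": "OrderQty", "44": "Price"}

def parse_fix(message, delimiter="|"):
    """Split a FIX message into a {tag: value} dict, skipping empty fields."""
    fields = {}
    for field in message.split(delimiter):
        field = field.strip()
        if not field:
            continue
        tag, _, value = field.partition("=")
        fields[tag.strip()] = value.strip()
    return fields

msg = "8=FIX.4.2 | 9=178 | 35=8 | 55=MSFT | 38=15 | 44=15 | 10=128 |"
parsed = parse_fix(msg)
readable = {TAG_NAMES.get(t, t): v for t, v in parsed.items()}
print(readable["Symbol"])  # MSFT
```

Because every field is just tag=value, unknown tags pass through untouched, which is exactly the property that makes this data easy to land in a sparse store later on.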
Additional Requirements
• Linearly scalable
• Highly available: no single point of failure, quick recovery
• Quick time to benefit
• Processing guarantees: NO DATA IS LOST!
Some Sample Analytic Use Cases
• Sum(Notional volume) by Ticker: daily, hourly, by minute
• Average trade latency (Execution TS – Order TS)
• Wash sales (a sell within x seconds of the last buy) for the same Client/Ticker
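The wash-sale check maps naturally to streaming logic: remember the most recent buy per (client, ticker) and flag any sell that arrives within the window. A toy sketch in plain Python (not Storm; the field names and the 30-second window are illustrative):

```python
# Toy wash-sale detector: flag a SELL that follows a BUY for the same
# (client, ticker) within `window` seconds. Field names are illustrative.
last_buy = {}  # (client, ticker) -> timestamp of most recent buy

def check_trade(client, ticker, side, ts, window=30):
    key = (client, ticker)
    if side == "BUY":
        last_buy[key] = ts
        return False
    # SELL: it's a wash sale if a buy occurred within the window
    return key in last_buy and (ts - last_buy[key]) <= window

print(check_trade("ClientA", "MSFT", "BUY", 100))   # False
print(check_trade("ClientA", "MSFT", "SELL", 120))  # True (within 30s)
print(check_trade("ClientA", "MSFT", "SELL", 500))  # False
```

The per-key state is tiny, which is why this kind of rule can run in-stream at ingest time rather than as an end-of-day batch query.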
High Volume Real-time Analytics - Solution Architecture
[Architecture diagram: external data and application log data stream transaction detail through a messaging layer into Storm for ingest and computation; Storm pushes aggregates to MQ, feeding an aggregation framework and real-time dashboards, and persists data to a NoSQL database (Cassandra) for higher-latency analytics and day-end processing.]
A little deeper…
[Diagram: sensor data flows through Kafka into a Storm cluster, which emits atomic data and aggregates; event monitors and d3.js analytics consume the low-latency aggregates, while a Hadoop cluster holds atomic data for higher-latency analysis.]
• The Kafka messaging system is used for ingestion
• Storm is used for real-time ETL and outputs the atomic data and derived data needed for analytics
• Redis is used as a reference data lookup cache
• Real-time analytics are produced from the aggregated data
• Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive
What is Storm?
• Distributed event processor
• Real-time data ingestion and dissemination
• In-stream ETL
• Reliably processes unbounded streams of data
• Storm is fast: clocked at over a million tuples per second per node
• It is scalable and fault-tolerant, and it guarantees your data will be processed
• Preferred technology for real-time big data processing by organizations worldwide:
• Partial list at https://github.com/nathanmarz/storm/wiki/Powered-By
• Apache Incubator proposal: http://wiki.apache.org/incubator/StormProposal
Components of Storm
• Spout – Collects data from upstream feeds and submits
it for processing
• Tuple – A collection of data that is passed within Storm
• Bolt – Processes tuples (Transformations)
• Stream – Identifies outputs from Spouts/Bolts
• Storm usually outputs to a NoSQL database
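To make the flow concrete, here is a toy simulation of the spout → bolt pipeline in plain Python. This is not the Storm API (Storm topologies are typically written in Java or via multi-lang adapters); it only mimics the shape described above: a spout emits tuples, bolts transform them, and the stream is the sequence passing between them, with the final result headed for a NoSQL store.

```python
# Toy spout/bolt pipeline mimicking Storm's dataflow shape (NOT the Storm API).

def spout(raw_messages):
    """Spout: turn upstream feed records into tuples (here, plain dicts)."""
    for raw in raw_messages:
        ticker, qty = raw.split(",")
        yield {"ticker": ticker, "qty": int(qty)}

def enrich_bolt(stream):
    """Bolt: transform each tuple (tag it with an illustrative notional value)."""
    for t in stream:
        t["notional"] = t["qty"] * 10  # stand-in for a price lookup
        yield t

def aggregate_bolt(stream):
    """Bolt: running sum of notional per ticker (what we'd persist to NoSQL)."""
    totals = {}
    for t in stream:
        totals[t["ticker"]] = totals.get(t["ticker"], 0) + t["notional"]
    return totals

feed = ["MSFT,15", "MSFT,10", "AAPL,5"]
totals = aggregate_bolt(enrich_bolt(spout(feed)))
print(totals)  # {'MSFT': 250, 'AAPL': 50}
```

In real Storm, each stage would run on many workers in parallel and the framework would handle tuple acking and replay; the composition of generators here is only the logical picture.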
Why NoSQL?
• Performance:
• Relational databases have a lot of features and overhead that we don't need in many cases (although we will miss some…)
• Scalability:
• Most relational databases scale vertically, which limits how large they can get; federation and sharding are awkward, manual processes
• Agile
• Sparse data / data with a lot of variation
• Most NoSQL databases scale horizontally on commodity hardware
What is Cassandra?
• Column families are the equivalent of a table in an RDBMS
• The primary unit of storage is the column; columns are stored contiguously
• Skinny rows: most like a relational database, except that columns are optional and are not stored if omitted
• Wide rows: rows can be billions of columns wide; used for time series, relationships, and secondary indexes
Deeper Dive: Cassandra as an Analytic Database
• Based on a blend of Dynamo and BigTable
• Distributed and masterless
• Super fast writes: can ingest lots of data!
• Very fast reads
Why did we choose it?
• Data throughput requirements
• High availability
• Simple expansion
• Interesting data models for time series data (more on this later)
Design Practices
• Cassandra does not support aggregation or joins, so the data model must be tuned to usage
• Denormalize your data (flatten your primary dimensional attributes into your fact)
• Storing the same data redundantly is OK
It might sound weird, but we've been doing this all along in the traditional world: modeling our data to make analytic queries simple!
Wide rows are our friends
• Cassandra composite columns are powerful for analytic models
• They facilitate multi-dimensional analysis
• A wide-row table may have any number of rows and a variable number of columns (millions of columns)
• And now with CQL3 we have “unpacked” wide rows into named columns: easy to work with!

            20130101  20130102  20130103  20130104  20130105  20130106  …
ClientA        10003      9493     43143     45553     54553     34343  …
ClientB        45453     34313     54543     23233      4233     34423  …
ClientC         3323     35313     43123     54543     43433      4343  …
More about wide rows!
• The left-most column is the ROW KEY
• It is the mechanism by which the row is distributed across the Cassandra cluster
• Care must be taken to prevent hot spots: dates, for example, are generally not good candidates, because all load will go to a given set of servers on a particular day!
• Data can be filtered with equality and “in” clauses
• The top row is the COLUMN KEY
• There can be a variable number of columns
• It is acceptable to have millions or even billions of columns in a table
• Column keys are sorted and can accept range queries (greater than / less than)

            20130101  20130102  20130103  20130104  20130105  20130106  …
ClientA        10003      9493     43143     45553     54553     34343  …
ClientB        45453     34313     54543     23233      4233     34423  …
ClientC         3323     35313     43123     54543     43433      4343  …

CREATE TABLE Client_Daily_Summary (
  Client text,
  Date_ID int,
  Trade_Count int,
  PRIMARY KEY (Client, Date_ID))
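The hot-spot warning follows directly from how row keys are placed: the key is hashed, and the hash picks the owning nodes, so every row keyed on today's date lands on the same servers. A simplified sketch of that placement (real Cassandra uses Murmur3 tokens on a ring, not MD5 modulo a node count):

```python
# Simplified partition placement: hash the row key, pick a node.
# Real Cassandra uses Murmur3 tokens on a ring; this is only illustrative.
import hashlib

def owning_node(row_key, num_nodes=5):
    """Map a row key to one of num_nodes nodes via its hash."""
    digest = hashlib.md5(row_key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Date as the row key: every write on a given day targets the SAME node
print({owning_node("20130104") for _ in range(3)})  # one node -> hot spot

# Client as the row key: writes spread across the cluster
print({owning_node(c) for c in ("ClientA", "ClientB", "ClientC")})
```

This is why Client works as the row key in Client_Daily_Summary while the date is pushed down into the sorted column key, where it can still serve range queries.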
Traditional Cassandra Analytic Model
If we wanted to track trade counts by day and by hour, we could stream our ETL to two (or more) summary fact tables:

                      0900   1000   1100   1200   1300   1400
ClientA|20131101      1000    949   4314   4555   5455   3434
ClientA|20131102      4545   3431   5454   2323    423   3442
ClientB|20131101       332   3531   4312   5454   4343    434

            20130101  20130102  20130103  20130104  20130105  20130106
ClientA        10003      9493     43143     45553     54553     34343
ClientB        45453     34313     54543     23233      4233     34423
ClientC         3323     35313     43123     54543     43433      4343

Sample analytic query: give me daily trade counts for ClientA between Jan 1 and Jan 3:

SELECT Date_ID, Trade_Count FROM Client_Daily_Summary
WHERE Client = 'ClientA' AND Date_ID >= 20130101 AND Date_ID <= 20130103

Sample analytic query: give me hourly trade counts for ClientA on Nov 1 between 9 and 11 AM:

SELECT Hour, Trade_Count FROM Client_Hourly_Summary
WHERE Client_Date = 'ClientA|20131101' AND Hour >= 900 AND Hour <= 1100
Storing the Atomic data
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 |
20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING |
59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |
• We must land all atomic data, for:
• Persistence
• Future replay (new metrics, corrections)
• Drill-down capability / auditability
• The sparse nature of FIX data fits the Cassandra data model very well
• We store only the tags actually present in the data, saving space; a few approaches are possible depending on usage pattern:

CREATE TABLE Trades_Skinny (
  OrderID text PRIMARY KEY,
  Date_ID int,
  Ticker text,
  Client text,
  …many more columns)
CREATE INDEX ix_Date_ID ON Trades_Skinny (Date_ID)

CREATE TABLE Trades_Map (
  OrderID text PRIMARY KEY,
  Date_ID int,
  Ticker text,
  Client text,
  Tags map<text, text>)
CREATE INDEX ix_Date_ID ON Trades_Map (Date_ID)

CREATE TABLE Trades_Wide (
  Order_ID text,
  Tag text,
  Value text,
  PRIMARY KEY (Order_ID, Tag))
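The Trades_Map approach stores one entry per tag actually on the wire instead of a column slot for every possible FIX tag. A toy Python sketch of that sparse layout (the field values are illustrative, and a plain dict stands in for the Cassandra map column):

```python
# Sketch of the Trades_Map idea: per order, keep only the tags that
# actually appeared in the message (sparse), not every possible FIX tag.
trades = {}  # order_id -> {"date_id": ..., "tags": {tag: value}}

def store_trade(order_id, date_id, raw_tags):
    """Persist a trade, dropping tags that were absent from the message."""
    trades[order_id] = {
        "date_id": date_id,
        "tags": {t: v for t, v in raw_tags.items() if v is not None},
    }

store_trade("ATOMNOCCC9990900", 20071123,
            {"55": "MSFT", "38": "15", "44": "15", "58": None})
print(trades["ATOMNOCCC9990900"]["tags"])  # only the 3 present tags survive
```

The Trades_Wide variant spreads the same {tag: value} pairs across clustered rows instead of a map column, which trades per-order locality for the ability to range-scan tags.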
Closing Thought
• The days of staying committed to the discipline of a single database technology (relational) are behind us.
• Polyglot Persistence: “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.”
-- Martin Fowler
Recommended Reading http://lambda-architecture.net
Thank You
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta
www.casertaconcepts.com


Editor's Notes
• #11: Alternative NoSQL options: HBase, Cassandra, Druid, VoltDB