Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cassandra

Caserta Concepts' implementation team presented a solution that performs big data analytics on active trade data in real time. They covered the core components: Storm for real-time ingest, Cassandra (a NoSQL database), and others. For more information on future events, see http://www.casertaconcepts.com/.

Speaker note: Alternative NoSQL options: HBase, Cassandra, Druid, VoltDB
1. Big Data Warehousing Meetup
   December 10, 2013
   Real-time Trade Data Monitoring with Storm & Cassandra
2. Agenda
   7:00       Networking - grab a slice of pizza and a drink
   7:15       Welcome & Intro - Joe Caserta, President, Caserta Concepts; author, The Data Warehouse ETL Toolkit
              (About the Meetup and about Caserta Concepts)
   7:30       Cassandra - Elliott Cordo, Chief Architect, Caserta Concepts
   8:00       Storm - Noel Vega, Consultant, Caserta Concepts; Consultant, Dimension Data, LLC
   8:30–9:00  Q&A / More Networking
3. About the BDW Meetup
   • Big Data is a complex, rapidly changing landscape
   • We want to share our stories and hear about yours
   • Great networking opportunity for like-minded data nerds
   • Opportunities to collaborate on exciting projects
   • Founded by Caserta Concepts: Big Data Analytics, DW & BI consulting
   • Next BDW Meetup: January 20
4. About Caserta Concepts
   Focused expertise:
   • Big Data Analytics
   • Data Warehousing
   • Business Intelligence
   • Strategic Data Ecosystems
   Industries served:
   • Financial Services
   • Healthcare / Insurance
   • Retail / eCommerce
   • Digital Media / Marketing
   • K-12 / Higher Education
   Founded in 2001
   • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
5. Caserta Concepts
   Listed as one of the 20 Most Promising Data Analytics Consulting Companies.
   CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones at the forefront of tackling real analytics challenges. A distinguished panel comprising CEOs, CIOs, VCs, industry analysts and the CIOReview editorial board selected the final 20.
6. Expertise & Offerings
   • Strategic Roadmap / Assessment / Consulting / Implementation
   • Big Data Analytics
   • Data Warehousing / ETL / Data Integration
   • BI / Visualization / Analytics
7. Client Portfolio
   • Finance & Insurance
   • Retail/eCommerce & Manufacturing
   • Education & Services
8. We are hiring
   Does this word cloud excite you? Speak with us about our open positions: jobs@casertaconcepts.com
9. Why talk about Storm & Cassandra?
   Traditional BI: source systems (ERP, Finance, Legacy) feed ETL into a traditional EDW, which serves ad-hoc/canned reporting.
   Big Data BI: a horizontally scalable environment optimized for analytics; a Big Data cluster (HDFS across nodes N1–N5, with MapReduce, Pig/Hive and Mahout), a NoSQL database, and Storm, serving data analytics and data science.
10. What is Storm
    • Distributed event processor
    • Real-time data ingestion and dissemination
    • In-stream ETL
    • Reliably processes unbounded streams of data
    • Storm is fast: clocked at over a million tuples per second per node
    • Scalable, fault-tolerant, and guarantees your data will be processed
    • Preferred technology for real-time big data processing by organizations worldwide
      • Partial list: https://github.com/nathanmarz/storm/wiki/Powered-By
      • Apache Incubator proposal: http://wiki.apache.org/incubator/StormProposal
11. Components of Storm
    • Spout - collects data from upstream feeds and submits it for processing
    • Tuple - a collection of data that is passed within Storm
    • Bolt - processes tuples (transformations)
    • Stream - identifies outputs from spouts/bolts
    • Storm usually outputs to a NoSQL database (see the wiring sketch below)
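To make the spout/tuple/bolt/stream vocabulary concrete, here is a minimal wiring sketch using the 2013-era backtype.storm API that the later slides' code is written against. The classes FixMessageSpout, FixParseBolt and TradeAggregateBolt are hypothetical placeholders, not components from the deck:

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class TradeTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Spout: pulls raw FIX messages from an upstream feed, 3 tasks.
            builder.setSpout("fixSpout", new FixMessageSpout(), 3);

            // Bolt: parses raw messages; any task may take any tuple.
            builder.setBolt("parseBolt", new FixParseBolt(), 4)
                   .shuffleGrouping("fixSpout");

            // Bolt: aggregates trades. fieldsGrouping on "client" (a field the
            // hypothetical parse bolt declares) routes all tuples for a given
            // client to the same bolt task.
            builder.setBolt("aggBolt", new TradeAggregateBolt(), 6)
                   .fieldsGrouping("parseBolt", new Fields("client"));

            Config conf = new Config();
            conf.setNumWorkers(2);

            // Local test cluster; a real deployment would use StormSubmitter.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("trade-monitor", conf, builder.createTopology());
        }
    }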
12. Why NoSQL?
    • Performance: relational databases have a lot of features, and overhead, that we don't need in many cases (although we will miss some)
    • Scalability: most relational databases scale vertically, which limits how large they can get; federation and sharding are awkward, manual processes
    • Agile
    • Sparse data / data with a lot of variation
    • Most NoSQL databases scale horizontally on commodity hardware
13. What is Cassandra?
    • Column families are the equivalent of a table in an RDBMS
    • The primary unit of storage is a column; columns are stored contiguously
    • Skinny rows: most like a relational database, except columns are optional and not stored if omitted
    • Wide rows: rows can be billions of columns wide; used for time series, relationships, secondary indexes
14. REAL TIME TRADE DATA MONITORING
    Elliott Cordo, Chief Architect, Caserta Concepts
15. The Use Case
    • Trade data (orders and executions)
    • High volume of incoming data
      • 500 thousand records per second
      • 12 billion messages per day
    • Data must be aggregated and monitored in real time (end-to-end latency measured in hundreds of milliseconds)
    • Both raw messages and analytics are stored and persisted to a database
16. The Data
    • Primarily FIX messages (Financial Information Exchange)
    • Established in the early 90's as a standard for trade data communication; widely used throughout the industry
    • Basically a delimited file of variable attribute-value pairs
    • Looks something like this:

      8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |

    • A single trade can comprise thousands of such messages, although a typical trade has about a dozen
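Because each message is just pipe-delimited tag=value pairs, an in-stream ETL bolt can tokenize it into a map before enrichment. A minimal sketch, assuming the " | " delimiter shown above; FixParser and parseFix are illustrative names, not code from the deck:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class FixParser {
        // Parse a pipe-delimited FIX string into a tag -> value map.
        public static Map<String, String> parseFix(String msg) {
            Map<String, String> tags = new LinkedHashMap<String, String>();
            for (String field : msg.split("\\|")) {
                field = field.trim();
                if (field.isEmpty()) continue;
                int eq = field.indexOf('=');   // split only on the first '='
                if (eq < 0) continue;          // skip malformed fields
                tags.put(field.substring(0, eq), field.substring(eq + 1));
            }
            return tags;
        }

        public static void main(String[] args) {
            Map<String, String> t = parseFix("8=FIX.4.2 | 35=8 | 55=MSFT | 54=1 | 38=15 |");
            System.out.println(t.get("55"));   // MSFT (FIX tag 55 = Symbol)
        }
    }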
17. Additional Requirements
    • Linearly scalable
    • Highly available: no single point of failure, quick recovery
    • Quicker time to benefit
    • Processing guarantees: NO DATA IS LOST!
18. Some Sample Analytic Use Cases
    • Sum(notional volume) by ticker: daily, hourly, by minute
    • Average trade latency (execution timestamp - order timestamp)
    • Wash sales (a sell within x seconds of the last buy) for the same client/ticker
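As an illustration of the wash-sale case, a bolt holding per-client/ticker state could flag candidates like this. A hedged sketch under assumed inputs (client, ticker, side, timestamp); the class name and the 5-second window are hypothetical, not values from the deck:

    import java.util.HashMap;
    import java.util.Map;

    public class WashSaleDetector {
        private static final long WASH_WINDOW_MS = 5000;   // hypothetical x = 5 seconds

        // Last buy time per "client|ticker" key, kept in memory by this task.
        private final Map<String, Long> lastBuyMs = new HashMap<String, Long>();

        // Returns true when a SELL arrives within the window of the last BUY.
        public boolean onTrade(String client, String ticker, String side, long tsMs) {
            String key = client + "|" + ticker;
            if ("BUY".equals(side)) {
                lastBuyMs.put(key, tsMs);
                return false;
            }
            Long buy = lastBuyMs.get(key);
            return buy != null && (tsMs - buy) <= WASH_WINDOW_MS;
        }
    }

Note this state is exactly the kind of in-JVM aggregate that the continuity-of-service discussion later in the deck is about preserving.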
19. How has this system traditionally been handled?
    • Typically by manually partitioning the application: a number of independent systems and databases "dividing" the problem behind a message queue
      Use Case 1: Partition A -> Database A
      Use Case 1: Partition B -> Database B
      Use Case 2: All Partitions -> Database C
    • Main issues:
      • Growth requires changing these systems to accept the new partitioning scheme: development!
      • Many different applications replicating a complex architecture, with tons of boilerplate code
      • Performing analysis across the partitioning schemes is very difficult
20. Need to Establish a Platform as a Service
    Architecture (diagram): sensor data -> Redis queue -> Storm cluster -> atomic data and aggregates -> d3.js analytics, event monitors, low-latency analytics
    • A Redis queue is used for ingestion
    • Storm is used for real-time ETL and outputs both atomic data and the derived data needed for analytics
    • Redis is also used as a reference-data lookup cache and for state
    • Real-time analytics are produced from the aggregated data
    • Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive
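A sketch of what the ingestion edge might look like: a spout draining the Redis queue with the Jedis client. This is an assumption-laden illustration (queue name "fix-inbound", host/port, a single output field), not the implementation the team presented; ack/fail handling is sketched later alongside the summary slide:

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import redis.clients.jedis.Jedis;
    import java.util.List;
    import java.util.Map;

    public class RedisQueueSpout extends BaseRichSpout {
        protected transient Jedis jedis;
        protected transient SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.jedis = new Jedis("localhost", 6379);   // hypothetical host/port
        }

        public void nextTuple() {
            // BLPOP blocks up to 1 second waiting for the next raw FIX message.
            List<String> popped = jedis.blpop(1, "fix-inbound");
            if (popped != null && popped.size() == 2) {
                String msg = popped.get(1);              // [0]=key, [1]=value
                collector.emit(new Values(msg), msg.hashCode());  // message id enables replay
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("fixMessage"));
        }
    }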
21. Deeper Dive: Cassandra as an Analytic Database
    • Based on a blend of Dynamo and BigTable
    • Distributed, master-less
    • Super fast writes: can ingest lots of data!
    • Very fast reads
    Why we chose it:
    • Data throughput requirements
    • High availability
    • Simple expansion
    • Interesting data models for time-series data (more on this later)
22. Design Practices
    • Cassandra does not support aggregation or joins: the data model must be tuned to usage
    • Denormalize your data (flatten your primary dimensional attributes into your fact)
    • Storing the same data redundantly is OK
    Might sound weird, but we've been doing this all along in the traditional world: modeling our data to make analytic queries simple!
23. Wide rows are our friends
    • Cassandra composite columns are powerful for analytic models
    • Facilitate multi-dimensional analysis
    • A wide-row table may have N rows and a variable number of columns (even millions of columns)

                20130101  20130102  20130103  20130104  20130104  20130105  ...
      ClientA   10003     9493      43143     45553     54553     34343     ...
      ClientB   45453     34313     54543     23233     4233      34423     ...
      ClientC   3323      35313     43123     54543     43433     4343      ...
      ...       ...       ...       ...       ...       ...       ...       ...

    • And now with CQL3 we have "unpacked" wide rows into named columns: easy to work with!
24. More about wide rows!
    (Same Client x Date_ID table as the previous slide.)
    • The left-most column is the ROW KEY
      • It is the mechanism by which the row is distributed across the Cassandra cluster
      • Care must be taken to prevent hot spots: dates, for example, are generally not good candidates, because all load will go to a given set of servers on a particular day!
      • Data can be filtered using equality and "in" clauses
    • The top row is the COLUMN KEY
      • There can be a variable number of columns; it is acceptable to have millions or even billions of columns in a table
      • Column keys are sorted and can accept a range query (greater than / less than)

      Create table Client_Daily_Summary (
        Client text,
        Date_ID int,
        Trade_Count int,
        Primary key (Client, Date_ID))
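One common way to avoid the date hot spot while keeping time range-scannable is the compound "Client|YYYYMMDD" row key used on the next slide. A tiny illustrative helper, assuming that formatting convention (the class and method names are hypothetical):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class RowKeys {
        // Build a "ClientA|20131101" style row key: partitioning by client spreads
        // one day's writes across the cluster instead of hot-spotting the
        // partition that owns "today".
        public static String clientDayKey(String client, Date tradeDate) {
            return client + "|" + new SimpleDateFormat("yyyyMMdd").format(tradeDate);
        }

        public static void main(String[] args) {
            System.out.println(clientDayKey("ClientA", new Date()));
        }
    }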
25. Traditional Cassandra Analytic Model
    If we wanted to track trade count by day and hour, we could stream our ETL to two (or more) summary fact tables.

    Daily (Client_Daily_Summary):
                20130101  20130102  20130103  20130104  20130104  20130105
      ClientA   10003     9493      43143     45553     54553     34343
      ClientB   45453     34313     54543     23233     4233      34423
      ClientC   3323      35313     43123     54543     43433     4343

    Sample analytic query: give me daily trade counts for ClientA between Jan 1 and Jan 3:

      Select Date_ID, Trade_Count from Client_Daily_Summary
      where Client = 'ClientA' and Date_ID >= 20130101 and Date_ID <= 20130103

    Hourly (Client_Hourly_Summary):
                          0900  1000  1100  1200  1300  1400
      ClientA|20131101    1000  949   4314  4555  5455  3434
      ClientA|20131102    4545  3431  5454  2323  423   3442
      ClientB|20131101    332   3531  4312  5454  4343  434

    Sample analytic query: give me hourly trade counts for ClientA for Jan 1 between 9 and 11 AM:

      Select Hour, Trade_Count from Client_Hourly_Summary
      where Client_Date = 'ClientA|20131101' and Hour >= 900 and Hour <= 1100
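Fanning one execution event out to both summaries could look like the following driver-based sketch. Loudly hedged: it assumes Trade_Count were declared as a Cassandra counter column (the slide's DDL uses a plain int, which would instead need read-modify-write), uses the DataStax Java driver, and invents a "trading" keyspace; it illustrates the pattern, not the presented code:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SummaryWriter {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("trading");   // hypothetical keyspace

            // One incoming execution for ClientA on 2013-11-01 in the 09:00 hour.
            String client = "ClientA";
            int dateId = 20131101, hour = 900;

            // Same event, two summary fact tables (assumes counter columns).
            // Production code would use prepared statements, not concatenation.
            session.execute(
                "UPDATE Client_Daily_Summary SET Trade_Count = Trade_Count + 1 " +
                "WHERE Client = '" + client + "' AND Date_ID = " + dateId);
            session.execute(
                "UPDATE Client_Hourly_Summary SET Trade_Count = Trade_Count + 1 " +
                "WHERE Client_Date = '" + client + "|" + dateId + "' AND Hour = " + hour);

            cluster.close();
        }
    }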
26. But there are other methods too
    • Assuming some level of client-side aggregation (and additive measures), we could also further unpack and leverage column keys using CQL3. A slightly different use case:

      Create table Client_Ticker_Summary (
        Client text,
        Date_ID int,
        Ticker text,
        Trade_Count int,
        Notional_Volume float,
        Primary Key (Client, Date_ID, Ticker))

    • The first column in the PK definition is the row key, a.k.a. the partition key
    • Look at all this flexible SQL goodness:

      select * from Client_Ticker_Summary where Client in ('ClientA','ClientB')

      select * from Client_Ticker_Summary where Client in ('ClientA','ClientB')
        and Date_ID >= 20130101 and Date_ID <= 20130103

      select * from Client_Ticker_Summary where Client = 'ClientA'
        and Date_ID >= 20130101 and Date_ID <= 20130103

      select * from Client_Ticker_Summary where Client = 'ClientA'
        and Date_ID = 20130101 and Ticker in ('APPL','GE','PG')

    • ALSO possible, but not recommended:

      select * from Client_Ticker_Summary where Date_ID > 20120101 allow filtering;

      select * from Client_Ticker_Summary where Date_ID = 20120101
        and Ticker in ('APPL','GE') allow filtering;
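From Java, one of these reads might look like this with the DataStax driver. A hedged sketch: the "trading" keyspace and contact point are assumptions, and the column accessors rely on the driver's case-insensitive name matching for unquoted identifiers:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class TickerSummaryReader {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("trading");   // hypothetical keyspace

            // Partition key (Client) pinned, range scan on clustering column Date_ID.
            ResultSet rs = session.execute(
                "select * from Client_Ticker_Summary " +
                "where Client = 'ClientA' and Date_ID >= 20130101 and Date_ID <= 20130103");

            for (Row row : rs) {
                System.out.printf("%s %d %s: %d trades%n",
                    row.getString("Client"), row.getInt("Date_ID"),
                    row.getString("Ticker"), row.getInt("Trade_Count"));
            }
            cluster.close();
        }
    }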
27. Storing the Atomic data

      8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |

    • We must land all atomic data, for:
      • Persistence
      • Future replay (new metrics, corrections)
      • Drill-down capability / auditability
    • The sparse nature of FIX data fits the Cassandra data model very well
    • We store only the tags actually present in the data, saving space; a few approaches, depending on usage pattern:

      Create table Trades_Skinny (
        OrderID text primary key,
        Date_ID int,
        Ticker text,
        Client text,
        ... many more columns)
      Create index ix_Date_ID on Trades_Skinny (Date_ID)

      Create table Trades_Wide (
        Order_ID text,
        Tag text,
        Value text,
        Primary key (Order_ID, Tag))

      Create table Trades_Map (
        OrderID text primary key,
        Date_ID int,
        Ticker text,
        Client text,
        Tags map<text, text>)
      Create index ix_Date_ID on Trades_Map (Date_ID)
28. Big data solutions usually employ multiple DB types
    Some considerations:
    • Sizing requirements:
      • Volume: a disk-space requirement
      • Velocity: a message-rate requirement
    • Data-structure & query-pattern complexity: simple K/V pairs -vs- relational -vs- ...
    • C.A.P. theorem alignment: which two does your use case benefit from?
    • Value-add features:
      • API (interface: e.g. HTTP REST -vs- client classes; power: e.g. mget, incrementBy)
      • Replication and/or H/A support (B.C./D.R.)
      • Support for data-processing patterns (e.g. Riak has Map/Reduce; Redis zSets give Top-N)
      • Transaction support (Redis: MULTI, command list, EXEC)
      • and so on
29. Contact
    Elliott Cordo
    Principal Consultant, Caserta Concepts
    P: (855) 755-2246 x267
    E: elliott@casertaconcepts.com
    info@casertaconcepts.com | 1 (855) 755-2246 | www.casertaconcepts.com
30. DEEP-DIVE INTO STORM TOPOLOGY
    Noel Milton Vega
    Consultant, Dimension Data, LLC; Consultant, Caserta Concepts
31. Practical Deep Dive: Continuity-of-Service across Storm failures
    An approach to making topologies more resilient to task failure.
    • Tasks in Storm are the units that do the actual work
    • Tasks can individually fail due to:
      • Resource starvation (OOM, CPU)
      • Unhandled exceptions
      • Timeouts (such as waiting for I/O)
      • and so on
    • Tasks also fail because parent executors, workers or supervisors fail
    • Nimbus will spawn a replacement task, but in the context of continuity-of-service, is that enough? Answer: No. But maybe we can work around that.
    • My "storm-user" Google group question: http://bit.ly/1bsBooT
32. Storyboard: Continuity-of-Service
    ACME Check Deposit Corp (H.Q.): three tellers, each with the same two-step duty.
      Teller for clients [A-I]: Step 1, deposit checks; Step 2, update checkbook balance
      Teller for clients [J-R]: Step 1, deposit checks; Step 2, update checkbook balance
      Teller for clients [S-Z]: Step 1, deposit checks; Step 2, update checkbook balance
    Blue (the [A-I] teller):
    • Deposits a check for an [A-I] client and is given a deposit receipt for it (Step 1)
    • Before he is able to journal the receipt to the check register, he quits (Step 2)
    1) ACME H.Q. notices that [A-I] checks aren't being processed. Should the workload be redistributed? No! (exception policy)
    2) Policy consequence: there is no difference before & after the event, so context has to be remembered:
      • The new hire's role is check depositor for ACME (not a plumber for sub-company FOOBAR)
      • Their specific ACME role is to deposit checks for clients [A-I]
      • The role did have state: there is an aggregate check register, and an incomplete transaction
33. Storyboard: Continuity-of-Service
    Why this example? It has the operational requirements of real-world use cases:
    • Distributed model (where processors are autonomous), suitable for Big Data
    • Specific failure/recovery requirements:
      • Incomplete transactions are completed
      • Aggregated state is remembered
    • Behavior persistence: same behavior before & after an exception event (stickiness)
34. Modeling this use-case story in Storm
    Blue maps to an acmeBolt task (fields-grouped on clients [A-I]); his state maps to Java objects in the JVM associated with that acmeBolt task.
    • Deposits a batch of checks for clients [A-I] and is given a deposit receipt for them (Step 1)
    • Before he is able to journal the receipt to the check register, he quits (Step 2)
    1) ACME H.Q. notices that [A-I] checks aren't being processed. Should the workload be redistributed? No! (by policy)
    2) Policy consequence: there is no difference before & after the event, so context has to be remembered:
      • The role is check depositor for ACME (not a plumber for sister-company FOO)
      • The specific ACME role is to deposit checks for clients [A-I]
      • The role did have state: there is an aggregate check register, and an incomplete transaction
35. Modeling this use-case story in Storm
    http://bit.ly/1bsBooT
36. What does Storm remember across task fail/restarts? (if anything)
    http://bit.ly/1bsBooT
    (Diagram: three supervisor nodes, each running workers with executors and tasks t0, t1, t2; one executor on supervisor node 1-of-3 has failed, marked X.)
    • What is Storm's grouping/re-grouping policy?
    • Will replacement tasks use the same identifier?
37. Programmatically, what we're asking is this ...
    http://bit.ly/1bsBooT
    (Same three-supervisor diagram as the previous slide, with one failed executor.)

      // ===============================
      // Constructor.
      // ===============================
      public bolt01(Properties properties) {
      }

      // ===============================
      // prepare() method: is identification remembered here?
      // ===============================
      public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
      }

      // ===============================
      // execute() method: is grouping remembered here? (i.e. the redistribution policy)
      // ===============================
      public void execute(Tuple inTuple) {
      }
38. Lab behavior observations show Storm does remember ...
    http://bit.ly/1bsBooT
    (Diagram: a component's task pointers taskPntr1 .. taskPntrN map to task indexes 0 .. N-1.)

      componentID = context.getThisComponentId();  // Defined in the topology class, e.g. bolt01
      taskID      = context.getThisTaskId();       // An integer in [1..N], where N = number of tasks, topology-wide
      taskIndex   = context.getThisTaskIndex();    // An integer in [0..N-1], where N = number of tasks, component-wide

      fqid = componentID + ".0" + Integer.toString(taskIndex);  // Ex: bolt02.05; spout01.03; bolt01.00
39. Quick digression ...
40. Lab tests show Storm does remember, but what's missing?
    http://bit.ly/1bsBooT
    In lab tests we observed the following behaviors in Storm:
    • It preserves the FQID (e.g. bolt01.02) before & after task failures. IDENTITY PERSISTENCE!
    • Tasks with a given FQID will receive the same grouping of data throughout the life of a topology. (Analogy: the new hire will be an ACME check depositor for clients [A-I].)
    And yet, something is still missing: while Storm can replay unprocessed tuples that timed out during the fail/restart period, it can't regenerate in-memory (in-JVM) aggregated state. What to do?
41. REDIS to the rescue :: Continuity-of-Service
    Since we observed the following behaviors in Storm:
    • It preserves the FQID (e.g. bolt01.02) before & after task failures. IDENTITY PERSISTENCE!
    • Tasks with a given FQID will receive the same grouping of data throughout the life of a topology.
    ... we can use the FQID as a stable Redis key prefix and persist each task's state externally.
42. REDIS to the rescue :: Continuity-of-Service
    The FQID is maintained across task fail/restarts (i.e. for the lifetime of the topology), and so is tuple grouping/partitioning. So each task parks its state in Redis under its FQID and recovers it in prepare():

      // ===============================
      // prepare() method
      // ===============================
      public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
          [ ... snip ... ]
          this.componentID = context.getThisComponentId();  // e.g. bolt01; spout03
          this.taskIndex   = context.getThisTaskIndex();    // [0..N-1], where N = number of component tasks
          this.fqid = componentID + ".0" + Integer.toString(this.taskIndex);  // bolt01.04; spout03.00
          this.redisKeyPrefix = this.fqid;  // Use your unique Fully Qualified I.D. as a Redis key prefix.

          // Establish a connection to Redis [not shown], and recover lost data structures, if any.
          this.hashMap = this.jedisClient.hgetAll(this.redisKeyPrefix + "-myMap");  // e.g. bolt01.01-myMap
      }

      // ===============================
      // execute() method
      // ===============================
      public void execute(Tuple inTuple) {
          [ ... snip ... ]
          String customer = inTuple.getString(0);
          String prev = this.hashMap.get(customer);  // Recovered, as necessary, in prepare().
          double balance = (prev == null ? 0.0 : Double.parseDouble(prev))
                         + Double.parseDouble(inTuple.getString(1));
          this.hashMap.put(customer, String.valueOf(balance));
          this.jedisClient.hset(this.redisKeyPrefix + "-myMap", customer, String.valueOf(balance));  // Write-through to Redis.
      }
43. Summary :: Storm / Redis and Continuity-of-Service
    (Diagram: a Redis master at host:6379 with a local read-only slave. spout01 tasks [spout01.00 - spout01.05] drain KEYs dataSourceQueue01 and dataSourceQueue02, and track in-flight tuples in a hash, KEY spout01.tupleAckHash {tupleGUID -> Tuple}. Fields grouping within the stream, based on field 1 of the tuple, feeds bolt01 tasks [bolt01.00 - bolt01.02] and bolt02 tasks [bolt02.00 - bolt02.02], each persisting per-task state under KEYs such as bolt01.02-dataStruct1 .. bolt01.02-dataStructN and bolt02.00-dataStruct1 .. bolt02.00-dataStructN. Note: taskIndex -vs- taskID.)
    Redis data structures used:
    • Strings (byte arrays)
    • Lists (two-way queue, as a linked list)
    • Sets
    • Hashes
    • Sorted sets (hashes with sorted values)
    • Serialize/deserialize objects as JSON
    • Other in-memory solutions exist, e.g. MemSQL.
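The spout01.tupleAckHash idea in the diagram, parking emitted tuples in Redis until they are acked, could look like the following minimal sketch. It builds on the hypothetical RedisQueueSpout shown earlier (whose jedis and collector fields are assumed visible here); the hash key would in practice be derived from the task's FQID as on slide 42:

    import backtype.storm.tuple.Values;
    import java.util.UUID;

    public class AckTrackingSpout extends RedisQueueSpout {
        // Per task this would be fqid + ".tupleAckHash"; fixed here for brevity.
        private final String ackHashKey = "spout01.tupleAckHash";

        protected void emitTracked(String rawMessage) {
            String guid = UUID.randomUUID().toString();
            jedis.hset(ackHashKey, guid, rawMessage);      // park the tuple until acked
            collector.emit(new Values(rawMessage), guid);  // GUID doubles as message id
        }

        @Override
        public void ack(Object msgId) {
            jedis.hdel(ackHashKey, (String) msgId);        // fully processed: forget it
        }

        @Override
        public void fail(Object msgId) {
            String raw = jedis.hget(ackHashKey, (String) msgId);
            if (raw != null) {
                collector.emit(new Values(raw), msgId);    // replay the parked tuple
            }
        }
    }

A restarted task with the same FQID can also scan its ack hash in prepare()/open() and re-emit anything left over, which is what closes the continuity-of-service gap described on slide 40.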
44. Noel Milton Vega
    Consultant, Dimension Data, LLC; Consultant, Caserta Concepts
    P: (212) 699-2660
    E1: noel@casertaconcepts.com
    E2: nmvega@didata.us
    info@casertaconcepts.com | 1 (855) 755-2246 | www.casertaconcepts.com
45. Q&A / THANK YOU
    501 Fifth Ave, 17th Floor
    New York, NY 10017
    1-855-755-2246
    info@casertaconcepts.com
