Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra

Businesses are generating and ingesting an unprecedented volume of structured and unstructured data to be analyzed. What is needed is a scalable Big Data infrastructure that can process and parse extremely high volumes in real time and calculate aggregations and statistics. Banking trade data, where volumes can exceed billions of messages a day, is a perfect example.

Firms are fast approaching 'the wall' of relational database scalability. Rather than imposing relational structure on analytics data, they must map raw trade data to a data model at low latency, persist the mapped data to disk, and handle ad-hoc requests for data analytics.

Joe introduces NoSQL databases, describing how they can scale far beyond relational databases while maintaining performance, and shares a real-world case study that details the architecture and technologies needed to ingest high-volume data for real-time analytics.

For more information, visit www.casertaconcepts.com

  • Alternative NoSQL: HBase, Cassandra, Druid, VoltDB

    1. 1. Low-Latency Analytics with NoSQL Joe Caserta June 18, 2014 Storm / Cassandra
    2. 2. Quick Intro - Joe Caserta Launched Big Data practice Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley) Dedicated to Data Warehousing, Business Intelligence since 1996 Began consulting database programming and data modeling 25+ years hands-on experience building database solutions Founded Caserta Concepts in NYC Web log analytics solution published in Intelligent Enterprise Formalized Alliances / Partnerships – System Integrators Partnered with Big Data vendors Cloudera, HortonWorks, Datameer, more… Launched Training practice, teaching data concepts world-wide Laser focus on extending Data Warehouses with Big Data solutions 1986 2004 1996 2009 2001 2010 2013 Launched Big Data Warehousing Meetup in NYC – 950+ Members 2012 Established best practices for big data ecosystem implementation Listed among the Top 20 Data Analytics Consulting Companies - CIO Review
    3. 3. Expertise & Offerings: Strategic Roadmap / Assessment / Education / Implementation; Data Warehousing / ETL / Data Integration; BI / Visualization / Analytics; Big Data Analytics
    4. 4. Client Portfolio: Finance, Healthcare & Insurance, Retail/eCommerce & Manufacturing, Education & Services
    5. 5. Listed as one of the 20 Most Promising Data Analytics Consulting Companies CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones that are at the forefront of tackling the real analytics challenges. A distinguished panel comprising CEOs, CIOs, VCs, industry analysts and the editorial board of CIOReview selected the Final 20. Caserta Concepts
    6. 6. Why is Data Analytics so important? [Architecture diagram: Sales, Marketing, and Finance sources feed an Enterprise Data Warehouse via ETL for traditional BI with canned and ad-hoc reporting, alongside a horizontally scalable Big Data cluster optimized for analytics (HDFS across nodes N1-N5, MapReduce, Pig/Hive, Mahout, NoSQL databases, and others) supporting Big Data analytics and Data Science.]
    7. 7. Challenges With Big Data (Volume, Variety, Velocity, Veracity) • Data is coming in so fast, how do we monitor it? • Real real-time analytics • Relevance engines, financial fraud sensors, early warning sensors • Dealing with sparse, incomplete, volatile, and highly manufactured data • Agile to adapt quickly to changing business • Wider breadth of datasets and sources in scope requires larger data repositories • Most of the world’s data is unstructured, semi-structured or multi-structured • Data volume is growing so processes must be more reliant on programmatic administration • Less people/process dependence
    8. 8. What’s Important Today (according to Joe) Hadoop Distribution: Cloudera, MapR, Hortonworks, Pivotal-HD Tools: Mahout (machine learning), Hive (map data to structures and use SQL-like queries), Pig (data transformation language for big data, from Yahoo), Storm (real-time ETL) NoSQL: Document: MongoDB, CouchDB; Graph: Neo4j, Titan; Key Value: Riak, Redis; Columnar: Cassandra, HBase Languages: SQL, Python, SciPy, Java Predictive Modeling: R, SAS, SPSS
    9. 9. Why talk about Storm & Cassandra? [Architecture diagram: ERP, Finance, and Legacy sources feed a traditional EDW via ETL for traditional BI and ad-hoc/canned reporting, while Storm streams data into a horizontally scalable Big Data cluster optimized for analytics (HDFS across nodes N1-N5, MapReduce, Pig/Hive, Mahout, a NoSQL database) supporting Data Science, Big Data BI, and Data Analytics.]
    10. 10. High Volume Ingestion Project Overview • The equity trading arm of a large US bank needed to scale its infrastructure to process/parse trade data in real time and calculate aggregations/statistics: ~1 million messages/second, ~12 billion messages/day, ~240 billion/month • The solution needed to map the raw data to a data model in memory or at low latency (for real-time), while persisting mapped data to disk (for end of day). • The proposed solution also needed to handle ad-hoc data requests for data analytics.
    11. 11. The Data • Primarily FIX messages: Financial Information eXchange • Established in the early 90's as a standard for trade data communication and widely used throughout the industry • Basically a delimited file of variable attribute-value pairs • Looks something like this: 8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 | • A single trade can comprise thousands of such messages, although typical trades have about a dozen
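    Since each message is just delimited tag=value pairs, mapping one into a key-value structure is simple to sketch. The snippet below is a minimal illustration in plain Java, not the project's actual parser; the class name is made up, and the pipe delimiter stands in for FIX's real SOH separator.

        import java.util.LinkedHashMap;
        import java.util.Map;
        import java.util.regex.Pattern;

        public class FixParser {

            // Split a delimited FIX message into a tag -> value map.
            public static Map<String, String> parse(String message, String delimiter) {
                Map<String, String> tags = new LinkedHashMap<>();
                for (String field : message.split(Pattern.quote(delimiter))) {
                    String f = field.trim();
                    int eq = f.indexOf('=');
                    if (eq > 0) {
                        tags.put(f.substring(0, eq), f.substring(eq + 1));
                    }
                }
                return tags;
            }

            public static void main(String[] args) {
                String msg = "8=FIX.4.2 | 35=8 | 55=MSFT | 38=15 | 44=15 | 10=128 |";
                Map<String, String> tags = parse(msg, "|");
                System.out.println(tags.get("55"));   // tag 55 (Symbol) -> MSFT
            }
        }

    Because only the tags present in a given message end up in the map, the sparse nature of FIX carries straight through to the storage models discussed later.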
    12. 12. Additional Requirements • Linearly scalable • Highly available: no single point of failure, quick recovery • Quicker time to benefit • Processing guarantees: NO DATA IS LOST!
    13. 13. Some Sample Analytic Use Cases • Sum(Notional volume) by Ticker: Daily, Hourly, Minute • Average trade latency (Execution TS – Order TS) • Wash Sales (sell within x seconds of last buy) for same Client/Ticker
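    As an illustration of how such a rule can be evaluated in-stream, the wash-sale check only needs to remember the last buy timestamp per client/ticker. The sketch below is hypothetical; the class name, the BUY/SELL encoding, and the x-second window are placeholders, not the production logic.

        import java.util.HashMap;
        import java.util.Map;

        public class WashSaleDetector {

            private final long windowMillis;                                 // the "x seconds" window
            private final Map<String, Long> lastBuyTime = new HashMap<>();   // keyed by client|ticker

            public WashSaleDetector(long windowMillis) {
                this.windowMillis = windowMillis;
            }

            // Returns true when a sell arrives within the window of the last buy
            // for the same client/ticker combination.
            public boolean onTrade(String client, String ticker, String side, long timestampMillis) {
                String key = client + "|" + ticker;
                if ("BUY".equals(side)) {
                    lastBuyTime.put(key, timestampMillis);
                    return false;
                }
                Long lastBuy = lastBuyTime.get(key);
                return lastBuy != null && (timestampMillis - lastBuy) <= windowMillis;
            }
        }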
    14. 14. High Volume Real-time Analytics - Solution Architecture [Diagram: application log data and external data flow through a messaging layer into Storm for ingest and computation; an aggregation framework pushes aggregates to MQ for real-time dashboards, while transaction detail is streamed to a NoSQL database (Cassandra) for higher-latency analytics and day-end processing.]
    15. 15. A little deeper… [Diagram: sensor data and event monitors feed a Storm cluster; atomic data and aggregates flow to a Hadoop cluster and to low-latency analytics rendered with d3.js.] • The Kafka messaging system is used for ingestion • Storm is used for real-time ETL and outputs atomic data and derived data needed for analytics • Redis is used as a reference data lookup cache • Real-time analytics are produced from the aggregated data • Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive
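    For the Redis reference-data lookup mentioned above, the enrichment step reads from a simple key-value cache rather than a relational store. A minimal sketch using the Jedis client; the host, port, and key layout are assumptions, not the actual cache design.

        import redis.clients.jedis.Jedis;

        public class ReferenceDataCache {

            private final Jedis jedis;

            public ReferenceDataCache(String host, int port) {
                this.jedis = new Jedis(host, port);
            }

            // Look up a cached client attribute by id; returns null on a cache miss.
            public String lookupClientName(String clientId) {
                return jedis.get("client:" + clientId + ":name");
            }

            public void close() {
                jedis.close();
            }
        }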
    16. 16. What is Storm • Distributed Event Processor • Real-time data ingestion and dissemination • In-Stream ETL • Reliably process unbounded streams of data • Storm is fast: Clocked it at over a million tuples per second per node • It is scalable, fault-tolerant, guarantees your data will be processed • Preferred technology for real-time big data processing by organizations worldwide: • Partial list at https://github.com/nathanmarz/storm/wiki/Powered-By • Incubator: • http://wiki.apache.org/incubator/StormProposal
    17. 17. Components of Storm • Spout – Collects data from upstream feeds and submits it for processing • Tuple – A collection of data that is passed within Storm • Bolt – Processes tuples (Transformations) • Stream – Identifies outputs from Spouts/Bolts • Storm usually outputs to a NoSQL database
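    Putting those components together, a topology wires a spout to one or more bolts. The sketch below is illustrative only: the Kafka topic, Zookeeper host, parallelism hints, and emitted fields are assumptions, it targets the pre-Apache backtype.storm / storm-kafka APIs current at the time, and it reuses the hypothetical FixParser helper sketched earlier.

        import java.util.Map;

        import backtype.storm.Config;
        import backtype.storm.LocalCluster;
        import backtype.storm.spout.SchemeAsMultiScheme;
        import backtype.storm.task.OutputCollector;
        import backtype.storm.task.TopologyContext;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.TopologyBuilder;
        import backtype.storm.topology.base.BaseRichBolt;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Tuple;
        import backtype.storm.tuple.Values;
        import storm.kafka.KafkaSpout;
        import storm.kafka.SpoutConfig;
        import storm.kafka.StringScheme;
        import storm.kafka.ZkHosts;

        public class TradeTopology {

            // Bolt: receives one raw FIX message per tuple and emits (orderId, symbol, side).
            public static class FixParseBolt extends BaseRichBolt {
                private OutputCollector collector;

                @Override
                public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                    this.collector = collector;
                }

                @Override
                public void execute(Tuple tuple) {
                    Map<String, String> tags = FixParser.parse(tuple.getString(0), "|");
                    // Anchor the emit to the input tuple and ack it: this is what backs
                    // the "no data is lost" processing guarantee.
                    collector.emit(tuple, new Values(tags.get("11"), tags.get("55"), tags.get("54")));
                    collector.ack(tuple);
                }

                @Override
                public void declareOutputFields(OutputFieldsDeclarer declarer) {
                    declarer.declare(new Fields("orderId", "symbol", "side"));
                }
            }

            public static void main(String[] args) {
                // Spout: reads raw FIX strings from a Kafka topic (names are illustrative).
                SpoutConfig spoutConfig = new SpoutConfig(
                        new ZkHosts("zk1:2181"), "fix-trades", "/kafka-spout", "trade-ingest");
                spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

                TopologyBuilder builder = new TopologyBuilder();
                builder.setSpout("fix-spout", new KafkaSpout(spoutConfig), 4);
                builder.setBolt("fix-parse", new FixParseBolt(), 8).shuffleGrouping("fix-spout");

                Config conf = new Config();
                conf.setNumWorkers(2);
                new LocalCluster().submitTopology("trade-ingest", conf, builder.createTopology());
            }
        }

    Downstream bolts (aggregation and the Cassandra writers) would attach to "fix-parse" in the same way.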
    18. 18. Why NoSQL? • Performance: • Relational databases have a lot of features and overhead that we don’t need in many cases. Although we will miss some… • Scalability: • Most relational databases scale vertically, which limits how large they can get; federation and sharding are awkward, manual processes. • Agile • Sparse data / data with a lot of variation • Most NoSQL databases scale horizontally on commodity hardware
    19. 19. What is Cassandra? • Column families are the equivalent of a table in an RDBMS • The primary unit of storage is a column; columns are stored contiguously • Skinny rows: most like a relational database, except columns are optional and not stored if omitted • Wide rows: rows can be billions of columns wide, used for time series, relationships, secondary indexes
    20. 20. Deeper Dive: Cassandra as an Analytic Database • Based on a blend of Dynamo and BigTable • Distributed, master-less • Super fast writes: can ingest lots of data! • Very fast reads Why did we choose it: • Data throughput requirements • High availability • Simple expansion • Interesting data models for time series data (more on this later)
    21. 21. Design Practices • Cassandra does not support aggregation or joins, so the data model must be tuned to usage • Denormalize your data (flatten your primary dimensional attributes into your fact) • Storing the same data redundantly is OK. It might sound weird, but we've been doing this all along in the traditional world, modeling our data to make analytic queries simple!
    22. 22. Wide rows are our friends • Cassandra composite columns are powerful for analytic models • Facilitate multi-dimensional analysis • A wide row table may have any number of rows, and a variable number of columns (millions of columns) • And now with CQL3 we have “unpacked” wide rows into named columns, which are easy to work with! 20130101 20130102 20130103 20130104 20130104 20130105 … ClientA 10003 9493 43143 45553 54553 34343 … ClientB 45453 34313 54543 23233 4233 34423 … ClientC 3323 35313 43123 54543 43433 4343 … … … … … … … .. …
    23. 23. More about wide rows! • The left-most column is the ROW KEY • It is the mechanism by which the row is distributed across the Cassandra cluster… • Care must be taken to prevent hot spots: dates, for example, are not generally good candidates because all load will go to a given set of servers on a particular day! • Data can be filtered using equality and “in” clauses • The top row is the COLUMN KEY • There can be a variable number of columns • It is acceptable to have millions or even billions of columns in a table • Column keys are sorted and can accept a range query (greater than / less than) 20130101 20130102 20130103 20130104 20130104 20130105 … ClientA 10003 9493 43143 45553 54553 34343 … ClientB 45453 34313 54543 23233 4233 34423 … ClientC 3323 35313 43123 54543 43433 4343 … … … … … … … .. … Create table Client_Daily_Summary ( Client text, Date_ID int, Trade_Count int, Primary key (Client, Date_ID))
    24. 24. Traditional Cassandra Analytic Model If we wanted to track trade counts by day and by hour, we could stream our ETL to two (or more) summary fact tables 0900 1000 1100 1200 1300 1400 ClientA|20131101 1000 949 4314 4555 5455 3434 ClientA|20131102 4545 3431 5454 2323 423 3442 ClientB|20131101 332 3531 4312 5454 4343 434 20130101 20130102 20130103 20130104 20130104 20130105 ClientA 10003 9493 43143 45553 54553 34343 ClientB 45453 34313 54543 23233 4233 34423 ClientC 3323 35313 43123 54543 43433 4343 Sample analytic query: Give me daily trade counts for ClientA between Jan 1 and Jan 3: Select Date_ID, Trade_Count from Client_Daily_Summary where Client='ClientA' and Date_ID>=20130101 and Date_ID<=20130103 Sample analytic query: Give me hourly trade counts for ClientA for Nov 1 between 9 and 11 AM: Select Hour, Trade_Count from Client_Hourly_Summary where Client_Date='ClientA|20131101' and Hour >= 900 and Hour <= 1100
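    From application code, the same daily query maps directly onto a driver call. A hedged sketch using the DataStax Java driver of that era; the contact point and keyspace name are assumptions, while the table and columns follow the slide's Client_Daily_Summary schema.

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.PreparedStatement;
        import com.datastax.driver.core.ResultSet;
        import com.datastax.driver.core.Row;
        import com.datastax.driver.core.Session;

        public class DailySummaryQuery {
            public static void main(String[] args) {
                Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                Session session = cluster.connect("trading");   // keyspace name is illustrative

                // Equality on the partition key (client), range on the clustering key (date_id):
                // exactly the access pattern the wide-row model is built for.
                PreparedStatement stmt = session.prepare(
                        "SELECT date_id, trade_count FROM client_daily_summary "
                      + "WHERE client = ? AND date_id >= ? AND date_id <= ?");
                ResultSet rows = session.execute(stmt.bind("ClientA", 20130101, 20130103));

                for (Row row : rows) {
                    System.out.println(row.getInt("date_id") + " -> " + row.getInt("trade_count"));
                }
                cluster.close();
            }
        }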
    25. 25. Storing the Atomic data 8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 | • We must land all atomic data: • Persistence • Future replay (new metrics, corrections) • Drill-down capabilities/auditability • The sparse nature of the FIX data fits the Cassandra data model very well. • We store only the tags actually present in the data, saving space; there are a few approaches depending on the usage pattern: Create table Trades_Skinny( OrderID text PRIMARY KEY, Date_ID int, Ticker text, Client text, …Many more columns) Create index ix_Skinny_Date_ID on Trades_Skinny (Date_ID) Create table Trades_Map( OrderID text PRIMARY KEY, Date_ID int, Ticker text, Client text, Tags map<text, text>) Create index ix_Map_Date_ID on Trades_Map (Date_ID) Create table Trades_Wide( Order_ID text, Tag text, Value text, Primary key (Order_ID, Tag))
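    With the Trades_Map layout, persisting an atomic record is then just a matter of binding the parsed FIX tag map into the Tags column. A sketch with the DataStax Java driver; the keyspace name and the handful of tags shown are illustrative, and in practice the full parsed map from the ingestion path would be bound instead.

        import java.util.LinkedHashMap;
        import java.util.Map;

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.PreparedStatement;
        import com.datastax.driver.core.Session;

        public class AtomicTradeWriter {
            public static void main(String[] args) {
                Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                Session session = cluster.connect("trading");   // keyspace name is illustrative

                PreparedStatement insert = session.prepare(
                        "INSERT INTO trades_map (orderid, date_id, ticker, client, tags) "
                      + "VALUES (?, ?, ?, ?, ?)");

                // Only the FIX tags actually present in the message are stored,
                // which is what makes the sparse format a good fit for this model.
                Map<String, String> tags = new LinkedHashMap<>();
                tags.put("55", "MSFT");   // Symbol
                tags.put("38", "15");     // OrderQty
                tags.put("44", "15");     // Price

                session.execute(insert.bind("ATOMNOCCC9990900", 20071123, "MSFT", "ClientA", tags));
                cluster.close();
            }
        }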
    26. 26. Closing Thought • The days of staying committed to the discipline of a single database technology (relational) are behind us. • Polyglot Persistence: “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” -- Martin Fowler
    27. 27. Recommended Reading http://lambda-architecture.net
    28. 28. Thank You Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta www.casertaconcepts.com
