C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, Now & Future by Ameet Chaubal and Fausto Inestroza

The presentation highlights the challenges posed by large-scale, near real-time data processing. In the past, such problems were solved with conventional technologies, primarily a database and a JMS queue. However, these solutions had their limits and presented serious problems in terms of scale and redundancy. A new breed of products such as Cassandra and Kafka, distributed by design, aims to tackle these challenges elegantly. The presentation showcases industry use cases of this kind and describes solutions that have grown steadily more sophisticated.

  1. Cassandra & Next Generation Analysis: Cassandra for a high-velocity data ingestion and real-time analysis system. Ameet Chaubal & Fausto Inestroza
  2. Presentation Route
     • Describe conventional technology solution
     • Highlight deficiencies
     • Showcase new solution implemented using Cassandra
     • Lay out architecture with improvements
  3. Business Case
     • Capture messages from a high-volume e-Commerce site
     • Store them in a database
     • Perform near real-time queries for troubleshooting
     • Perform deeper analysis a la BI
  4. Olden Days … (diagram labels: eCommerce Website, JMS Queue, Transient Storage, RDBMS, Data warehouse, Analysis)
  5. Business Case, Details…
     • Messages: 5000 msg/sec, ~250 million / day
     • Message size: 1 KB
     • Diagram labels: eCommerce Website, JMS Queue, Transient Storage, RDBMS, Data warehouse
     • Diagram annotations: Decouple UI from storage; Multiple sinks; Dedicated storage; Triage; Data Analysis; Business Intelligence
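The figures on slide 5 can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes "1 KB" means 1,000 bytes, and it treats 5,000 msg/sec as a peak rate, which is one reading that reconciles it with the ~250 million/day total (the slide does not say so explicitly).

```python
SECONDS_PER_DAY = 24 * 60 * 60

peak_rate = 5_000              # msg/sec, from the slide
daily_messages = 250_000_000   # msg/day, from the slide
message_bytes = 1_000          # assumption: "1 KB" taken as 1,000 bytes

# Average rate implied by the daily total.
average_rate = daily_messages / SECONDS_PER_DAY

# Daily total if 5,000 msg/sec were sustained around the clock.
sustained_daily = peak_rate * SECONDS_PER_DAY

# Raw payload per day, before replication or storage overhead.
daily_gb = daily_messages * message_bytes / 1e9

print(f"average rate: {average_rate:,.0f} msg/sec")
print(f"sustained 5,000 msg/sec: {sustained_daily:,} msg/day")
print(f"raw payload: {daily_gb:,.0f} GB/day")
```

Under these assumptions the daily total corresponds to an average a little under 3,000 msg/sec, and the raw payload is on the order of 250 GB per day.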
  6. What’s the problem? (diagram labels: SITE I, SITE II, JMS Queue, Transient storage, Batch Load, Data warehouse)
     • Queue replication problems
     • Message loss
     • Other applications affected in case of failover
     • Triage data isolated
     • No universal view
     • Data consolidation adds delay
     • Inability to keep up with increasing messages
     • Analysis always lagging the action
     • No low-latency queries
  7. Problems Recap
     • Over 5000 msg/sec: high write speed
     • Extraction & load very slow: ETL from transient storage to data warehouse takes over 4 hours
     • Analysis always lags events by hours: ETL performed in batches 4 hours apart
     • No high availability: no geo-redundancy for transient storage
     • Data stored in disparate buckets: no universal view of data for “Triage” applications/troubleshooting
     • No dashboard: no low-latency queries
     • No immediate alert, pattern detection: no real-time analysis
  8. Solution Blueprint (diagram labels: Online e-Commerce Application, Event JMS, JMS Publisher, Write event to queue, Fetch from queue, Consumers, Hector / Java Client -1, Hector / Java Client -2, Hector / Java Client -n, Thrift Connection Pool, Load Balancer VIP, Cassandra, Replication, Cassandra + Hadoop, Map/Reduce, Hive Queries/BI, Real-Time Dashboard; step markers A1 to A10 and A12)
  9. Role of Data Model: Before we get there, what features are missing from Cassandra in comparison to a traditional RDBMS?
  10. Shortcomings… Opportunities
     • No joins across Column Families
     • No analytical functions such as sum, count…
     • Difficulty in constructing “WHERE” clause predicates across composite columns
     • Inability to order a range of keys with RandomPartitioner
  11. Importance of Data Model - Cassandra
     • In lieu of JOINS, “smart” de-normalization techniques are crucial.
     • Need to use “FEATURES” of Cassandra to effectively model the business rules and business data
     • “Client” or “Application” code becomes extremely important.
     • “APPLICATION” + “DATABASE” => Full Package
  12. Features of Cassandra Modeling
     • “WIDE” Column Family
       – Organize data in “horizontal” as opposed to “vertical” fashion as in RDBMS
     • Automatic sorting of columns
       – Important to “MODEL” the data in “COLUMNS” as opposed to rows.
     • Faster access to ALL COLUMNS of a row key
       – All columns of a row key stored on ONE server => fast iteration/aggregations
     • Useful info in “COLUMN NAME”
       – Groundbreaking from an RDBMS perspective
       – Enables “MORE” “INFORMATION” to be PACKED
       – “COLUMN” as entity becomes “MORE POWERFUL”.
     • COMPOSITE column NAMES:
       – Column names can be COMPOSITES, made up of multiple columns
       – Auto sorting still works
  13. Data Model: wide rows with sharding
     • Row Key = “<min>|<part#>”
     • Role of partition #:
       – Each row is stored by a single server, and with 5,000 x 60 = 300,000 events per minute, that would put a large load on a single server for a minute.
       – A “partition” contraption aims to “break” this huge row, remove hotspots and spread the load to possibly all servers.
       – The # of partitions is some multiple of the # of servers.
       – A finite # of partitions still keeps the row key meaningful, i.e. we can construct the keys for a certain minute and fetch records for them.
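The sharded row key from slide 13 can be illustrated with a small sketch. It is not the authors' code: the partition count, the random choice of partition on write, and the function names are all assumptions made for illustration, and the minute format is borrowed from the example keys on the next slide.

```python
import random
from datetime import datetime

NUM_PARTITIONS = 12  # hypothetical: some multiple of the number of servers


def minute_bucket(ts: datetime) -> str:
    """Truncate a timestamp to minute granularity."""
    return ts.strftime("%Y-%m-%d-%H-%M")


def row_key(ts: datetime, partition: int) -> str:
    """Build a "<min>|<part#>" row key."""
    return f"{minute_bucket(ts)}|{partition}"


def write_key(ts: datetime, rng=random) -> str:
    """Pick a partition at random so one minute's writes spread over many rows."""
    return row_key(ts, rng.randrange(NUM_PARTITIONS))


def read_keys(ts: datetime) -> list:
    """Because the partition count is finite, every key for a minute can be rebuilt."""
    return [row_key(ts, p) for p in range(NUM_PARTITIONS)]
```

A reader who wants all events for a given minute calls `read_keys` and fetches those rows in one multi-key request, which is what keeps the key "meaningful" despite the sharding.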
  14. Composite Columns
     • Composite columns:
       – Actual message stored as part of composite column
     • Variable granularity grouping
       – Minute: row key based on minute

       Min_partition (TEXT)    | DC:TimeUUID:UserID:Message (Composite) | …
       2012-07-18-08-13-p-1    | Status                                 | …
       …                       |                                        |
       2012-07-19-11-21-p-3    | Status                                 |
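To make the composite column idea concrete, the sketch below models one wide row as a mapping from a composite name (DC, time, user ID, message) to a status value, and shows how automatic sorting enables a range slice. It is an in-memory illustration only: integers stand in for TimeUUIDs, and the sample values and the `column_slice` helper are hypothetical.

```python
from bisect import bisect_left

# One wide row: composite column name -> column value ("Status").
# Name components: (dc, event_time, user_id, message).
# event_time is an integer stand-in for a TimeUUID (assumption for this sketch).
row = {
    ("DC2", 1005, "u7", "login"): "ok",
    ("DC1", 1001, "u1", "login"): "ok",
    ("DC1", 1009, "u3", "checkout"): "failed",
    ("DC1", 1003, "u2", "add-to-cart"): "ok",
}

# Columns are kept sorted by name, component by component.
sorted_names = sorted(row)


def column_slice(names, dc, start, end):
    """Return the column names for one DC with start <= event_time <= end."""
    lo = bisect_left(names, (dc, start))
    hi = bisect_left(names, (dc, end + 1))
    return names[lo:hi]


for name in column_slice(sorted_names, "DC1", 1002, 1009):
    print(name, "->", row[name])
```

Because the data centre is the leading component and time the second, a contiguous slice answers "all events from DC1 in this time window" without scanning the rest of the row.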
  15. Benefits
  16. Geo-Redundancy (diagram labels: Data Center 1 (RW), Data Center 2 (RW), Data Center 3 (RO), Data Center 4 (RO))
  17. Data Consolidation and Extraction
     • Single view of data across multiple locations
     • Data extraction can be performed in parallel
     • Data extraction process performed in a dedicated cluster of machines
  18. Low-Latency & Batch Applications
     • Triaging
       – Troubleshooting customer issues within 10 minutes of occurrence
       – Feeding a dashboard of live feed data through aggregations performed in Counter CFs
     • Analysis
       – Analytical and ad hoc queries to eventually replace the need for a remote data warehouse
       – Map/Reduce via Hive without ETL
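The dashboard aggregation mentioned on slide 18 can be pictured as per-minute counters that are incremented on every event and read back as a time series. The sketch below uses an in-memory `Counter` as a stand-in for a counter column family; the key layout and function names are assumptions, not the deck's schema.

```python
from collections import Counter

# Stand-in for a counter column family: (minute, event_type) -> count.
counters = Counter()


def record(minute: str, event_type: str) -> None:
    """Increment the counter for this minute and event type."""
    counters[(minute, event_type)] += 1


def dashboard_series(event_type: str, minutes: list) -> list:
    """Read the per-minute counts a live dashboard would plot."""
    return [counters[(m, event_type)] for m in minutes]


record("2012-07-18-08-13", "error")
record("2012-07-18-08-13", "error")
record("2012-07-18-08-14", "error")
print(dashboard_series("error", ["2012-07-18-08-13", "2012-07-18-08-14", "2012-07-18-08-15"]))
```

The point of the design is that the dashboard reads a handful of small counters instead of re-scanning the raw events for each refresh.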
  19. Opportunities Remaining
     • Near real-time pattern detection and response
     • Message loss in JMS queue
     • JMS queue replication
     • Reducing the impact of queue failover on other applications
  20. Further Improvements… HOW ???
  21. Accenture Cloud Platform: Recommender as a Service, …, Network Analytics Services; Big Data Platform
  22. Drivers and Issues
     • Drivers: consumer devices, video usage
     • Issues: operational costs; understanding service quality degradation; inefficient capacity planning
  23. INGEST, PROCESS, VISUALIZE, ANALYZE, STORE
  24. WHY STORM?
  25. What do we need? (diagram labels: Scalability; Reliability; Fault-tolerance; Data types, size, velocity; Mission critical data; Processing, computation, etc.; Time series / pattern analysis; Multiple use cases)
  26. How do we get this from Storm? (diagram labels: Processing guarantees; Low-level primitives; Parallelization; Robust fail-over strategies; Scalability; Reliability; Fault-tolerance; Processing, computation, etc.)
  27. PRIMITIVES
  28. Storm primitives
     • Stream (Tuple, Tuple, Tuple): request info (IP, user-agent, etc.)
     • Spout: pull messages from distributed queue
     • Bolt: sessionization, speed calculation
     • Topology: suboptimal network speed, geospatial analysis
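The primitives on slide 28 map onto a simple mental model: a spout emits tuples, bolts transform them, and a topology wires the two together. The sketch below mimics that model with plain Python generators; it does not use the Storm API, and the tuple layout (IP, bytes transferred, elapsed milliseconds) is an assumption made for illustration.

```python
def spout(queue):
    """Spout: pull messages from a queue (here an in-memory list) and emit tuples."""
    for message in queue:
        yield message


def speed_bolt(stream):
    """Bolt: turn (ip, num_bytes, millis) tuples into (ip, speed in kbit/s)."""
    for ip, num_bytes, millis in stream:
        # bits per millisecond is numerically equal to kilobits per second
        yield ip, num_bytes * 8 / millis


def topology(queue):
    """Topology: wire the spout's stream into the bolt and collect the output."""
    return list(speed_bolt(spout(queue)))


print(topology([("10.0.0.1", 1000, 4), ("10.0.0.2", 500, 10)]))
```

In a real Storm deployment each stage would run as many parallel tasks across a cluster; the generator chain only shows the data flow.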
  29. Integration with Cassandra
     • Cassandra
       – Optimal for time series data
       – Near-linearly scalable
       – Low read/write latency
       – Scales in conjunction with Storm
     • Custom Bolt
       – Uses Hector API to access Cassandra
       – Creates dynamic columns per request
       – Stores relevant network data
  30. SUBOPTIMAL NETWORK SPEED TOPOLOGY: AN EXAMPLE
  31. Topology stages: Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra → Cassandra (diagram shows Tuple (ip 1) flowing between stages)
  32. Parallelism: Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra → Cassandra (diagram shows Tuple (ip 1) and Tuple (ip 2) flowing through parallel instances of the stages)
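The parallelism slide shows tuples for two IPs moving through multiple instances of each stage. For per-IP state (such as "Update Speed per IP") to stay correct, all tuples for one IP have to reach the same instance; in Storm this is typically done with a fields grouping, though the slides do not say which grouping the authors used. The sketch below illustrates the idea with a stable hash; the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def task_for(ip: str, num_tasks: int) -> int:
    """Map an IP to a task index with a stable hash (same IP, same task)."""
    return zlib.crc32(ip.encode("utf-8")) % num_tasks


def route(tuples, num_tasks: int) -> dict:
    """Distribute (ip, speed) tuples across tasks, grouped by IP."""
    tasks = defaultdict(list)
    for ip, speed in tuples:
        tasks[task_for(ip, num_tasks)].append((ip, speed))
    return dict(tasks)


stream = [("ip1", 2000.0), ("ip2", 400.0), ("ip1", 1800.0), ("ip2", 350.0)]
for task, items in sorted(route(stream, num_tasks=4).items()):
    print(task, items)
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomises string hashes per process, which would break the "same IP, same task" guarantee across workers.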
  33. Branching and Joins (diagram labels: Stream 1, Stream 2, Kafka Spout (x2), Pre-process, Sessionize, Calculate N/W Speed per Session, Update Speed per IP, Speed by Location, Join, Compare Speed, Store in Cassandra, Cassandra; tuples shown: Tuple (ip 1), Tuple (ip 1/NY), Tuple (NY), Tuple (ip 1/NY))
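The join on slide 33 combines a per-IP speed stream with a per-location speed stream and then compares the two. The sketch below shows one plausible reading: join on location and flag IPs whose speed falls below a fraction of the location's typical speed. The tuple layouts, the `threshold` parameter and its default of 0.5 are assumptions, not values from the deck.

```python
def join_and_compare(ip_speeds, location_speeds, threshold=0.5):
    """Join (ip, location, speed) tuples with a location -> baseline speed map.

    Returns the tuples whose speed is below threshold * baseline, i.e. the
    sub-optimal ones under this sketch's (hypothetical) definition.
    """
    flagged = []
    for ip, location, speed in ip_speeds:
        baseline = location_speeds.get(location)
        if baseline is not None and speed < threshold * baseline:
            flagged.append((ip, location, speed, baseline))
    return flagged


stream_1 = [("ip1", "NY", 400.0), ("ip2", "NY", 1900.0), ("ip3", "SF", 100.0)]
stream_2 = {"NY": 2000.0}  # speed by location; no baseline for SF yet
print(join_and_compare(stream_1, stream_2))
```

Tuples whose location has no baseline yet are skipped rather than flagged; a streaming implementation would need to decide whether to buffer them until the second stream catches up.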
  34. Lessons Learned
     • Rebalance topology
     • Tweak parallelism in bolt
     • Isolation of topologies
     • Use TimeUUIDUtils
     • Log4j level set to INFO by default
  35. Thank You. Q & A
