C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, Now & Future by Ameet Chaubal and Fausto Inestroza
The presentation aims to highlight the challenges posed by large-scale and near real-time data processing problems. In the past, such problems were solved using conventional technologies, primarily a database and a JMS queue. However, these solutions had their limits and presented serious problems in terms of scale and redundancy. The new breed of products, à la Cassandra and Kafka, being innately distributed in their design, aims to tackle such challenges in a very elegant manner. The presentation will showcase some industry use cases of this genre and describe the solutions, which have been increasing in their sophistication.

Presentation Transcript

  • Cassandra & Next Generation Analysis: Cassandra for a high-velocity data ingestion and real-time analysis system. Ameet Chaubal & Fausto Inestroza
  • Presentation Route
      –  Describe the conventional technology solution
      –  Highlight deficiencies
      –  Showcase the new solution implemented using Cassandra
      –  Lay out the architecture with improvements
  • Business Case
      –  Capture messages from a high-volume e-Commerce site
      –  Store them in a database
      –  Perform near real-time queries for troubleshooting
      –  Perform deeper analysis, à la BI
  • Olden Days… [architecture diagram: eCommerce Website → JMS Queue → Transient Storage (RDBMS) → Data Warehouse → Analysis]
  • Business Case, Details…
      –  Messages: 5,000 msg/sec, ~250 million/day; message size: 1 KB
      –  [diagram: eCommerce Website → JMS Queue (decouple UI from storage; multiple sinks) → Transient Storage (RDBMS; dedicated storage for triage) → Data Warehouse (data analysis, business intelligence)]
  • What’s the problem? [diagram: Site I and Site II, each with a JMS queue and transient storage, batch-loaded into the data warehouse]
      –  Queue replication problems; message loss
      –  Other applications affected in case of failover
      –  Triage data isolated; no universal view
      –  Data consolidation adds delay
      –  Inability to keep up with increasing messages
      –  Analysis always lagging the action
      –  No low-latency queries
  • Problems Recap
      –  Over 5,000 msg/sec: high write speed needed
      –  Extraction & load very slow: ETL from transient storage to the data warehouse takes over 4 hours
      –  Analysis always lags events by hours: ETL performed in batches 4 hours apart
      –  No high availability: no geo-redundancy for transient storage
      –  Data stored in disparate buckets: no universal view of data for "triage" applications/troubleshooting
      –  No dashboard: no low-latency queries
      –  No immediate alerts or pattern detection: no real-time analysis
  • Solution Blueprint [architecture diagram, steps A1–A12: the online e-Commerce application writes events to a JMS queue; a JMS publisher and replication consumers fetch from the queue; Hector/Java clients 1..n write through a load-balancer VIP and Thrift connection pool into Cassandra; Cassandra + Hadoop serves Map/Reduce, Hive queries/BI, and a real-time dashboard]
  • Role of the Data Model: before we get there, what features are missing from Cassandra in comparison to a traditional RDBMS?
  • Shortcomings… Opportunities
      –  No joins across column families
      –  No analytical functions such as sum, count…
      –  Difficulty in constructing "WHERE" clause predicates across composite columns
      –  Inability to order a range of keys with the RandomPartitioner
  • Importance of the Data Model in Cassandra
      –  In lieu of JOINs, "smart" de-normalization techniques are crucial
      –  Need to use the "FEATURES" of Cassandra to effectively model the business rules and business data
      –  "Client" or "application" code becomes extremely important
      –  "APPLICATION" + "DATABASE" => full package
  • Features of Cassandra Modeling
      –  "WIDE" column family: organize data in a "horizontal" as opposed to a "vertical" fashion as in an RDBMS
      –  Automatic sorting of columns: important to model the data in columns as opposed to rows
      –  Fast access to all columns of a row key: all columns of a row key are stored on one server => fast iteration/aggregations
      –  Useful info in the column name: groundbreaking from an RDBMS perspective; enables more information to be packed; the column as an entity becomes more powerful
      –  Composite column names: column names can be composites, made up of multiple components, and auto-sorting still works
  • Data Model: wide rows with sharding; row key = "<min>|<part#>"
      –  Role of the partition #: each row is stored by a single server, and with 5,000 × 60 = 300,000 events per minute, that would put a large one-minute load on a single server
      –  The "partition" contraption aims to break up this huge row, remove hotspots, and spread the load across possibly all servers
      –  The # of partitions is some multiple of the # of servers
      –  A finite # of partitions still keeps the row key meaningful, i.e. we can construct the keys for a certain minute and fetch records for them
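The sharded row-key scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the hash function, the `NUM_PARTITIONS` value, and sharding on a user id are all assumptions; the slides only specify the `"<min>|<part#>"` key shape.

```python
import hashlib

NUM_PARTITIONS = 12  # assumption: some multiple of the number of Cassandra nodes


def row_key(minute: str, user_id: str) -> str:
    """Build the sharded wide-row key "<min>|<part#>".

    The partition number comes from a stable hash of the user id, so one
    minute's ~300,000 events are spread over NUM_PARTITIONS rows (and thus
    over servers) instead of hammering a single row.
    """
    part = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS
    return f"{minute}|{part}"


def keys_for_minute(minute: str) -> list[str]:
    """Because the partition count is finite, a reader can enumerate every
    possible key for a minute and fetch all of its records."""
    return [f"{minute}|{p}" for p in range(NUM_PARTITIONS)]
```

This is why the slide stresses a *finite* number of partitions: `keys_for_minute` only works because readers can reconstruct the full key set without a lookup.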
  • Composite Columns
      –  The actual message is stored as part of a composite column
      –  Variable-granularity grouping; minute: row key based on the minute

        Min_partition (TEXT)     DC:TimeUUID:UserID:Message (composite)   …
        2012-07-18-08-13-p-1     Status                                   …
        …                        …
        2012-07-19-11-21-p-3     Status
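The "auto-sorting still works" point about composite column names can be modeled with Python tuples, which sort component by component just as Cassandra's composite comparator does. The sample column names and values below are invented for illustration.

```python
# Model one wide row: column name = (DC, event_time, UserID) composite,
# column value = the message. Sorting the tuple keys mirrors how Cassandra
# orders composite column names on disk: first by DC, then time, then user.
row = {
    ("DC2", "2012-07-18T08:13:02", "user7"): "msg-a",
    ("DC1", "2012-07-18T08:13:05", "user3"): "msg-b",
    ("DC1", "2012-07-18T08:13:01", "user9"): "msg-c",
}

ordered_names = sorted(row)  # component-wise ordering, "for free"
```

Because the ordering is component-wise, a range scan over one leading component (e.g. all columns for `"DC1"`) reads a contiguous slice, which is what makes packing information into column names useful.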
  • Benefits
  • Geo-Redundancy [diagram: Data Center 1 (RW), Data Center 2 (RW), Data Center 3 (RO), Data Center 4 (RO)]
  • Data Consolidation and Extraction
      –  Single view of data across multiple locations
      –  Data extraction can be performed in parallel
      –  Data extraction process performed on a dedicated cluster of machines
  • Low-Latency & Batch Applications
      –  Triaging: troubleshooting customer issues within 10 minutes of occurrence; feeding a dashboard of live-feed data through aggregations performed in counter CFs
      –  Analysis: analytical and ad hoc queries to eventually replace the need for a remote data warehouse; Map/Reduce via Hive without ETL
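The counter-CF aggregation behind the dashboard can be sketched with an in-memory stand-in. This is only a shape illustration: row key = minute bucket, counter column = event type; the event names are invented, and in Cassandra each call would be a counter-column increment rather than a dict update.

```python
from collections import Counter, defaultdict

# Toy model of a counter column family feeding the live dashboard:
# dashboard[minute_bucket][event_type] plays the role of one counter column.
dashboard: dict[str, Counter] = defaultdict(Counter)


def record_event(minute_bucket: str, event_type: str) -> None:
    # In Cassandra: an idempotent-per-call counter increment on the
    # (minute_bucket row, event_type column). Here: a plain dict bump.
    dashboard[minute_bucket][event_type] += 1


for _ in range(3):
    record_event("2012-07-18-08-13", "checkout")
record_event("2012-07-18-08-13", "login")
```

A dashboard query then reads one row per minute bucket, which is what keeps it low-latency compared to scanning raw events.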
  • Opportunities Remaining
      –  Near real-time pattern detection and response
      –  Message loss in the JMS queue
      –  JMS queue replication
      –  Reducing the impact of queue failover on other applications
  • Further Improvements… HOW?
  • Big Data Platform [Accenture Cloud Platform services: Recommender as a Service, …, Network Analytics Services]
  • Drivers
      –  Consumer devices; video usage
      –  Issues: operational costs, understanding service-quality degradation, inefficient capacity planning
  • [pipeline diagram: INGEST, PROCESS, STORE, ANALYZE, VISUALIZE]
  • WHY STORM?
  • What do we need? Multiple use cases
      –  Scalability (data types, size, velocity)
      –  Reliability (mission-critical data)
      –  Fault-tolerance
      –  Processing, computation, etc. (time series / pattern analysis)
  • How do we get this from Storm?
      –  Scalability: parallelization
      –  Reliability: processing guarantees
      –  Fault-tolerance: robust fail-over strategies
      –  Processing, computation, etc.: low-level primitives
  • PRIMITIVES  
  • Storm primitives [diagram]
      –  Stream: a sequence of tuples carrying request info (IP, user-agent, etc.)
      –  Spout: pulls messages from a distributed queue
      –  Bolt: e.g. sessionization, speed calculation
      –  Topology: e.g. sub-optimal network speed, geospatial analysis
  • Integration with Cassandra
      –  Cassandra: optimal for time-series data; near-linearly scalable; low read/write latency; scales in conjunction with Storm
      –  Custom bolt: uses the Hector API to access Cassandra; creates dynamic columns per request; stores relevant network data
  • Suboptimal Network Speed Topology: an Example
  • [topology diagram] Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra; tuples for a single IP (ip 1) flow through every stage
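The per-session logic named by the topology stages above can be sketched in plain Python. The `Event` fields and the speed threshold are illustrative assumptions; the slides name the stages but not their exact inputs, and the real implementation runs as Storm bolts in Java.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One pre-processed tuple for a session (all events share an IP)."""
    ip: str
    bytes_sent: int
    duration_s: float


SUBOPTIMAL_KB_PER_S = 500.0  # hypothetical cutoff for "sub-optimal" speed


def session_speed(events: list[Event]) -> float:
    """The "Calculate N/W Speed per Session" stage: KB transferred per
    second across all events of one session."""
    total_kb = sum(e.bytes_sent for e in events) / 1024.0
    total_s = sum(e.duration_s for e in events)
    return total_kb / total_s if total_s else 0.0


def is_suboptimal(events: list[Event]) -> bool:
    """The "Identify Sub-Optimal Speed" stage: flag slow sessions before
    the final bolt stores them in Cassandra."""
    return session_speed(events) < SUBOPTIMAL_KB_PER_S
```

In the actual topology each function would be a bolt's `execute` body, with the session grouping guaranteed by routing all tuples for one IP to the same bolt instance.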
  • Parallelism [topology diagram]: Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra, with tuples for ip 1 and ip 2 interleaved and processed concurrently across the stages
  • Branching and Joins [topology diagram]: Stream 1 (Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP) and Stream 2 (Kafka Spout → Speed by Location) meet at a Join bolt, e.g. tuple (ip 1) joined with tuple (NY) into tuple (ip 1/NY), followed by Compare Speed → Store in Cassandra
  • Lessons Learned
      –  Rebalance the topology
      –  Tweak parallelism in bolts
      –  Isolation of topologies
      –  Use TimeUUIDUtils
      –  Log4j level is set to INFO by default
  • Thank You. Q & A