NASSCOM Big Data and Analytics Summit 2013: Keynote 2: Gurinder Grewal
 

Keynote II: Big Data: Connecting the dots –
Delivering split-second decisions by integrating
offline and online systems.
Gurinder Grewal, Leader of Risk Big Data Platform,
PayPal

    Presentation Transcript

    • BIG DATA – CONNECTING THE DOTS…
      Delivering split-second decisions by integrating offline and online systems
      NASSCOM Big Data & Analytics Summit 2013
      GURINDER S. GREWAL
    • ART OF DECISION MAKING – SPEED & ACCURACY
      • Timeline: 11:01 AM, 11:05 AM, 11:06 AM
      • Credit card used from three distant locations in a short time
        • Result based on real-time analysis: block the card, or not decided?
      • According to past purchasing behavior:
        • Card holder lives in the US; wife paid a bill online from the home PC
        • Card holder's kid studies in Europe; used the card to purchase books
        • Card holder travels to Japan; paid for lunch
        • Result based on historical analysis: it's legitimate usage
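The two-stage reasoning on this slide can be sketched roughly as below. This is only an illustration under stated assumptions, not the speaker's implementation: the `Txn` type, the 10-minute window, the limit of two locations, and the `known_locations` profile are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    minute: int    # minutes since midnight (hypothetical encoding)
    location: str  # coarse location label, e.g. a country or region


def realtime_flag(txns, window_minutes=10, max_locations=2):
    """Real-time rule: flag when more than `max_locations` distinct
    locations appear within any `window_minutes` span."""
    ordered = sorted(txns, key=lambda t: t.minute)
    for i, first in enumerate(ordered):
        seen = {
            t.location
            for t in ordered[i:]
            if t.minute - first.minute <= window_minutes
        }
        if len(seen) > max_locations:
            return True
    return False


def historical_decision(txns, known_locations):
    """Historical rule: the usage is explained if every location matches
    a behavior profile that was built offline (assumed to exist)."""
    if all(t.location in known_locations for t in txns):
        return "legit"
    return "block"


def decide(txns, known_locations):
    # Only suspicious activity needs the slower historical check.
    if not realtime_flag(txns):
        return "legit"
    return historical_decision(txns, known_locations)


# The slide's scenario: 11:01, 11:05 and 11:06 from three locations.
txns = [Txn(661, "US"), Txn(665, "Europe"), Txn(666, "Japan")]
profile = {"US", "Europe", "Japan"}  # hypothetical offline profile
print(decide(txns, profile))
```

The real-time rule alone would block the card; combining it with the offline profile is what lets the decision come out as legitimate usage.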
    • TIERED BIG DATA STRATEGY
      • Real time, e.g. filters (data age: seconds)
      • Near real time, e.g. correlations (data age: hours)
      • Offline, e.g. behavioral analysis (data age: years)
      • Data in motion versus data in use
      • Trade-off: cost and speed against data volume and accuracy
      • effective decision = fn(accuracy, speed, cost)
    • BIG DATA - COMPUTATION STRATEGY
      • Variables: offline variables, online variables
      • Realtime (in-flow processing)
        • fast, very stringent availability and performance SLAs
        • computations are simple and eventually accurate
        • computations are transient, short lived (user sessions)
      • Near Real-time (complex event processing)
        • event-driven, incremental processing
        • high efficiency and scalability
        • data for short time windows (hours)
      • Offline (map-reduce, batch)
        • optimized for throughput
        • computations are slow and accurate
        • data captured as events for historical analysis
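To make the contrast between event-driven, incremental processing over a short time window and a batch recomputation concrete, here is a minimal sketch. The class and function names are hypothetical, and it assumes timestamps (in seconds) arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCount:
    """Incremental count of events seen in the last `window_seconds`.
    Each event is handled once on arrival; old events are evicted."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps currently inside the window

    def add(self, ts):
        self.events.append(ts)
        # Drop timestamps that fell out of the window (ts - window, ts].
        while self.events and self.events[0] <= ts - self.window:
            self.events.popleft()
        return len(self.events)


def batch_count(all_timestamps, now, window_seconds):
    """Offline-style equivalent: rescan the full event history."""
    return sum(1 for ts in all_timestamps if now - window_seconds < ts <= now)


history = [0, 1800, 3600, 7300]
counter = SlidingWindowCount(window_seconds=3600)
incremental = [counter.add(ts) for ts in history]
print(incremental, batch_count(history, now=7300, window_seconds=3600))
```

Both paths agree on the final count, but the incremental version touches each event once, while the batch version rescans everything it has captured.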
    • BIG DATA IN USE - OFFLINE ECOSYSTEM
      • Hadoop Technology Stack
        • Data Storage: HDFS, HBase
        • Data Processing: Map Reduce Framework
        • Data Integration: ETL, Flume, Sqoop
        • Programming Languages: Pig, Hive QL
        • Scheduling, Coordination: Zookeeper, Oozie
        • UI Framework/SDK: Hue, Hue SDK
      • Also shown: Structured Data, Unstructured Data, MPP DW, RDBMS
    • BIG DATA IN MOTION – ONLINE ECOSYSTEM
      • Diagram components: Events stream, Message Bus, Complex Event Processing (correlations, filtering, aggregations, pattern matching), In-memory data store, Offline, Decision Service, Monitoring
      • CEP enables continuous analytics on data in motion
        • Solution for velocity of big data
        • Well suited for detection, decisioning, alerting and taking actions
        • Relies on an in-memory data grid to provide low latency
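A toy illustration of the CEP operations named on this slide (filtering, aggregation, pattern matching) over an event stream follows. It is a sketch only: the event fields, the `login → change_email → payment` pattern, and the amount threshold are invented for the example, and a Python dictionary stands in for the in-memory data grid.

```python
from collections import defaultdict


def cep(events, amount_threshold=100.0,
        pattern=("login", "change_email", "payment")):
    """Yield (alert_name, account) tuples while consuming events one by one."""
    totals = defaultdict(float)  # aggregation state per account
    progress = defaultdict(int)  # pattern-matching state per account
    for ev in events:
        acct, kind = ev["account"], ev["type"]

        # Pattern matching: advance when the next expected step arrives,
        # restart on a fresh first step, otherwise reset.
        step = progress[acct]
        if kind == pattern[step]:
            step += 1
        elif kind == pattern[0]:
            step = 1
        else:
            step = 0
        if step == len(pattern):
            yield ("pattern", acct)
            step = 0
        progress[acct] = step

        # Filtering + aggregation: only payments count toward the total.
        if kind == "payment":
            totals[acct] += ev["amount"]
            if totals[acct] > amount_threshold:
                yield ("total_exceeded", acct)


stream = [
    {"account": "a", "type": "login"},
    {"account": "a", "type": "change_email"},
    {"account": "a", "type": "payment", "amount": 60.0},
    {"account": "b", "type": "payment", "amount": 30.0},
    {"account": "a", "type": "payment", "amount": 50.0},
]
alerts = list(cep(stream))
print(alerts)
```

All state lives in memory and is updated per event, which is the property the slide relies on for low latency.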
    • BIG DATA MOVEMENT EVOLUTION
      • Two-tier architecture: Offline, In-memory data store (Data Cloud)
      • Multi-tier architecture: Offline, NoSQL (persistent backing store), In-memory data store (Data Cloud), Batch
      • Initial state
        • 500 GB in 16 hours
      • Optimization – Phase 1
        • 2 TB in 16 hours
        • Split data files prepared offline
        • Maximize data load parallelism
        • Maximum data compression
        • Optimize data format
        • Validation before data movement
      • Scale – Phase 2
        • 10 TB in 6 hours
        • Add persistent NoSQL behind in-memory store
        • Blast bulk load into NoSQL store
        • Batch process will warm the cache
        • Lazy warm-up as needed, while serving reads and writes
        • Refresh cache contents via time-based evictions
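The Phase 2 ideas (a persistent store behind the in-memory store, batch warm-up, lazy warm-up on demand, and time-based eviction) can be sketched as a small read-through cache. This is an assumption-laden illustration: a plain dictionary stands in for the NoSQL store, and the class name, `ttl_seconds`, and the injectable `clock` are hypothetical.

```python
import time


class ReadThroughCache:
    """In-memory tier in front of a persistent backing store."""

    def __init__(self, backing_store, ttl_seconds, clock=time.monotonic):
        self.backing = backing_store  # stands in for the NoSQL tier
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}               # key -> (value, expires_at)

    def get(self, key):
        now = self.clock()
        entry = self.cache.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        # Miss or expired entry: lazy warm-up from the backing store.
        # Raises KeyError if the key is absent there as well.
        value = self.backing[key]
        self.cache[key] = (value, now + self.ttl)
        return value

    def put(self, key, value):
        # Write through so the persistent tier stays authoritative.
        self.backing[key] = value
        self.cache[key] = (value, self.clock() + self.ttl)

    def warm(self, keys):
        # Batch warm-up: preload selected keys after a bulk load.
        for key in keys:
            self.get(key)


now = [0.0]                       # fake clock for the example
store = {"a": 1}                  # bulk load lands here first
cache = ReadThroughCache(store, ttl_seconds=10, clock=lambda: now[0])
first = cache.get("a")            # lazy warm-up
store["a"] = 2                    # a later bulk load refreshes the store
still_cached = cache.get("a")     # served from memory until the TTL expires
now[0] = 11.0
refreshed = cache.get("a")        # time-based eviction picks up new data
print(first, still_cached, refreshed)
```

With this layout, bulk loads only need to hit the persistent store, and the in-memory tier catches up as entries expire or are requested.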
    • USE CASE: GRAPH BASED DECISIONING
      Confidential and Proprietary
      • Offline: Map/Reduce graph builder, daily incremental updates
      • Online: in-memory graph store, Online Graph Server, continuous graph updates and rollup from the events stream
      • Generate graph and associated complex variables on Hadoop on a daily basis
      • Move the incremental changes to the online in-memory graph store
      • Based on the event stream, keep graph and offline variables up to date
      • In-memory store provides fast read-only access to Decision Services
      • Decision Service reads: avg. read time 2 ms, 95th percentile 6 ms
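The flow on this slide (offline graph build, daily incremental changes moved online, continuous updates from the event stream, read-only access for decisioning) might look roughly like the sketch below. The `InMemoryGraph` class, its method names, and the account/device node labels are hypothetical; the slide does not describe the actual graph store.

```python
from collections import defaultdict


class InMemoryGraph:
    """Undirected graph held in memory as adjacency sets."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, a, b):
        # Also used for continuous updates driven by the event stream.
        self.adj[a].add(b)
        self.adj[b].add(a)

    def load_snapshot(self, edges):
        # Full graph as generated by the offline builder.
        self.adj.clear()
        for a, b in edges:
            self.add_edge(a, b)

    def apply_delta(self, added=(), removed=()):
        # Daily incremental changes moved from offline to online.
        for a, b in removed:
            self.adj[a].discard(b)
            self.adj[b].discard(a)
        for a, b in added:
            self.add_edge(a, b)

    def neighbors(self, node):
        # Read-only view for decision services.
        return frozenset(self.adj.get(node, ()))

    def degree(self, node):
        return len(self.adj.get(node, ()))


g = InMemoryGraph()
g.load_snapshot([("acct1", "device1"), ("acct2", "device1")])
g.apply_delta(added=[("acct3", "device1")], removed=[("acct1", "device1")])
g.add_edge("acct4", "device1")  # arrives via the event stream
print(sorted(g.neighbors("device1")), g.degree("device1"))
```

Returning a `frozenset` keeps readers from mutating the store, which mirrors the read-only access the slide gives to decision services.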
    • CONCLUSION
      • Hadoop is best for offline processing of high-variety, high-volume data, not for real time
      • CEP is a solution for online big data in motion (velocity) and complements Hadoop
      • Harness the true power of big data by combining offline and online data
      • Data integration is key; careful planning and optimization are needed
      • Online data stores are not optimized for highly parallel writes or bulk loads
      • Big data can solve complex problems while delivering speed and accuracy
    • THANK YOU!