Big data – can it deliver speed and accuracy v1


  • Key message – large user base, large number of lookups, stringent SLAs, big data …
  • What is big data? You will get a variety of answers depending on who you ask. Answers would include: Hadoop, large processing clusters, large data sizes, variety of data, velocity of data, etc. What really matters is the insight it provides, enabling effective decisions through a rich set of characteristics. The richness of those characteristics affects the quality of decisions. The power of big data is realized when the insight it provides is transformed effectively into business value.
  • Key Message: There is much talk about the three dimensions of big data - velocity, variety and volume. In reality, these dimensions are a source of conflicting requirements. For example, in a stock trading system, price ticks change in a matter of milliseconds (velocity). The tick data needs to be analyzed and consumed quickly to make critical trading decisions; a small delay can alter the outcome (accuracy) and cause monetary damage. To make accurate decisions, the system should utilize both historical patterns and the current pattern. For example, customers who buy diapers also buy formula milk, so you want to recommend formula milk to every customer who puts diapers in the shopping cart. But if a customer bought formula milk just a couple of hours ago, apply further filters in real time and recommend something else, e.g. shampoo for a female customer, a beer for a male customer. So the combination of historical and real-time analysis delivers the best possible decisions.
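The diapers/formula-milk note above can be sketched as a tiny decision function: an offline-mined co-purchase rule proposes a candidate, and a real-time filter over recent purchases suppresses it. This is a minimal illustration, not PayPal's system; the rule table and the 24-hour window are assumptions for the example.

```python
from datetime import datetime, timedelta

# Hypothetical co-purchase rules, as would be mined offline in batch.
HISTORICAL_RULES = {"diapers": "formula milk"}

def recommend(cart_item, recent_purchases, now, window=timedelta(hours=24)):
    """Combine a historical rule with a real-time recency filter.

    recent_purchases: list of (item, purchase_timestamp) seen in the stream.
    Returns the recommended item, or None if it was bought too recently
    (a real system would then fall back to another recommender).
    """
    candidate = HISTORICAL_RULES.get(cart_item)
    if candidate is None:
        return None
    for item, ts in recent_purchases:
        if item == candidate and now - ts < window:
            return None  # real-time filter overrides the offline suggestion
    return candidate

now = datetime(2013, 6, 1, 12, 0)
print(recommend("diapers", [], now))                                            # formula milk
print(recommend("diapers", [("formula milk", now - timedelta(hours=2))], now))  # None
```

The offline tier supplies the rule (accuracy from history); the online tier supplies the recency check (speed on data in motion) - exactly the combination the note argues for.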
  • Key Message: The very first thing is to lay out a data strategy to utilize big data effectively. Data analysis in an offline environment always lags behind for two reasons: (1) latency of data movement from the production (transactional) system to the analytics system; (2) offline analyses are performed in batch mode, hence at any given time there is a set of data not yet consumed by the offline system. That means we need to deal with two states of the data:
    1. Data in motion (online environment) has the following characteristics: data volumes are relatively smaller than in the offline environment; it is the main source of transactional, user-interaction and behavioral data, hence mainly structured (generic data models for flexibility); data is transient and can be recreated any time on the offline system using simulations; data technologies for online systems are optimized for speed and complex to maintain, hence cost is high, and they can handle terabytes of data.
    2. Data in use (offline environment) has the following characteristics: data volumes are very large; there is a large variety of data in different formats (structured, semi-structured, raw) from diverse sources; it holds the master copy of the data, generally appended to preserve a trail of changes for historical analysis; data technologies for offline systems are optimized for throughput, can handle petabytes of data or more, have a simple architecture, and can be deployed on commodity hardware.
  • You need to merge offline and online data into a comprehensive view. Hence, data integration is key. The single most important factor in keeping data integration cost low: “minimize the data bits that need to be moved.”
  • Online data technologies are not optimized for parallel/bulk data loads; a 1 TB data load can take up to 15 hours. To increase efficiency: use a less verbose data format, separate metadata from content, aggressively use compression, and partition data for parallel loading. To reduce cost: validate data before moving it to preserve data-link bandwidth, and minimize resource usage by moving only the changes to the data. To increase reliability: throttle data load speed to minimize degradation of online systems, and be sensitive to peak load times.
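The loading guidelines above can be sketched in a few lines: compress each chunk, skip unchanged chunks (move only deltas), and throttle the load rate to protect the online system. This is a hypothetical illustration; chunk hashing stands in for whatever change-detection scheme a real pipeline would use.

```python
import hashlib
import time
import zlib

def load_chunks(chunks, previously_loaded_hashes, max_chunks_per_sec=2):
    """Yield (chunk_id, compressed_bytes) only for chunks that changed.

    chunks: iterable of (chunk_id, raw_bytes), e.g. partitioned data files.
    previously_loaded_hashes: dict chunk_id -> digest from the last load,
    mutated in place so the next run moves only deltas.
    """
    interval = 1.0 / max_chunks_per_sec
    for chunk_id, data in chunks:
        digest = hashlib.sha256(data).hexdigest()
        if previously_loaded_hashes.get(chunk_id) == digest:
            continue  # unchanged: don't spend bandwidth or load capacity on it
        previously_loaded_hashes[chunk_id] = digest
        yield chunk_id, zlib.compress(data)  # aggressive compression before the wire
        time.sleep(interval)  # throttle so the online store is not degraded

seen = {}
first = list(load_chunks([(1, b"a" * 100), (2, b"b" * 100)], seen, max_chunks_per_sec=100))
second = list(load_chunks([(1, b"a" * 100), (2, b"c" * 100)], seen, max_chunks_per_sec=100))
print(len(first), len(second))  # 2 1  (second run moves only the changed chunk)
```

Partitioning the input into independent chunks is also what enables parallel loading: each chunk can be handed to its own loader, subject to the same throttle.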

    1. BIG DATA ANALYTICS Can it deliver speed and accuracy? Risk & Compliance Engineering, PayPal. Gurinder S. Grewal. This deck contains generic architecture information and does not reflect the exact details of current or planned systems. June 2013
    2. ABOUT PAYPAL • 123MM active users • 190 markets, 25 currencies • $300,000 payments processed/minute • 2B+ events/day • 12 TB new data added per day • 500K+ real time queries per second • < 100ms average response time. We are talking a lot of data… big data!
    3. WHAT IS BIG DATA? transactions, interactions, observations, petabytes of data, diverse analytics, variety of data structures, Hadoop, large number of characteristics, large map/reduce clusters, Teradata
    4. GROWING COMPLEXITY AND EXPECTATIONS Emerging technologies in the modern world are opening up possibilities for sophisticated analytics. Data infrastructure is growing, and so are the expectations - make decisions fast and with higher accuracy! [Chart: fraud sophistication and data complexity grow over time - from simple rules and black/white lists, to linear models with aggregated variables, to location/time analysis, to inline history analysis, consistency checks and networks]
    5. DECISIONS MUST BE QUICK • A gang of cyber-criminals stole $45 million in a matter of hours • More than 36,000 transactions were made worldwide and about $40 million was stolen in 6 hours. [Chart: fraud loss vs. time taken to make a decision - prevention and fast detection keep fraud loss low; slow decisions mean high fraud loss]
    6. DECISIONS MUST BE ACCURATE 11:01AM 11:05AM 11:06AM • Credit card used from three distant locations in a short time. Result based on real-time analysis: block the card? Not decided. • According to past purchasing behavior: • Card holder lives in the US - wife paid a bill online from the home PC • Card holder’s kid studies in Europe - used the card to purchase books • Card holder travels to Japan - paid for lunch. Result based on historical analysis: it’s legitimate usage
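The real-time half of the slide's check - "same card, three distant locations, minutes apart" - is essentially a geo-velocity test: if the implied travel speed between two uses exceeds what an airliner can do, the pair looks impossible. A minimal sketch (not PayPal's actual rule; coordinates, timestamps, and the 900 km/h threshold are illustrative assumptions):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def looks_impossible(tx1, tx2, max_speed_kmh=900):
    """Flag two card uses whose implied travel speed exceeds an airliner's.

    Each tx is (lat, lon, unix_seconds).
    """
    (lat1, lon1, t1), (lat2, lon2, t2) = tx1, tx2
    hours = abs(t2 - t1) / 3600.0
    if hours == 0:
        return True  # simultaneous use in two places
    return haversine_km(lat1, lon1, lat2, lon2) / hours > max_speed_kmh

# Card used in San Jose, then in Tokyo 5 minutes later: impossible travel.
print(looks_impossible((37.33, -121.89, 0), (35.68, 139.69, 300)))  # True
```

This is exactly why the slide needs the historical layer: the velocity test alone flags the family's legitimate US/Europe/Japan pattern too, and only past behavior disambiguates it.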
    7. DO WE HAVE CONFLICTING REQUIREMENTS? speed • analyze data incoming at high velocity in a split second • consume data in a timely manner to make decisions. accuracy • utilize powerful analytics techniques (text mining, predictive analysis) • process a large variety and volume of data (details). cost • can’t spend a dollar to save a penny – pick the right tool for the right job
    8. TIERED BIG DATA STRATEGY [Diagram: tiers by data age (seconds → hours → years) - real time e.g. filters, near real time e.g. correlations, offline e.g. behavioral analysis; cost and speed on one axis, data volume and accuracy on the other. Data in-motion vs. data in-use] effective decision = fn(accuracy, speed, cost)
    9. BIG DATA - COMPUTATION STRATEGY Offline (map-reduce, batch), near real-time (complex event processing), realtime (in-flow processing); offline variables vs. online variables. Realtime: • fast, very stringent availability and performance SLAs • computations are simple and eventually accurate • computations are transient, short-lived (user sessions). Near real-time: • event-driven, incremental processing • high efficiency and scalability • data for short time windows (hours). Offline: • optimized for throughput • computations are slow and accurate • data captured as events for historical analysis
    10. BIG DATA IN USE - OFFLINE ECOSYSTEM Hadoop technology stack: Data Storage (HDFS, HBase); Data Processing (Map Reduce Framework); Data Integration (ETL, Flume, Sqoop; structured data from MPP DW / RDBMS, unstructured data); Programming Languages (Pig, Hive QL); Scheduling, Coordination (Zookeeper, Oozie); UI Framework/SDK (Hue, Hue SDK)
    11. BIG DATA IN MOTION – ONLINE ECOSYSTEM Complex Event Processing: correlations, filtering, aggregations, pattern matching; backed by an in-memory data store, message bus, events stream, decision service, monitoring, and the offline tier. CEP enables continuous analytics on data in motion • Solution for the velocity of big data • Well suited for detection, decisioning, alerting and taking actions • Relies on an in-memory data grid for its ability to provide low latency
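The CEP idea on this slide - continuous filtering and aggregation over a stream held in memory - can be sketched with a sliding time window: each arriving event updates the window, old events are evicted, and a rule is evaluated inline. A toy illustration (a real CEP engine adds pattern matching, correlation across streams, and distributed state; the 3-uses-per-minute rule is an assumption):

```python
from collections import deque

class SlidingWindowCounter:
    """Count events per key within a trailing time window, in memory."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, key), oldest first

    def observe(self, ts, key):
        """Ingest one event; return how many events share its key in the window."""
        self.events.append((ts, key))
        while self.events and self.events[0][0] <= ts - self.window:
            self.events.popleft()  # evict events that fell out of the window
        return sum(1 for _, k in self.events if k == key)

counter = SlidingWindowCounter(window_seconds=60)
counter.observe(0, "card-42")
counter.observe(10, "card-42")
alert = counter.observe(20, "card-42") >= 3  # e.g. 3 uses of one card in a minute
print(alert)  # True
```

Because the decision is made as the event flows past, rather than after a batch lands in Hadoop, this is the "data in motion" half of the tiered strategy.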
    12. BIG DATA MOVEMENT Data movement between offline and online is the key and biggest challenge • ETL jobs require custom coding – the biggest bottleneck • Data transfer is very expensive and slow across networks and multiple data centers • Online data stores are not optimized for parallel or bulk loads: this slows down the data store during ETL operations and negatively impacts online application availability
    13. BIG DATA MOVEMENT EVOLUTION Initial state • 500GB in 16 hours. Optimization – Phase 1 (two-tier architecture: data cloud → in-memory data store) • 2 TB in 16 hours • Split data files prepared offline • Maximize data load parallelism • Maximum data compression • Optimize data format • Validation before data movement. Scale – Phase 2 (multi-tier architecture: data cloud → batch → NoSQL persistent backing store → in-memory data store) • 10 TB in 6 hours • Add persistent NoSQL behind the in-memory store • Blast bulk load into the NoSQL store • Batch process warms the cache • Lazy warm-up as needed, while serving r/w • Refresh cache contents via time-based evictions
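The Phase 2 read path above - an in-memory cache in front of a persistent NoSQL store, with lazy warm-up on miss and time-based eviction so refreshed bulk loads eventually become visible - can be sketched as follows. A plain dict stands in for the NoSQL store, and the injectable clock is an assumption made for testability.

```python
import time

class TieredStore:
    """In-memory cache over a persistent backing store, with TTL eviction."""

    def __init__(self, backing_store, ttl_seconds, clock=time.time):
        self.backing = backing_store  # stands in for the NoSQL store
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}               # key -> (value, loaded_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None:
            value, loaded_at = entry
            if self.clock() - loaded_at < self.ttl:
                return value          # warm hit served from memory
            del self.cache[key]       # time-based eviction
        value = self.backing.get(key) # lazy warm-up from the backing store
        if value is not None:
            self.cache[key] = (value, self.clock())
        return value

backing = {"user:1": "profile-v1"}
fake_now = [0.0]
store = TieredStore(backing, ttl_seconds=300, clock=lambda: fake_now[0])
print(store.get("user:1"))        # profile-v1 (miss, warmed from backing store)
backing["user:1"] = "profile-v2"  # bulk refresh lands in the backing store
fake_now[0] = 400.0
print(store.get("user:1"))        # profile-v2 (TTL expired, re-warmed)
```

The key property matches the slide: the bulk load blasts into the persistent tier without touching the serving cache, and TTL-driven re-warming propagates the refresh without a stop-the-world reload.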
    14. Confidential and Proprietary. USE CASE: GRAPH BASED DECISIONING Map/Reduce graph builder → in-memory graph store → online graph server; daily incremental updates; continuous graph updates and rollup • Generate the graph and associated complex variables on Hadoop on a daily basis • Move the incremental changes to the online in-memory graph store • Based on the event stream, keep the graph and offline variables up to date • The in-memory store provides fast read-only access to decision services (avg. read time: 2ms, 95th percentile: 6ms)
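The flow on this slide - a daily batch build of the graph plus continuous edge updates from the event stream - can be sketched with an adjacency map. This is a schematic stand-in for the in-memory graph store, and the account/device node naming is an illustrative assumption.

```python
class OnlineGraph:
    """Online copy of a graph built offline, kept fresh by streamed events."""

    def __init__(self, daily_snapshot):
        # adjacency map: node -> set of linked nodes, produced by the
        # map/reduce graph builder in the daily batch.
        self.adj = {node: set(edges) for node, edges in daily_snapshot.items()}

    def apply_event(self, src, dst):
        """Incremental update from the event stream: link two entities."""
        self.adj.setdefault(src, set()).add(dst)
        self.adj.setdefault(dst, set()).add(src)

    def neighbors(self, node):
        """Read-only lookup used by decision services."""
        return self.adj.get(node, set())

g = OnlineGraph({"acct-1": ["device-9"], "device-9": ["acct-1"]})
g.apply_event("acct-2", "device-9")     # event arrives after the daily build
print(sorted(g.neighbors("device-9")))  # ['acct-1', 'acct-2']
```

A decision service can then ask, for example, how many accounts share a device - a graph variable that stays current between batch loads because the stream keeps writing into the same in-memory structure the readers query.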
    15. CONCLUSION • Hadoop is best for offline processing of a variety and volume of data – not for real time • CEP is a solution for online big data in motion (velocity); it complements Hadoop • Harness the true power of big data by combining offline and online data • Data integration is key – careful planning and optimization are needed • Online data stores are not optimized for highly parallel writes or bulk loads • Big data can solve complex problems while delivering speed and accuracy