Information excellence 2012feb_sungard_aditya_yadav real time hadoop analytics with raptor

Information Excellence 2012 Spring Summit "DATA DYNAMICS"

Realtime Hadoop Analytics with Raptor
Aditya Yadav, Head, ATS/R&D at SunGard India

Presentation Transcript

  • Raptor: Real-time Analytics on Hadoop
    Information Excellence Summit, February 25, 2012, Bangalore (http://informationexcellence.wordpress.com/)
    Aditya Yadav, ATS/Research Head, SunGard India
    Soundar Velu, Product Architect, Advanced Technology, SunGard
    Proprietary and Confidential. Not to be distributed or reproduced without permission. www.sungard.com/globalservices
  • Aditya Yadav, Head, ATS/R&D, SunGard India
    Aditya Yadav heads ATS/R&D at SunGard India, an applied research and consulting group that works with emerging technologies such as Cloud Computing, BigData/Hadoop, GPGPU, Visualization, Statistics and Analytics. Aditya led the team that created Raptor (Realtime Hadoop Analytics) and is now working on bringing OLAP over Big Data to the open source community. He previously worked at Thoughtworks on Internet-scale systems, Agile coaching and as a Cloud Computing evangelist. He is the author of a dozen open source projects and 7+ books and reports, ran a boutique consulting company, and was the CTO at one of the Top 25 Indian startups.
  • Soundararajan Velu, Product Architect, SunGard
    Soundararajan Velu is a product architect with the Advanced Technology team at SunGard. With extensive experience building enterprise applications and distributed computing products, Soundar specializes in building large-scale applications and frameworks using Apache Hadoop and related technologies. He consults on building low-latency applications and Service Oriented Architectures. Soundar is the creator of several widely reused products, including SOA Accelerator, CORL Engine and Raptor. He holds a Computer Science and Engineering degree with distinction from VTU, India.
  • Who We Are, Who Am I, What We Do?
      • Who We Are & What We Do?
          • Fortune 500 company
          • Financial services firm
          • Provide software and consulting services across the industry
          • Exploring the impact of the Big Data approach for the last 2+ years
      • Who Am I?
          • Part of the Applied Research & Consulting group based out of Bangalore, India
          • We focus on the latest technology trends
  • Agenda
      • Financial Services and Data Problems
      • Raptor Architecture Overview
      • Raptor Components
      • Benchmarks
      • Future Enhancements
  • Financial Services & Typical “data intensive” Problems
      • Wholesale Markets: Portfolio Construction, Price Forecasting, Risk Calculation, Compliance Surveillance, Batch Processing
      • Retail Markets: Fraud Detection, Targeted Marketing, Demand Forecasting
  • Implementation Constraints in Financial Services
      • RDBMS Centricity: constrained semantics, rigidity and specificity
      • Legacy Burden: architectures, languages, tools, systems, skills
      • Data Silos, Governance: “You don’t own that data, I do”
      • Cost of Change: evolution is the only answer
  • What Did We Need?
      • A generic, reliable and cost-effective analytics solution with a wide range of application areas
      • Query execution and analytics within soft real-time windows (acceptable and consistent latencies)
      • Minimum customization, seamless integration and ease of use
      • Policy around data storage and processing
      • Adaptive segmentation algorithms for optimized data search (indexing, partitioning and filtering)
  • Some Limitations with Traditional Hadoop
      • Jobs are executed in a brute-force fashion, causing complete scans of files for every single query
      • Long warm-up time for jobs; performs poorly on relatively small data sets
      • Scheduling imbalance between HDFS operations and job execution
      • Limitations on the kinds of jobs that can be executed with MapReduce (non-equality joins are not supported)
      • Open bugs, missing features and memory management bottlenecks
      • Suited only to batch-mode applications; does not fit real-time analytics scenarios
  • Raptor Application Stack
    [Diagram: layered stack with ZooKeeper running alongside all layers. Top to bottom: Flume, Scribe, Sqoop; Hue, Raptor, Oozie; Hive/Pig; HBase, Map Reduce, Quantum Processor; HDFS.]
  • Raptor Architecture
  • Raptor Digester - Level 1 Segmentation
    Table-specific data transformer and partitioner: partition based on demographics, customers, IP address or composite columns. Store partitioned data into the respective file buckets in HDFS.
    [Diagram labels: Database, Data Stream, Logs/Files; Data load adaptors; Digester; Data Cleanser call back; Data Transformer call back; Raptor Client; Storage buckets; HDFS.]
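The Level 1 segmentation described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not Raptor's Digester: the `digest` function, the cleanser and transformer callbacks, the record layout and the CRC-based bucket choice are all hypothetical, and a dict of lists stands in for file buckets in HDFS.

```python
import zlib
from collections import defaultdict


def digest(records, partition_columns, num_buckets, cleanse=None, transform=None):
    """Route records into buckets keyed by a (possibly composite) column value."""
    buckets = defaultdict(list)
    for record in records:
        if cleanse is not None:
            record = cleanse(record)
            if record is None:  # cleanser callback rejected the record
                continue
        if transform is not None:
            record = transform(record)
        key = "|".join(str(record[c]) for c in partition_columns)
        # crc32 is stable across runs, unlike Python's salted hash() for str
        bucket_id = zlib.crc32(key.encode("utf-8")) % num_buckets
        buckets[bucket_id].append(record)
    return dict(buckets)


def drop_bad_amounts(record):
    """Hypothetical cleanser callback: reject rows whose amount is not numeric."""
    try:
        float(record["amt"])
    except ValueError:
        return None
    return record


def amounts_to_float(record):
    """Hypothetical transformer callback: parse the amount column."""
    return {**record, "amt": float(record["amt"])}


records = [
    {"cno": "C1", "ip": "10.0.0.1", "amt": "12.5"},
    {"cno": "C2", "ip": "10.0.0.2", "amt": "bad"},
    {"cno": "C1", "ip": "10.0.0.1", "amt": "7.0"},
]
buckets = digest(records, ["cno", "ip"], 4, drop_bad_amounts, amounts_to_float)
```

Records that share the composite key (`cno`, `ip`) always land in the same bucket, which is what lets a later query touch only the buckets that can contain its key.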
  • Raptor Block Analyzer - Level 2 Segmentation
    Every block is asynchronously analyzed, and the resulting column-level bloom filters, index tree (block index map) and column-level meta information are stored in ThorCache. All this happens within the DataNode context.
    [Diagram labels: Raptor client; Write Block; DataXceiver; BlockReceiver; Offer BlockID for async-Indexing; IndexFactory; Get Indexer for Block; DataBlock Index Queue; Asynchronously analyze/Index Block; Block Analyzer; DataBlock Indexer; BloomFilter; Thor Block Index Cache.]
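A minimal sketch of the asynchronous flow on this slide: the write path only enqueues a block ID, and a background worker analyzes the block and stores the result in a cache. All names here (`write_block`, `indexer_worker`, `index_cache`) are hypothetical, the dicts stand in for DataNode block storage and ThorCache, and bloom filters are left out to keep the sketch short.

```python
import queue
import threading


def analyze_block(rows, columns):
    """Build column meta information and a value -> row-offset index for one block."""
    meta = {c: {"min": None, "max": None, "count": 0} for c in columns}
    index = {c: {} for c in columns}
    for offset, row in enumerate(rows):
        for c in columns:
            value = row[c]
            m = meta[c]
            m["min"] = value if m["min"] is None else min(m["min"], value)
            m["max"] = value if m["max"] is None else max(m["max"], value)
            m["count"] += 1
            index[c].setdefault(value, []).append(offset)
    return {"meta": meta, "index": index}


block_store = {}             # block_id -> rows; stand-in for blocks on a DataNode
index_cache = {}             # block_id -> analysis; stand-in for ThorCache
index_queue = queue.Queue()  # block IDs waiting to be indexed


def indexer_worker():
    while True:
        block_id = index_queue.get()
        if block_id is None:  # sentinel: stop the worker
            index_queue.task_done()
            break
        index_cache[block_id] = analyze_block(block_store[block_id], ["cno"])
        index_queue.task_done()


def write_block(block_id, rows):
    """The write path returns immediately; indexing happens in the background."""
    block_store[block_id] = rows
    index_queue.put(block_id)


worker = threading.Thread(target=indexer_worker, daemon=True)
worker.start()
write_block("blk_1", [{"cno": "C1"}, {"cno": "C2"}, {"cno": "C1"}])
index_queue.join()           # wait until the block has been analyzed
index_queue.put(None)
worker.join()
```

Keeping the analysis off the write path means a slow indexer delays only index availability, not block writes.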
  • Raptor Client Framework
    Raptor Client is similar to dfsclient or jobclient; it is the primary interface into Raptor.
    [Diagram: Raptor Client at the center, connected to the Block Analyzer, Code Generator, NameNode, Raptor Executor, Digester, DataNode, Raptor Optimizer, Adaptive Engine, Raptor-Hive and Policy Manager. Operations listed, in extraction order: Get Indexer for block; Mirror Indexer to replica nodes; Deploy Generated Code; Get block info for file; Get file stats; Execute internal task; Clean up task; Get result segments; Get Digester for table; Invoke Quantum Processor Services; Get Analyzer for table; Manage Indexes; Execute Jobs; Get Block analyzers; Deploy generated infrastructure patch; Re-Index Block; Get analyzer for table; Re-Balance blocks; Get Indexer for table; Get adaptive metrics; Get indexer for block/table; Get Hive Meta-data for table; Get adaptive metrics for table; Auto shard buckets; Get Raptor Meta Data for table; Drop indexes for buckets; Get digester for Table; Archive Buckets; Deploy generated infrastructure patch.]
  • Getting Results Out - End to End Flow
    Client applications create a Hive JDBC connection to Raptor-HiveServer and fire HQL queries over the connection. The query is executed by Raptor and a Hive ResultSet is returned as the result. This is the interface used by Raptor-Hive clients to access data in HDFS, e.g. “Select Name, CNO, TxID, floor(TxAmt) from Credit_Tx order by CNO”. Non-HQL queries are executed directly with the Raptor Client.
    [Diagram labels: Client Application; Execute HQL Statement; Hive JDBC Connection; ResultSet; Thrift Client; Thrift Hive Server; Fetch Task Execution; Fetch Results; Write Results; HDFS; Raptor Client; Quantum Processor.]
  • Raptor-Hive Processing Engine
    [Diagram labels: JDBC Client; HiveServer (Compile, Optimize, Execution Plan); Hive MetaStore; Raptor Optimizer; Get Table Block Analyzers; Datanodes; Raptor MetaStore; Raptor Executor; Raptor Adaptive Engine (metrics published to Adaptive engine); Raptor Client; MapReduce; HBase; Raptor Quantum Processor.]
  • Quantum Processor Framework
    QuantumNode is a service similar to DataXceiver on the DataNode. All Raptor operations are executed by the Quantum Processor framework.
    [Diagram labels: Hive Executor; RaptorClient; RaptorProtocolHelper; Quantum Node; QuantumProcessor. Operations: GetNodeStats; SearchBlock; ReadResultSegment; Manage Indexes; ReadBoundaryRecord; ExecuteInternalTask; Get File analyzers; ExecuteExternalTask.]
  • Raptor Block Searcher/Processor
    Block Searcher analyzes the block indexes and bloom filters and extracts the matching indexed records via an index-based offset reader; if no indexes are available, it does a complete block scan.
    [Diagram: Hive Executor and RaptorClient send Process/Search Block to the QuantumProcessor, which calls the BlockSearcher. The BlockSearcher consults the Thor Block Index Cache (block index, column-level bloom, column meta data). If a block index/bloom is available, it searches the index tree and reads result records through the Index Block Reader; if not, the Seq Block Reader processes all records. Records pass through the Quantum Operator to the Result Writer and HDFS.]
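The decision flow on this slide (bloom filter check, then index lookup, otherwise a full scan) can be sketched as follows. This is an assumption-laden illustration rather than Raptor's code: `BloomFilter`, `search_block` and the dict-based offset index are hypothetical stand-ins for the column-level bloom, index tree and block readers.

```python
import hashlib


class BloomFilter:
    """Small bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def search_block(rows, column, value, block_index=None, bloom=None):
    """Return (matching rows, strategy used) for an equality predicate."""
    if bloom is not None and not bloom.might_contain(value):
        return [], "skipped"  # value is definitely not in this block
    if block_index is not None:
        offsets = block_index.get(value, [])
        return [rows[o] for o in offsets], "index"  # offset-based read
    return [r for r in rows if r[column] == value], "scan"  # complete block scan


rows = [{"cno": "C1", "amt": 10}, {"cno": "C2", "amt": 20}, {"cno": "C1", "amt": 30}]
bloom = BloomFilter()
block_index = {}
for offset, row in enumerate(rows):
    bloom.add(row["cno"])
    block_index.setdefault(row["cno"], []).append(offset)

indexed = search_block(rows, "cno", "C1", block_index, bloom)
scanned = search_block(rows, "cno", "C1")
```

Both calls return the same rows; only the strategy differs, which is why the index and bloom filter can be built or dropped without changing query results.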
  • Quantum Operator
    Quantum Operator takes block records as input, performs the required field inspections, and applies UDFs and aggregation. The resulting records are written either to local segment files or directly to HDFS, depending on the operator type.
      • If the operator is a reduce operator, it reads k remote segment files.
      • If the operator has a reduce child, it writes results to k-segmented local files; otherwise results go to HDFS.
      • Quantum Operators: SelectOperator, JoinOperator, SortByOperator, GroupByOperator, GPUOperator, etc.
    [Diagram labels: Block Searcher; Evaluate rows; Quantum Operator; Read Remote Seg; Offer result rows; Inspect Rows; Apply Join/Merge; Apply UDF; Has reduce child? (Yes/No); Write Result; Local Segment File; HDFS.]
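One way to picture the k-segment hand-off between an operator and its reduce child is the sketch below. It is an illustration only: the function names are hypothetical, in-memory lists stand in for local segment files, and CRC-based hashing is an assumed way of assigning keys to segments.

```python
import math
import zlib
from collections import defaultdict


def segment_of(key, k):
    """Assign a key to one of k segments; the same key always maps to the same segment."""
    return zlib.crc32(str(key).encode("utf-8")) % k


def map_side_operator(rows, key_column, udf, k):
    """Apply a row-level UDF, then split the output into k segments by key."""
    segments = [[] for _ in range(k)]
    for row in rows:
        out = udf(row)
        segments[segment_of(out[key_column], k)].append(out)
    return segments


def reduce_side_group_sum(all_node_segments, segment_id, key_column, value_column):
    """Read segment `segment_id` from every node and aggregate it."""
    totals = defaultdict(float)
    for node_segments in all_node_segments:
        for row in node_segments[segment_id]:
            totals[row[key_column]] += row[value_column]
    return dict(totals)


def floor_amount(row):
    """Hypothetical UDF, mirroring floor(TxAmt) from the example query."""
    return {**row, "amt": float(math.floor(row["amt"]))}


K = 2
node_a = map_side_operator(
    [{"cno": "C1", "amt": 10.9}, {"cno": "C2", "amt": 5.2}], "cno", floor_amount, K)
node_b = map_side_operator([{"cno": "C1", "amt": 1.5}], "cno", floor_amount, K)

result = {}
for segment_id in range(K):
    result.update(reduce_side_group_sum([node_a, node_b], segment_id, "cno", "amt"))
```

Because every node uses the same key-to-segment mapping, each key is aggregated by exactly one reduce operator, so the per-segment results can simply be merged.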
  • Adaptive Indexing
    The Raptor adaptive engine collects metrics from the executed queries. Metrics include the columns and tables involved in the query, the types of aggregation executed on columns, etc. The adaptive index scheduler uses these usage metrics to schedule re-indexing of the blocks.
    [Diagram labels: Execute Hive Queries; Collect Execution Metrics; Raptor Adaptive Engine; Raptor Metadata; Adaptive Index Scheduler; Asynchronous Adaptive Indexer; Re Analyze/Index Blocks based on usage/query Metrics; HDFS DataNodes; ThorIndex Cache.]
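A small sketch of how usage metrics could drive re-indexing decisions. The `AdaptiveEngine` class, its thresholds and the "most used, not yet indexed" rule are assumptions made for illustration; the slides do not describe the scheduler's actual policy.

```python
from collections import Counter


class AdaptiveEngine:
    """Collects per-column usage from executed queries (hypothetical design)."""

    def __init__(self):
        self.usage = Counter()  # (table, column) -> number of queries using it

    def record_query(self, table, predicate_columns, aggregated_columns=()):
        for column in list(predicate_columns) + list(aggregated_columns):
            self.usage[(table, column)] += 1

    def reindex_candidates(self, indexed, min_uses=2, limit=3):
        """Most-used (table, column) pairs that are not indexed yet."""
        hot = [
            (pair, uses)
            for pair, uses in self.usage.most_common()
            if uses >= min_uses and pair not in indexed
        ]
        return [pair for pair, _ in hot[:limit]]


engine = AdaptiveEngine()
engine.record_query("Credit_Tx", ["CNO"])
engine.record_query("Credit_Tx", ["CNO"], ["TxAmt"])
engine.record_query("Credit_Tx", ["Name"])

to_index = engine.reindex_candidates(indexed=set())
already_done = engine.reindex_candidates(indexed={("Credit_Tx", "CNO")})
```

Here only `CNO` crosses the usage threshold, so it is the single column proposed for re-indexing until it appears in the indexed set.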
  • Code Generation and Data Definition Framework
    The user supplies the table schema in an XML format. Based on the XML table schema and hints, the Raptor code generator generates metadata in the Hive metadata store, HBase and the Raptor metadata store. It also generates table-specific indexer, column analyzer and bloom filter classes.
    [Diagram labels: Raptor Code Generator; Create Hive Table Definitions (HiveServer, Hive Metadata, HBase); Create Raptor Table Metadata (Raptor Adaptive Engine, Raptor Metadata); Raptor Client; Generate Table Specific Indexers, Bloom Filter, Column Analyzers and digester; Deploy Generated code to Raptor-HiveServer and DataNodes.]
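To make the schema-driven generation concrete, here is a sketch that turns an XML table schema into a Hive table definition plus index hints. The XML layout (`table`, `column`, and the `index` and `bloom` attributes) is invented for this example; the slides do not show Raptor's real schema format or generated classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema format, not Raptor's actual one.
SCHEMA_XML = """
<table name="Credit_Tx">
  <column name="Name" type="string"/>
  <column name="CNO" type="string" index="true"/>
  <column name="TxID" type="bigint" bloom="true"/>
  <column name="TxAmt" type="double"/>
</table>
"""


def generate(schema_xml):
    """Return (Hive DDL string, hints dict) derived from the XML schema."""
    root = ET.fromstring(schema_xml.strip())
    columns = root.findall("column")
    column_defs = ", ".join(f"{c.get('name')} {c.get('type')}" for c in columns)
    ddl = f"CREATE TABLE {root.get('name')} ({column_defs})"
    hints = {
        "indexed": [c.get("name") for c in columns if c.get("index") == "true"],
        "bloom": [c.get("name") for c in columns if c.get("bloom") == "true"],
    }
    return ddl, hints


ddl, hints = generate(SCHEMA_XML)
```

The hints are the part a generator would use to decide which table-specific indexer and bloom filter classes to emit; the DDL is what would be registered with the Hive metadata store.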
  • Raptor Policy Manager - Time Based Shard
    Time-based auto sharding, as per the policy defined by the user. Table buckets are moved from highly segmented time shards to a sparsely segmented archive store over time.
    [Diagram: Raptor Policy Manager and Raptor Metadata above three HDFS tiers, each holding Bucket-1, Bucket-2 and Bucket-3:]
      • Primary Time-Shard: completely segmented
      • Secondary Time-Shard-1 …: partially segmented
      • Archive Store: all segments merged and indexes dropped
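A time-based policy of this kind reduces to mapping a bucket's age onto a tier. The sketch below assumes a hypothetical policy of 30 and 180 days and hypothetical tier names; the actual policy is user-defined and not specified in the slides.

```python
from datetime import date, timedelta

# Hypothetical policy: (maximum age in days, tier), checked from newest to oldest.
POLICY = [
    (30, "primary"),     # completely segmented
    (180, "secondary"),  # partially segmented
]
ARCHIVE = "archive"      # all segments merged, indexes dropped


def tier_for(bucket_date, today, policy=POLICY):
    """Pick the tier a bucket belongs in, based on its age."""
    age_days = (today - bucket_date).days
    for max_age, tier in policy:
        if age_days <= max_age:
            return tier
    return ARCHIVE


def plan_moves(buckets, today):
    """buckets: name -> (date, current tier). Returns the moves the policy requires."""
    moves = []
    for name, (bucket_date, current) in sorted(buckets.items()):
        target = tier_for(bucket_date, today)
        if target != current:
            moves.append((name, current, target))
    return moves


today = date(2012, 2, 25)
buckets = {
    "bucket-1": (today - timedelta(days=5), "primary"),
    "bucket-2": (today - timedelta(days=90), "primary"),
    "bucket-3": (today - timedelta(days=400), "secondary"),
}
moves = plan_moves(buckets, today)
```

Running the planner periodically yields only the buckets whose tier has changed, so recent data keeps its full indexes while older data drifts toward the archive store.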
  • Infrastructure Enhancements
      • Adaptive compression based on network congestion statistics
      • Scheduling of jobs via Raptor client (via Hive) in conjunction with an enhanced NameNode block policy
      • Computation-intensive jobs scheduled on GPU-enabled nodes
      • Object pools across the Hadoop-Raptor ecosystem
      • Handshake mechanism between clients and DataNodes to avoid imminent operation failures
      • Interactive user console for managing the cluster, tasks, data policy, filters, etc.
  • Benchmark
    Raptor benchmark carried out on a 5-node cluster with commodity hardware (Core2-Duo, 4GB RAM, 1TB storage, 100Mbps NIC, 64MB block size, replication factor 3, network compression enabled). Table with 13 columns of mixed types. Sample data generated with node.js with moderate entropy. The table has 1 group index (3 columns), 0 column bloom filters and 1 bucket.

                                         Raptor                   Map/Reduce
    Operation                            256MB   1GB    10GB      256MB   1GB     10GB
    Load Time                            4.8s    22s    4.2m      54.6s   3.5m    36.0m
    Simple select without predicate      6.9s    22s    3.2m      21.6s   50.5s   9.5m
    Select with complex predicate        0.8s    1.7s   0.9m      9.2s    32.2s   4.1m
    Select with Order By                 1.6s    6.5s   1.1m      41.0s   3.2m    5.6m
    Select with Group By                 0.6s    1.2s   0.8m      61.3s   1.5m    4.0m
    Simple Join                          2.9s    5.7s   1.7m      1.2m    3.4m    9.2m
    User-defined row-level function      0.6s    1.1s   0.9m      13.2s   49s     7.3m
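The speed-up implied by the benchmark figures can be worked out directly from the published timings. The snippet below is only arithmetic on numbers copied from the 10GB columns of this slide; the helper name `to_seconds` and the choice of rows are ours.

```python
def to_seconds(text):
    """Convert a timing such as '4.2m' or '54.6s' to seconds."""
    value, unit = float(text[:-1]), text[-1]
    return value * 60 if unit == "m" else value


# (Raptor, Map/Reduce) timings copied from the 10GB columns of the benchmark.
TEN_GB = {
    "Load Time": ("4.2m", "36.0m"),
    "Select with complex predicate": ("0.9m", "4.1m"),
    "Select with Group By": ("0.8m", "4.0m"),
    "Simple Join": ("1.7m", "9.2m"),
}

# Map/Reduce time divided by Raptor time, rounded to one decimal place.
speedups = {
    operation: round(to_seconds(mapreduce) / to_seconds(raptor), 1)
    for operation, (raptor, mapreduce) in TEN_GB.items()
}

for operation, factor in speedups.items():
    print(f"{operation}: {factor}x")
```

For these rows the ratio at 10GB falls roughly between 4x and 9x; the smaller table sizes in the benchmark show larger ratios for most operations.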
  • Response-Time Trends
    [Two charts of response time in seconds against table size (256MB, 1GB, 10GB, 50GB, 100GB). Map/Reduce: almost linear growth, y-axis up to 2000 seconds. Raptor: almost logarithmic growth, y-axis up to 120 seconds.]
  • Case Study: Credit Card Fraud Detection
    [Diagram: Scribe Clients feed a Scribe Server, which feeds the Raptor Digester and Cache. Components: Raptor-Hadoop (Raptor Policy Manager, MapReduce, Quantum Processor), Raptor Classifier, Raptor-Hive, Analytics Results & Reports, Flagging Cluster, Notifications Cluster, PG SQL, Credit Card Fraud Application.]
      • Segment data into subsets/buckets/partitions
      • Apply data mining techniques to generate classifiers in parallel
      • Pruning: removes redundant classifiers from the ensemble
      • Combine the resulting base models to generate a meta-classifier
      • Raptor-Hive uses meta-classifiers to flag transactions
      • Periodic fraud report generation
      • The flagging cluster sends out notifications
  • Other Raptor Use Cases
      • Smart customer care solution
      • Financial fraud analytics
      • Media usage log analytics
      • Computation-intensive jobs using GPUs
      • Predictive trading
  • Roadmap
      • Merge Raptor into Next Generation MapReduce
      • Dimensions and Cubes (MRCubes)
      • Cloud-ready Raptor
      • Roles and security
      • Job failover management
      • Rules and triggers
      • Planning to open source
      • Starter kits/examples of various use cases
  • Benefits
      • Query responses within soft real-time windows
      • Code generation framework for table-specific Raptor infrastructure code
      • Zero downtime, with hot deployment of generated infrastructure code
      • Seamless integration into existing infrastructure with multiple ingress options
      • Automatic time-based sharding and data archival
      • Distribute and execute non-MR jobs on the Hadoop cluster
  • Contacts
      • Soundar Velu, Product Architect, Soundararajan.velu@sungard.com
      • Aditya Yadav, ATS/Research Head, Aditya.Yadav@sungard.com, Twitter: @greyquest
  • References
      • Optimizing Distributed Joins with Bloom Filters, www.l3s.de/web/upload/documents/1/analysis.pdf
      • Apache Hadoop Goes Realtime at Facebook, borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf
      • Data Mining with MAPREDUCE: Graph and Tensor Algorithms with MR, www.ml.cmu.edu/research/dap-papers/tsourakakisdap.pdf
      • Distributed Cube Materialization on Holistic Measures, www.eecs.umich.edu/~congy/work/icde11a.pdf
      • HadoopDB in Action: Building Real World Applications, www.cs.yale.edu/homes/dna/papers/hadoopdb-demo.pdf
      • TeraByte Sort on Apache Hadoop, sortbenchmark.org/YahooHadoop.pdf
      • Mahout in Action, www.amazon.com/Mahout-Action-Sean-Owen/dp/1935182684
      • Web-Scale K-Means Clustering, www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
      • Hive - A Petabyte Scale Data Warehouse Using Hadoop, infolab.stanford.edu/~ragho/hive-icde2010.pdf
  • CONFIDENTIALITY STATEMENT
    Copyright © 2011 by SunGard Data Systems (or its subsidiaries, “SunGard”). All rights reserved. No parts of this document may be reproduced, transmitted or stored electronically without SunGard’s prior written permission. This document contains SunGard’s confidential or proprietary information. By accepting this document, you agree that: (A)(1) if a pre-existing contract containing disclosure and use restrictions exists between your company and SunGard, you and your company will use this information subject to the terms of the pre-existing contract; or (2) if no such pre-existing contract exists, you and your company agree to protect this information and not reproduce or disclose the information in any way; and (B) SunGard makes no warranties, express or implied, in this document, and SunGard shall not be liable for damages of any kind arising out of use of this document.