Hadoop World 2011: Raptor: Real-time Analytics on Hadoop - Soundararajan Velu - Sungard
Raptor combines Hadoop & HBase with machine learning models for adaptive data segmentation, partitioning, bucketing, and filtering to enable ad-hoc queries and real-time analytics. Raptor has intelligent optimization algorithms that switch query execution between HBase and MapReduce. Raptor can create per-block dynamic bloom filters for adaptive filtering. A policy manager allows optimized indexing and autosharding. This session will address how Raptor has been used in prototype systems in predictive trading, times-series analytics, smart customer care solutions, and a generalized analytics solution that can be hosted on the cloud.

Published in: Technology, Business
Notes
  • Cloud Computing, Distributed Computing, Big Data, Low Latency, Mobile Computing, GPU/FPGA, NG UI Frameworks, Functional Languages, DSLs, etc…
  • Financial services firms are investigating next-generation technologies for data analytics. Data growth – particularly of unstructured data – poses a special challenge as the volume and diversity of data types outstrip the capabilities of older technologies such as relational databases. Predictive analysis of both internal and external data results in better, proactive management of a wide range of issues: credit and operational risk (e.g. fraud), reputational risk, customer loyalty and profitability, etc. Banks, mutual funds, and insurance companies need to transform their data into actionable information to comply with government and industry regulations, manage risk, increase revenue in a competitive economy, identify fraud, and improve efficiency.
  • Financial services firms worldwide are struggling with the challenges of Big Data analytics management: analyzing massive volumes of complex data from many sources in order to better serve and retain customers, find and effectively recruit new prospects, detect and prevent fraud, manage assets, and streamline operations. Financial services companies are taking advantage of Big Data analytics to increase profit margins, reduce risk, and solidify relationships with profitable customers. Now more than ever, a powerful, scalable, affordable, and simple Business Intelligence (BI) solution for business users is key to making the right decisions that drive overall success. Data in financial services: use it or lose it! Financial services is a domain where what you do with data, and what you don't do with data, can have a dramatic impact on your success. The Big Data movement is attracting both investment and press, horizontally as an analytics infrastructure and vertically with domain-specific tools. In financial services, distributed computing and parallel processing have been employed for many years, so what is the utility of Big Data for these well-known problems? Is the size of the data being analyzed the real challenge, or is it the ability to evolve insight and analytic capabilities on what, in relative terms, may really be "small data"? This presentation will discuss how time, cost, and information asymmetry are critical competitive advantages in financial services, and how the current state of the art is in need of some help from the Big Data community. We will also discuss the hardware and software used for typical computation and data-processing workloads in financial services, and why distributed computing is still reserved for relatively few niche applications. We will conclude with a discussion of the barriers to adoption of new technology in financial services, why they exist, and how they might be overcome through greater academic and industry collaboration. Outline: Big Data – why the allure? why the marketing? how is innovation marketed? what is reality, what is hype? Thesis – can Big Data solve our challenges in financial services? Financial services as a sector: wholesale problems (risk, forecasting, processing, compliance) and retail problems (forecasting, processing, compliance). Data-intensive computing in financial services: for processing and for alpha; in two flavors, time and size; inter-machine (grids, Big Data?) and intra-machine (GPUs, FPGAs); data-intensive algorithms: graphs, linear time series and events, calculus. A second look at data-intensive computing: cloud economics, consumer successes, social media, digital exhaust, competition. Barriers to be overcome, and barriers created by the burden of data-constrained computing: legacy burden, data silos and governance, jailed semantics with relational DBs.
  • One of the most promising technologies for dealing with this "big data" problem is the Apache Hadoop MapReduce framework. However, current implementations of Hadoop lack the enterprise robustness demanded by financial services firms for wide-scale deployment of MapReduce applications.
  • Hadoop is a perfect fit for processing petabytes of data distributed over a large cluster of nodes. It was developed with a fault-tolerant design from the ground up, which makes it very resilient to hardware and software failures on nodes. The allocation of work to TaskTrackers is very simple: every TaskTracker has a number of available slots (such as "4 slots"), and every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data that has an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability. If one TaskTracker is very slow, it can delay the entire MapReduce job – especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.
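The slot-plus-locality placement described above can be sketched in a few lines. This is a simplified illustration, not Hadoop's actual JobTracker code; the function and host names are hypothetical.

```python
def assign_task(task_block_hosts, tracker_slots):
    """Pick a TaskTracker for a task: prefer a tracker that holds the task's
    data block (data locality), else fall back to any tracker with a free
    slot. Note there is no check of actual machine load, only slot counts."""
    # Data-local candidates first.
    for host in task_block_hosts:
        if tracker_slots.get(host, 0) > 0:
            tracker_slots[host] -= 1
            return host
    # Fall back to any tracker with a free slot.
    for host, free in tracker_slots.items():
        if free > 0:
            tracker_slots[host] -= 1
            return host
    return None  # no capacity anywhere: the task waits
```

The sketch makes the limitation in the text concrete: a heavily loaded node with a nominally free slot still wins over a lightly loaded remote node.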
  • Raptor within the Hadoop ecosystem: Raptor is a specialized analytics platform, just as Hive is a generalized analytics platform; it is a transparent addition to the Hive system. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows; it is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms; the system is centrally managed and allows for intelligent dynamic management, and it uses a simple, extensible data model that allows for online analytic applications. Scribe is a server for aggregating log data streamed in real time from a large number of servers; it is designed to be scalable, extensible without client-side modification, and robust to failure of the network or any specific machine. Oozie is an open-source workflow/coordination service to manage data processing jobs for Apache Hadoop; it is an extensible, scalable, and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig, and MapReduce). Oozie is a lot of things, including a workflow solution for off-Hadoop processing, but another query-processing API, a la Cascading, is not one of them. Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool that imports individual tables or entire databases to files in HDFS, generates Java classes that let you interact with your imported data, and can import from SQL databases straight into your Hive data warehouse. Hue is a browser-based desktop interface for interacting with Hadoop; it supports a file browser, job tracker interface, cluster health monitor, and more.
  • Raptor is an ensemble of various frameworks and instruments that help segment data to a fine level allowing for spontaneous search on massive data sets. This is achieved with segmentation at two levels and a intelligent data policy manager.With adaptive data segmentation, partitioning ,column level hierarchical indexes and bloom filters, Allows ad hoc spontaneous querying and searching with real-time and near-real-time analytics capabilities.Intelligent optimization algorithms for switching execution of queries & search within Raptor, HBase or MapReducePolicy manager allows optimized indexing and automatic sharding, based on time and bucket sizes.Level 1 segmentationRaptor digester framework, segmentation based on high level parametersLevel 2 segmentationRaptor block analyzer framework, fine level segmentation ThorCacheAn LRU/MRU based spill over cache system, facilitates the segmentation frameworks
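A ThorCache-style LRU spill-over cache can be sketched as follows. This is a minimal illustration of the idea (evicted entries spill to disk rather than being dropped), not Raptor's implementation; the class name, file layout, and pickle serialization are assumptions for the sketch.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class SpillOverLRUCache:
    """LRU cache that spills evicted entries to disk instead of dropping them."""

    def __init__(self, capacity, spill_dir=None):
        self.capacity = capacity
        self.hot = OrderedDict()  # in-memory tier, most recently used last
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="spillcache_")

    def _spill_path(self, key):
        return os.path.join(self.spill_dir, f"{key}.idx")

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        if len(self.hot) > self.capacity:
            old_key, old_val = self.hot.popitem(last=False)   # evict LRU entry
            with open(self._spill_path(old_key), "wb") as f:  # spill to disk
                pickle.dump(old_val, f)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)  # refresh recency
            return self.hot[key]
        path = self._spill_path(key)
        if os.path.exists(path):       # fault the entry back in from disk
            with open(path, "rb") as f:
                value = pickle.load(f)
            os.remove(path)
            self.put(key, value)       # promote back to the hot tier
            return value
        return None
```

The point of the design is that per-block indexes are expensive to rebuild, so spilling cold ones to disk and faulting them back in is cheaper than re-indexing.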
  • Level 1 segmentation is based on high-level parameters: partition columns, row hashes, business-specific data segmentation (by enhancing the generated digester code to meet business needs), dimensioning/cubes, and adaptive data transformation. Data is loaded into Raptor in the following ways: direct load from Hive, using the implicit load function in HQL; via the Digester, an auto-generated code framework specific to the table; or via data load adaptors, custom plug-ins that capture data from various sources such as databases, internet data streams, log files, etc. Table partitions store data based on columns, segmenting data by demographics, groups, domains, industries, etc. The Digester crunches down incoming data/streams and feeds the processed records into consistently hashed storage buckets. Storage buckets are appendable storage files in HDFS; each bucket corresponds to a specific hash key. Adaptive aggregation and transformation calculates data aggregates and vital stats per column as data is loaded into Raptor, and transforms data to an easier-to-use format; this is used for query optimization during query execution. Data cubes and dimensioning: based on user hints, specific dimensions and cubes are generated for every table.
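The "consistently hashed storage buckets" the Digester feeds can be illustrated with a textbook consistent-hash ring. This is a generic sketch of the technique, not Raptor's digester code; the class name, virtual-node count, and MD5 choice are assumptions.

```python
import bisect
import hashlib

class BucketRing:
    """Consistent-hash ring mapping record keys to a fixed set of storage buckets."""

    def __init__(self, buckets, vnodes=64):
        self._ring = []                      # sorted list of (ring point, bucket)
        for b in buckets:
            for v in range(vnodes):          # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{b}#{v}"), b))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def bucket_for(self, record_key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._points, self._hash(record_key)) % len(self._ring)
        return self._ring[i][1]
```

Because the mapping is a pure function of the key, every digester instance routes the same record to the same appendable HDFS bucket, and adding a bucket only remaps a fraction of keys.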
  • Level 2 segmentation applies fine-tuned segmentation techniques: single/hierarchical indexing, column bloom filters, adaptive column indexing, and meta column information. The indexing engine indexes the records of every data block into a B+ tree/Lucene index map, which is in turn cached into the LRU-based ViteCache (spill-over cache); it does adaptive indexing, where unused column indexes are evicted and the most frequently accessed columns get indexed. A bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set; in our case we use it to test whether the block/index map contains the partial search keys. ThorCache is an LRU/MRU-based low-latency index store that holds the per-block B+ tree/Lucene indices (it allows to-disk spill-over for effective memory management).
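The per-column bloom filter is standard enough to sketch directly. This is a generic bloom filter, assuming nothing about Raptor's bit sizing or hash family; the class name and SHA-256-derived hash positions are illustrative choices.

```python
import hashlib

class BloomFilter:
    """Space-efficient membership test: no false negatives, tunable false positives."""

    def __init__(self, num_bits=1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive k independent bit positions by salting the hash input.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

This is exactly why a per-block column bloom lets the searcher skip a block outright: a negative answer is guaranteed, so only "possibly present" blocks need their index tree consulted.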
  • Raptor Meta Store: an RDBMS-based database that stores all the metadata required by the various components in the system.
  • Quantum Processor: a very lightweight processing framework written within the Hadoop layer that returns results in a fraction of the time of traditional MapReduce. It is used for real-time queries, when the level 1 segmentation search returns positive/negative and the level 2 segmentation search returns positive.
  • Business-specific code generation framework: generates all the infrastructure components, such as the indexer, digester, table schema, dimensions, etc., based on a user-defined XML schema.
  • Policy Manager: organizes data in a time-sorted way, automatically moving older bucketized data to sparsely indexed/managed buckets while retaining newer data in highly organized/indexed buckets, based on the defined policy. This allows for efficient resource utilization.
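The age-based tiering the Policy Manager performs can be sketched as a small pure function. The tier names echo the primary/secondary/archive shards shown later in the deck, but the age thresholds and function names here are hypothetical, chosen only to illustrate the mechanism.

```python
import time

# Hypothetical policy: bucket age (seconds) at which data moves between tiers.
POLICY = [
    ("primary",   7 * 86400),     # fully segmented/indexed
    ("secondary", 30 * 86400),    # partially segmented
    ("archive",   float("inf")),  # segments merged, indexes dropped
]

def tier_for(bucket_created_at, now=None):
    """Return the storage tier a bucket belongs to under POLICY."""
    age = (now or time.time()) - bucket_created_at
    for tier, max_age in POLICY:
        if age < max_age:
            return tier
    return POLICY[-1][0]

def plan_moves(buckets, now=None):
    """Yield (bucket_id, target_tier) for buckets sitting in the wrong tier.

    `buckets` is an iterable of (bucket_id, created_at, current_tier)."""
    for bucket_id, created_at, current_tier in buckets:
        target = tier_for(created_at, now)
        if target != current_tier:
            yield bucket_id, target
```

A background task running `plan_moves` periodically is all the "automatic" part requires: new data stays heavily indexed, old data degrades to cheap storage.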
  • Intelligent packet compression: based on network congestion, Raptor switches to a resilient compression mode where packets are compressed using a CPU-efficient algorithm. Scheduling of jobs is done using NameNode operation maps: the NameNode maintains information on all the operations (reads, writes, jobs) currently being performed on the cluster, which ensures an even distribution of operations across the cluster. All computation-intensive operations are scheduled on nodes that have the capability to execute such jobs on GPUs. Adaptive rebalancing increases the replication factor of frequently used data, ensures even availability of data, and moves data closer to frequently accessed client sub-cluster nodes. The interactive user console provides an interface for the user to manage the cluster; manage tasks, indexes, shards, filters, etc. for each table; set up policies around data; and archive obsolete data.
  • Architecture: Raptor over Hadoop/HBase. Raptor partitions incoming transactions and executes classifiers over GPUs for each partition; the flagging cluster uses meta-classifiers to flag transactions, and the notification cluster sends out notifications. Raptor also implements phasing out of data and classifier pruning. The Hive interface is not used except for periodic fraud report generation. How does it work? Divide data into subsets/buckets/partitions (where Raptor excels); apply data mining techniques to generate classifiers in parallel; combine the resultant base models to generate a meta-classifier. The base classifiers execute in parallel while the meta-classifier combines their results. Pruning removes redundant classifiers from the ensemble (very old data). This is a large-scale data mining technique using a scalable black-box approach with a cost-reduction-based model; the learning algorithms used are C4.5, CART, Ripper, Bayes, and ID3.
  • Smart customer care solution: allows CSR representatives to query terabytes of real-time business data and handle customer queries on complaints and events. Financial fraud analytics: retrospective analysis of incoming financial logs, ranking entities based on suspicious behavior and generating events on suspected entry triggers. Media usage log analytics ("targeted marketing"): the analytics system adds short-term and long-term scores to marketing content based on media-content usage statistics; this information helps target end users with the right marketing content. Predictive trading: a distributed GPU-based event-processing platform to significantly speed up algorithmic trading computations, namely market event parsing and market event matching against strategies. Strategies are distributed in disjoint clusters to enable highly parallelizable event matching; in each cluster, strategies are stored as contiguous blocks of memory to enable fast sequential access and improve memory locality. Computation-intensive jobs using GPUs: just as in traditional parallel computing, distributed GPU processing scales better than processing on shared systems because it does not overwhelm shared system resources.
  • Transcript

    • 1. Raptor: Real-time Analytics on Hadoop. Soundar Velu, Product Architect, Advanced Technology, SunGard. Anil Kumar Batchu, Research Engineer, Applied Research, SunGard. Proprietary and Confidential. Not to be distributed or reproduced without permission. www.sungard.com/globalservices
    • 2. Who We Are, Who Am I, What We Do? Who we are & what we do: a Fortune 500 company; a financial services firm; we provide software & consulting services across the industry; we have been exploring impacts of the Big Data approach for the last 2+ years. Who am I? Part of the Applied Research & Consulting group based out of Bangalore, India; we focus on the latest technology trends.
    • 3. Agenda: Financial Services and Data Problems; Raptor Architecture Overview; Raptor Components; Benchmarks; Future Enhancements.
    • 4. Financial Services & Typical "Data Intensive" Problems. Wholesale markets: portfolio construction, price forecasting, risk calculation, compliance surveillance, batch processing. Retail markets: fraud detection, targeted marketing, demand forecasting.
    • 5. Implementation Constraints in Financial Services. RDBMS centricity: constrained semantics, rigidity and specificity. Legacy burden: architectures, languages, tools, systems, skills. Data silos: governance ("You don't own that data, I do"). Cost of change: evolution is the only answer.
    • 6. What Did We Need? A generic, reliable, and cost-effective analytics solution with a wide range of application areas; query execution and analytics at soft real-time windows (acceptable and consistent latencies); minimum customization, seamless integration, and ease of use; policy around data storage and processing; adaptive segmentation algorithms for optimized data search (indexing, partitioning, and filtering).
    • 7. Some Limitations with Traditional Hadoop. Jobs are executed in a brute-force fashion, causing complete scans of files for every single query. Long warm-up time for jobs; performs poorly on relatively smaller data sets. Scheduling imbalance in HDFS operations and job execution. Limitations on the kinds of jobs that can be executed with MapReduce (non-equality joins not supported). Open bugs and lacking features; memory-management bottlenecks. Only for batch-mode applications; does not fit real-time analytics scenarios.
    • 8. Raptor Application Stack. Flume, Scribe, Sqoop, and Hue feed into Raptor and Oozie; below sit Hive/Pig, HBase, MapReduce, and the Quantum Processor, all on top of HDFS, with ZooKeeper coordinating alongside the stack.
    • 9. Raptor Architecture (architecture diagram).
    • 10. Raptor Digester - Level 1 Segmentation. Databases, data streams, and logs/files enter through data load adaptors and a data cleanser into the Digester, invoked via Raptor Client callbacks. A table-specific data transformer and partitioner partitions data based on demographics, customers, IP address, or composite columns, and stores the partitioned data into the respective storage buckets (file buckets) in HDFS.
    • 11. Raptor Block Analyzer – Level 2 Segmentation. On a block write, the DataXceiver/BlockReceiver offers the block ID for asynchronous indexing; the IndexFactory returns an indexer for the block via the data-block index queue. Every block is asynchronously analyzed, and the resulting column-level bloom filters, block index map/index tree, and column-level meta information are stored in the ThorCache block index cache. All this happens within the DataNode context.
    • 12. Raptor Client Framework. The Raptor Client is similar to DFSClient or JobClient; it stands as the primary interface into Raptor. Diagram labels, grouped by component: Block Analyzer (get indexer for block; mirror indexer to replica nodes); Code Generator (deploy generated code / generated infrastructure patch); NameNode (get block info for file; get file stats); Raptor Executor (execute internal task; clean up task; get result segments); Digester (get digester for table; get analyzer for table); DataNode (invoke Quantum Processor services; manage indexes; execute jobs; get block analyzers); Raptor Optimizer (re-index block; re-balance blocks; get adaptive metrics); Adaptive Engine (get indexer for block/table; get adaptive metrics for table); Raptor-Hive (get Hive metadata for table; get Raptor metadata for table; get indexer/digester for table; deploy generated infrastructure patch); Policy Manager (auto-shard buckets; drop indexes for buckets; archive buckets).
    • 13. Getting Results Out – End-to-End Flow. Client applications create a Hive JDBC connection (over the Thrift client) to the Raptor-HiveServer. The client fires HQL queries over the connection; the query is executed by Raptor, and a Hive ResultSet is returned as the result. This is the interface used by Raptor-Hive clients to access data in HDFS, e.g. "Select Name, CNO, TxID, floor(TxAmt) from Credit_Tx order by CNO". A fetch task executes, writes results, and fetches them back. Non-HQL queries execute directly through the Raptor Client and Quantum Processor against HDFS.
    • 14. Raptor-Hive Processing Engine. A JDBC client submits a query to the HiveServer, which compiles and optimizes it (Raptor Optimizer, Hive MetaStore, table block analyzers from the DataNodes) into an execution plan. The Raptor Executor dispatches execution through the Raptor Client to MapReduce, Raptor HBase, or the Quantum Processor; execution metrics are published to the Adaptive Engine and the Raptor MetaStore.
    • 15. Quantum Processor Framework. QuantumNode is a service similar to the DataXceiver on a DataNode. All Raptor operations are executed by the Quantum Processor framework, invoked from the Hive executor via the RaptorClient/RaptorProtocolHelper. QuantumProcessor operations: GetNodeStats, SearchBlock, ReadResultSegment, ReadBoundaryRecord, ManageIndexes, ExecuteInternalTask, ExecuteExternalTask, GetFileAnalyzers.
    • 16. Raptor Block Searcher/Processor. The Block Searcher analyzes the block indexes and bloom filters (from the ThorCache block index cache) and extracts the resultant indexed records via an index-based offset reader; if no block index is available, it falls back to a complete sequential block scan and processes all records. Matching records flow through the column-level bloom, column meta data, and search index tree to the index block reader, then through the Quantum Operator and Result Writer against HDFS.
    • 17. Quantum Operator. The Quantum Operator takes block records as input from the Block Searcher, performs the required field inspections, and applies UDFs, joins/merges, and aggregation. If the child is a reduce operator, results are written to k-segmented local files (and reduce operators read the k remote segment files); otherwise results are written directly to HDFS, based on the operator type. Quantum Operators include SelectOperator, JoinOperator, SortByOperator, GroupByOperator, GPUOperator, etc.
    • 18. Adaptive Indexing. The Raptor adaptive engine collects metrics from executed Hive queries: the columns and tables involved in the query, the types of aggregation executed on columns, etc. The asynchronous adaptive indexer and the adaptive index scheduler use these usage metrics (kept in Raptor metadata) to schedule re-analysis/re-indexing of blocks across the DataNodes' ThorIndex caches.
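The scheduling decision described on this slide (index the hot columns, evict the cold ones) can be sketched as a usage counter plus a planning step. This is a toy model of the idea, not Raptor's scheduler; the class name, column names, and budget parameter are hypothetical.

```python
from collections import Counter

class AdaptiveIndexScheduler:
    """Track per-column query usage and decide which columns deserve an index."""

    def __init__(self, max_indexed_columns=3):
        self.max_indexed = max_indexed_columns  # index budget per table/block
        self.usage = Counter()

    def record_query(self, columns):
        # Called with the columns referenced by each executed query.
        self.usage.update(columns)

    def plan(self, currently_indexed):
        """Return (to_build, to_evict): hottest columns in, cold indexes out."""
        hottest = {c for c, _ in self.usage.most_common(self.max_indexed)}
        to_build = sorted(hottest - set(currently_indexed))
        to_evict = sorted(set(currently_indexed) - hottest)
        return to_build, to_evict
```

Running `plan` asynchronously against the usage metrics mirrors the slide's flow: query execution feeds the counters, and re-indexing work is scheduled from them rather than done on the query path.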
    • 19. Code Generation and Data Definition Framework. The user supplies the table schema in an XML format. Based on the XML table schema and hints, the Raptor code generator creates metadata in the Hive metadata store, HBase, and the Raptor metadata store. It also generates table-specific indexers, column analyzers, bloom filter classes, and the digester, and deploys the generated code to the Raptor-HiveServer and DataNodes.
    • 20. Raptor Policy Manager – Time-Based Shard. Time-based auto sharding, as per the policy defined by the user; the Policy Manager tracks state in Raptor metadata. Table buckets move over time from the completely segmented primary time shard, through partially segmented secondary time shards, to the archive store, where all segments are merged and indexes are dropped.
    • 21. Infrastructure Enhancements. Adaptive compression based on network congestion statistics. Scheduling of jobs via the Raptor client (via Hive) in conjunction with an enhanced NameNode block policy. Computation-intensive jobs scheduled on GPU-enabled nodes. Object pools across the Hadoop-Raptor ecosystem. A hand-shake mechanism between clients and DataNodes to avoid imminent operation failures. An interactive user console for managing the cluster, tasks, data policy, filters, etc.
    • 22. Benchmark. Raptor benchmark carried out on a 5-node cluster with commodity hardware {Core2 Duo, 4GB RAM, 1TB storage, 100Mbps NIC, 64MB block size, replication factor 3, network compression enabled}. Table with 13 columns and mixed types; sample data generated with node.js with moderate entropy (table with 1 group index {3 columns}, 0 column bloom filters, 1 bucket).

      Operation (5-node cluster)       |        Raptor        |      Map/Reduce
      Table size                       | 256MB   1GB    10GB  | 256MB   1GB    10GB
      Load time                        | 4.8s    22s    4.2m  | 54.6s   3.5m   36.0m
      Simple select without predicate  | 6.9s    22s    3.2m  | 21.6s   50.5s  9.5m
      Select with complex predicate    | 0.8s    1.7s   0.9m  | 9.2s    32.2s  4.1m
      Select with order by             | 1.6s    6.5s   1.1m  | 41.0s   3.2m   5.6m
      Select with group by             | 0.6s    1.2s   0.8m  | 61.3s   1.5m   4.0m
      Simple join                      | 2.9s    5.7s   1.7m  | 1.2m    3.4m   9.2m
      User-defined row-level function  | 0.6s    1.1s   0.9m  | 13.2s   49s    7.3m
    • 23. Case Study: Credit Card Fraud Detection. Scribe clients stream transactions to a Scribe server; the Raptor Digester (with its cache) segments the data into subsets/buckets/partitions in Raptor-Hadoop. Data mining techniques generate classifiers in parallel (MapReduce / Quantum Processor); the resultant base models are combined into a meta-classifier, and the Raptor classifier uses meta-classifiers to flag transactions. The flagging cluster hands flagged transactions to the notifications cluster, which sends out notifications. The Raptor Policy Manager prunes redundant classifiers from the ensemble, and Raptor-Hive handles periodic fraud report generation, publishing analytics results and reports to PostgreSQL for the credit card fraud application.
    • 24. Other Raptor Use Cases: smart customer care solution; financial fraud analytics; media usage log analytics; computation-intensive jobs using GPUs; predictive trading.
    • 25. Roadmap: merge Raptor into Next Generation MapReduce; dimensions and cubes (MRCubes); cloud-ready Raptor; roles and security; job failover management; rules and triggers; planning to open source; starter kits/examples of various use cases.
    • 26. Benefits: query responses at soft real-time windows; a code generation framework for table-specific Raptor infrastructure code; zero downtime, with hot deployment of generated infrastructure code; seamless integration into existing infrastructure with multiple ingress options; automatic time-based sharding and data archival; distribute and execute non-MR jobs on the Hadoop cluster.
    • 27. Contacts. Soundar Velu, Product Architect, Soundararajan.velu@sungard.com, Twitter: @greyquest. Anil Batchu, Research Engineer, AnilKumar.B@sungard.com.
    • 28. References: Optimizing Distributed Joins with Bloom Filters (www.l3s.de/web/upload/documents/1/analysis.pdf); Apache Hadoop Goes Realtime at Facebook (borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf); Data Mining with MapReduce: Graph and Tensor Algorithms with MR (www.ml.cmu.edu/research/dap-papers/tsourakakisdap.pdf); Distributed Cube Materialization on Holistic Measures (www.eecs.umich.edu/~congy/work/icde11a.pdf); HadoopDB in Action: Building Real World Applications (www.cs.yale.edu/homes/dna/papers/hadoopdb-demo.pdf); TeraByte Sort on Apache Hadoop (sortbenchmark.org/YahooHadoop.pdf); Mahout in Action (www.amazon.com/Mahout-Action-Sean-Owen/dp/1935182684); Web-Scale K-Means Clustering (www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf); Hive – A Petabyte Scale Data Warehouse Using Hadoop (infolab.stanford.edu/~ragho/hive-icde2010.pdf).
    • 29. CONFIDENTIALITY STATEMENT. Copyright © 2011 by SunGard Data Systems (or its subsidiaries, "SunGard"). All rights reserved. No parts of this document may be reproduced, transmitted, or stored electronically without SunGard's prior written permission. This document contains SunGard's confidential or proprietary information. By accepting this document, you agree that: (A)(1) if a pre-existing contract containing disclosure and use restrictions exists between your company and SunGard, you and your company will use this information subject to the terms of the pre-existing contract; or (2) if no such pre-existing contract exists, you and your company agree to protect this information and not reproduce or disclose it in any way; and (B) SunGard makes no warranties, express or implied, in this document, and SunGard shall not be liable for damages of any kind arising out of use of this document.
