Raptor combines Hadoop and HBase with machine learning models for adaptive data segmentation, partitioning, bucketing, and filtering to enable ad-hoc queries and real-time analytics. Raptor has intelligent optimization algorithms that switch query execution between HBase and MapReduce. Raptor can create per-block dynamic bloom filters for adaptive filtering. A policy manager allows optimized indexing and auto-sharding. This session will address how Raptor has been used in prototype systems for predictive trading, time-series analytics, smart customer care solutions, and a generalized analytics solution that can be hosted on the cloud.
Hadoop World 2011: Raptor: Real-time Analytics on Hadoop - Soundararajan Velu - Sungard
1. Raptor: Real-time Analytics on Hadoop
Soundar Velu
Product Architect, Advanced Technology, SunGard
Anil Kumar Batchu
Research Engineer, Applied Research, SunGard
2. Who We Are, Who Am I, What We Do?
Who We Are & What We Do?
Fortune 500 company
Financial services firm
Provide software & consulting services across the industry
Exploring impacts of the Big Data approach for the last 2+ years
Who Am I?
Part of the Applied Research & Consulting group based out of Bangalore, India
We focus on the latest technology trends
3. Agenda
Financial Services and Data Problems
Raptor Architecture Overview
Raptor Components
Benchmarks
Future Enhancements
4. Financial Services & Typical "Data-Intensive" Problems
Wholesale Markets
Portfolio Construction
Price Forecasting
Risk Calculation
Compliance Surveillance
Batch Processing
Retail Markets
Fraud Detection
Targeted Marketing
Demand Forecasting
5. Implementation Constraints in Financial Services
Legacy Burden: architectures, languages, tools, systems, skills
RDBMS Centricity: constrained semantics, rigidity and specificity
Data Silos: governance ("You don't own that data, I do")
Cost of Change: evolution is the only answer
6. What Did We Need?
A generic, reliable, and cost-effective analytics solution with a wide range of application areas
Query execution and analytics in soft real-time windows (acceptable and consistent latencies)
Minimal customization, seamless integration, and ease of use
Policy around data storage and processing
Adaptive segmentation algorithms for optimized data search (indexing, partitioning, and filtering)
7. Some Limitations of Traditional Hadoop
Jobs are executed in a brute-force fashion, causing complete scans of files for every single query
Long warm-up time for jobs; performs poorly on relatively small data sets
Scheduling imbalance between HDFS operations and job execution
Limitations on the kinds of jobs that can be executed with MapReduce (non-equality joins are not supported)
Open bugs and missing features; memory management bottlenecks
Batch-mode applications only; does not fit real-time analytics scenarios
8. Raptor Application Stack
[Stack diagram: Flume, Scribe, and Sqoop for ingestion at the top; Hue, Raptor, and Oozie below; then Hive/Pig; then HBase, MapReduce, and the Quantum Processor; HDFS at the base; ZooKeeper runs alongside the whole stack.]
10. Raptor Digester - Level 1 Segmentation
[Flow diagram: Database, Data Stream, and Logs/Files sources feed data load adaptors, then a Data Cleanser, then the Digester (driven via callbacks from the Raptor Client), then a Data Transformer, and finally the storage buckets in HDFS.]
The table-specific data transformer and partitioner partitions data based on demographics, customers, IP addresses, or composite columns, and stores the partitioned data into the respective file buckets in HDFS.
11. Raptor Block Analyzer - Level 2 Segmentation
[Flow diagram: the Raptor client writes a block through the DataXceiver/BlockReceiver path, which offers the block ID to an index queue for asynchronous indexing; the IndexFactory supplies the indexer for the block, and the Block Analyzer builds the DataBlock index and bloom filters.]
Every block is asynchronously analyzed, and the resulting column-level bloom filters, block index map, index tree, and column-level meta information are stored in ThorCache. All of this happens within the DataNode context.
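To make the per-block, column-level bloom filter idea concrete, here is a minimal sketch using Hadoop's built-in org.apache.hadoop.util.bloom package. Raptor's actual filter classes are generated per table, so the sizing and key values below are illustrative assumptions.

```java
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class ColumnBloomSketch {
    public static void main(String[] args) {
        // One filter per (block, column); vector size and hash count are
        // illustrative and would normally be derived from the block's row count.
        BloomFilter cnoFilter = new BloomFilter(80000, 5, Hash.MURMUR_HASH);

        // During asynchronous block analysis, every value of the column is added.
        cnoFilter.add(new Key("4532015122790001".getBytes()));
        cnoFilter.add(new Key("4024710199130002".getBytes()));

        // At query time, a negative membership test lets the searcher skip the
        // block entirely; a positive test may still be a false positive.
        System.out.println(cnoFilter.membershipTest(new Key("4532015122790001".getBytes())));
    }
}
```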
12. Raptor Client Framework
The Raptor Client is similar to DFSClient or JobClient; it stands as the primary interface into Raptor.
[Diagram: the Raptor Client coordinates the following components and responsibilities.]
Block Analyzer: get the indexer for a block; mirror the indexer to replica nodes
Code Generator / Raptor Executor: deploy generated code
NameNode: get block info for a file; get file stats
DataNode: invoke Quantum Processor services (execute internal tasks, clean up tasks, get result segments, manage indexes, execute jobs, get block analyzers, deploy generated infrastructure patches)
Digester: get the digester for a table
Raptor Optimizer: re-index blocks; re-balance blocks; get the analyzer, indexer, and adaptive metrics for a table
Adaptive Engine: get the indexer for a block/table; get adaptive metrics for a table
Policy Manager: auto-shard buckets; drop indexes for buckets; archive buckets
Raptor-Hive: get Hive metadata for a table; get Raptor metadata for a table; get the digester for a table; deploy generated infrastructure patches
13. Getting Results Out - End-to-End Flow
[Flow diagram: the client application executes an HQL statement over a Hive JDBC connection, which talks via Thrift clients to the Thrift Hive Server; the fetch task executes through the Raptor Client and Quantum Processor, which write and fetch results from HDFS, and a Hive ResultSet is returned to the client.]
Client applications create a Hive JDBC connection to the Raptor-HiveServer and fire HQL queries over it; the query is executed by Raptor and a Hive ResultSet is returned as the result. This is the interface used by Raptor-Hive clients to access data in HDFS, e.g. "Select Name, CNO, TxID, floor(TxAmt) from Credit_Tx order by CNO". Non-HQL queries can be executed directly with the Raptor Client.
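A minimal sketch of that client path using the 2011-era Hive JDBC driver; the Raptor-HiveServer host name and port are assumptions, and the query is the example from the slide.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RaptorHiveClientSketch {
    public static void main(String[] args) throws Exception {
        // Hive 0.x JDBC driver; the Raptor-HiveServer host/port are assumed.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://raptor-hiveserver:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(
                "select Name, CNO, TxID, floor(TxAmt) from Credit_Tx order by CNO");
        while (rs.next()) {
            // Columns read by position, as early Hive JDBC supported.
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}
```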
14. Raptor-Hive Processing Engine
[Flow diagram: a JDBC client connects to the HiveServer, which compiles and optimizes the query (with the Raptor Optimizer) into an execution plan using both the Hive MetaStore and the Raptor MetaStore. The Raptor Executor runs the plan through the Raptor Client against MapReduce, the Raptor Quantum Processor, or HBase, fetching table block analyzers from the DataNodes; execution metrics are published to the Adaptive Engine.]
15. Quantum Processor Framework
QuantumNode is a service similar to the DataXceiver on a DataNode. All Raptor operations, invoked from the Hive executor through the RaptorClient and RaptorProtocolHelper, are executed by the Quantum Processor framework.
QuantumProcessor operations: GetNodeStats, SearchBlock, ReadResultSegment, ManageIndexes, ReadBoundaryRecord, ExecuteInternalTask, GetFileAnalyzers, ExecuteExternalTask.
16. Raptor Block Searcher/Processor
[Flow diagram: the Hive executor calls the Quantum Processor via the RaptorClient; the Block Searcher consults the Thor block index cache (column-level blooms, column meta data, search index tree). If a block index/bloom is available, the index block reader extracts the matching records; otherwise a sequential block reader processes all records. Result records flow through the Quantum Operator to the result writer and HDFS.]
The Block Searcher analyzes the block indexes and bloom filters and extracts the matching indexed records via an index-based offset reader; if no indexes are available, it falls back to a complete block scan.
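A hypothetical sketch of that decision flow; all type and method names here (ThorCache, BlockIndex, and the reader methods) are illustrative, not Raptor's actual API.

```java
import java.util.Collections;
import java.util.List;

public class BlockSearcherSketch {

    interface ColumnBloom { boolean mightContain(byte[] key); }

    interface BlockIndex {
        ColumnBloom bloomFor(String column);
        List<Long> offsetsFor(String column, byte[] key);
    }

    interface ThorCache { BlockIndex lookup(long blockId); }

    List<String> search(long blockId, String column, byte[] key, ThorCache cache) {
        BlockIndex index = cache.lookup(blockId);
        if (index == null) {
            // No per-block index available: complete block scan.
            return sequentialScan(blockId, column, key);
        }
        if (!index.bloomFor(column).mightContain(key)) {
            // Definite miss: skip the block without touching its data.
            return Collections.emptyList();
        }
        // Possible hit: read only the records at the indexed offsets.
        return readAtOffsets(blockId, index.offsetsFor(column, key));
    }

    List<String> sequentialScan(long blockId, String column, byte[] key) {
        return Collections.emptyList(); // stand-in for a full block scan
    }

    List<String> readAtOffsets(long blockId, List<Long> offsets) {
        return Collections.emptyList(); // stand-in for index-based offset reads
    }
}
```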
17. Quantum Operator
The Quantum Operator takes block records as input, performs the required field inspections, and applies UDFs and aggregation. The resulting records are written either to local segment files or directly to HDFS, based on the operator type.
[Flow diagram: the Block Searcher offers result rows to the Quantum Operator, which inspects rows, applies joins/merges and UDFs, reads k remote segment files when it is a reduce operator, and writes results to k-segmented local files when its child is a reduce operator.]
Quantum Operators: SelectOperator, JoinOperator, SortByOperator, GroupByOperator, GPUOperator, etc.
18. Adaptive Indexing
[Flow diagram: Hive query execution feeds collected metrics to the Raptor Adaptive Engine, whose asynchronous adaptive indexer and adaptive index scheduler update Raptor metadata and re-analyze/re-index blocks across the DataNodes and the ThorIndex cache.]
The Raptor adaptive engine collects metrics from the executed queries, including the columns and tables involved and the types of aggregation executed on the columns. The adaptive index scheduler uses these usage/query metrics to schedule re-indexing of the blocks.
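An illustrative sketch of such usage-driven scheduling: count how often each column appears in query predicates and queue the hottest columns for re-indexing. The class, threshold, and method names are assumptions, not Raptor's actual adaptive engine.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AdaptiveIndexSchedulerSketch {
    private final Map<String, Long> columnHits = new ConcurrentHashMap<>();
    private final long reindexThreshold;

    public AdaptiveIndexSchedulerSketch(long reindexThreshold) {
        this.reindexThreshold = reindexThreshold;
    }

    // Called by the query layer for every column referenced in a query.
    public void recordColumnUse(String table, String column) {
        String key = table + "." + column;
        long hits = columnHits.merge(key, 1L, Long::sum);
        if (hits == reindexThreshold) {
            scheduleReindex(key); // would re-index this hot column asynchronously
        }
    }

    private void scheduleReindex(String tableColumn) {
        System.out.println("scheduling re-index of " + tableColumn);
    }
}
```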
19. Code Generation and Data Definition Framework
The user supplies the table schema in an XML format. Based on the XML table schema and hints, the Raptor code generator creates metadata in the Hive metadata store, HBase, and the Raptor metadata store. It also generates table-specific indexer, column analyzer, and bloom filter classes.
[Flow diagram: the Raptor Code Generator creates Hive table definitions on the HiveServer and Raptor table metadata for the Adaptive Engine, then deploys the generated indexers, bloom filters, column analyzers, and digester to the Raptor-HiveServer and DataNodes via the Raptor Client.]
20. Raptor Policy Manager - Time-Based Sharding
Time-based auto-sharding, as per the policy defined by the user: the Policy Manager moves table buckets from highly segmented time shards to a sparsely segmented archive store over time.
[Diagram: buckets in the completely segmented primary time shard move to partially segmented secondary time shards and, finally, to an archive store where all segments are merged and the indexes are dropped.]
21. Infrastructure Enhancements
Adaptive compression based on network congestion statistics
Scheduling of jobs via the Raptor client (via Hive) in conjunction with an enhanced NameNode block policy
Computation-intensive jobs scheduled on GPU-enabled nodes
Object pools across the Hadoop-Raptor ecosystem
Hand-shake mechanism between clients and DataNodes to avoid imminent operation failures
Interactive user console for managing the cluster, tasks, data policy, filters, etc.
22. Benchmark Raptor
Benchmark carried out on a 5-node cluster with commodity hardware (Core2 Duo, 4GB RAM, 1TB storage, 100Mbps NIC, 64MB block size, replication factor 3, network compression enabled).
Table with 13 columns of mixed types; sample data generated with node.js with moderate entropy (table with one group index {3 columns}, no column bloom filters, one bucket).

Operation / Table Size            Raptor 256MB  Raptor 1GB  Raptor 10GB  MapReduce 256MB  MapReduce 1GB  MapReduce 10GB
Load time                         4.8s          22s         4.2m         54.6s            3.5m           36.0m
Simple select without predicate   6.9s          22s         3.2m         21.6s            50.5s          9.5m
Select with complex predicate     0.8s          1.7s        0.9m         9.2s             32.2s          4.1m
Select with order by              1.6s          6.5s        1.1m         41.0s            3.2m           5.6m
Select with group by              0.6s          1.2s        0.8m         61.3s            1.5m           4.0m
Simple join                       2.9s          5.7s        1.7m         1.2m             3.4m           9.2m
User-defined row-level function   0.6s          1.1s        0.9m         13.2s            49s            7.3m
23. Case Study: Credit Card Fraud Detection
[Architecture diagram: Scribe clients feed a Scribe server, whose output the Raptor Digester segments into subsets/buckets/partitions. On Raptor-Hadoop (MapReduce plus the Quantum Processor), data mining techniques generate classifiers in parallel; the Raptor Policy Manager prunes redundant classifiers from the ensemble. The resultant base models are combined into a meta-classifier. A flagging cluster uses the meta-classifiers to flag transactions, and a notifications cluster sends out notifications to the credit card fraud application (PG SQL). Raptor-Hive is used for periodic fraud report generation and for analytics results and reports.]
24. Other Raptor Use Cases
Smart customer care solution
Financial fraud analytics
Media usage log analytics
Computation intensive jobs using GPUs
Predictive trading
25. Roadmap
Merge Raptor into Next Generation MapReduce
Dimensions and Cubes (MRCubes)
Cloud ready Raptor
Roles and security
Job failover management
Rules and triggers
Planning to open source
Starter Kits/Examples of various Use Cases
26. Benefits
Query responses in soft real-time windows
Code generation framework for table-specific Raptor infrastructure code
Zero downtime, with hot deployment of generated infrastructure code
Seamless integration into existing infrastructure, with multiple ingress options
Automatic time-based sharding and data archival
Distribute and execute non-MR jobs on the Hadoop cluster
27. Contacts
Soundar Velu, Product Architect
Soundararajan.velu@sungard.com (Twitter: @greyquest)
Anil Batchu, Research Engineer
AnilKumar.B@sungard.com
28. References
Optimizing Distributed Joins with Bloom Filters
www.l3s.de/web/upload/documents/1/analysis.pdf
Apache Hadoop Goes Realtime at Facebook
borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf
Data Mining with MapReduce: Graph and Tensor Algorithms with MR
www.ml.cmu.edu/research/dap-papers/tsourakakisdap.pdf
Distributed Cube Materialization on Holistic Measures
www.eecs.umich.edu/~congy/work/icde11a.pdf
HadoopDB in Action: Building Real World Applications
www.cs.yale.edu/homes/dna/papers/hadoopdb-demo.pdf
TeraByte Sort on Apache Hadoop
sortbenchmark.org/YahooHadoop.pdf
Mahout in Action
www.amazon.com/Mahout-Action-Sean-Owen/dp/1935182684
Web-Scale K-Means Clustering
www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
Hive - A Petabyte Scale Data Warehouse Using Hadoop
infolab.stanford.edu/~ragho/hive-icde2010.pdf
Cloud Computing, Distributed Computing, Big Data, Low Latency, Mobile Computing, GPU/FPGA, NG UI Frameworks, Functional Languages, DSLs, etc…
Financial services firms are investigating next-generation technologies for data analytics. Data growth, particularly of unstructured data, poses a special challenge as the volume and diversity of data types outstrip the capabilities of older technologies such as relational databases. Predictive analysis of both internal and external data results in better, proactive management of a wide range of issues: credit and operational risk (e.g., fraud and reputational risk), customer loyalty and profitability, etc. Banks, mutual funds, and insurance companies need to transform their data into actionable information to comply with government and industry regulations, manage risk, increase revenue in a competitive economy, identify fraud, and improve efficiency.
Financial services firms worldwide are struggling with the challenges of Big Data analytics management: analyzing massive volumes of complex data from many sources in order to better serve and retain customers, find and effectively recruit new prospects, detect and prevent fraud, manage assets, and streamline operations. Financial services companies are taking advantage of Big Data analytics to increase profit margins, reduce risk, and solidify relationships with profitable customers. Now more than ever, a powerful, scalable, affordable, and simple Business Intelligence (BI) solution for business users is key to making the right decisions that drive overall success.
Data in Financial Services - Use It or Lose It! Financial services is a domain where what you do with data, and what you don't do with data, can have a dramatic impact on your success. The Big Data movement is attracting both investment and press: horizontally as an analytics infrastructure and vertically with domain-specific tools. In financial services, distributed computing and parallel processing have been employed for many years, so what is the utility of Big Data for these well-known problems? Is the size of the data being analyzed the real challenge, or is it the ability to evolve insight and analytic capabilities on what, in relative terms, may really be SmallData? This presentation will discuss how time, cost, and information asymmetry are critical competitive advantages in financial services, and how the current state of the art is in need of some help from the Big Data community. We will also discuss the hardware and software used for typical computation and data processing workloads in financial services, and why distributed computing is still reserved for only a few niche applications. We will conclude with a discussion of the barriers to adoption of new technology in financial services, why they exist, and how they might be overcome through greater academic and industry collaboration.
Talk outline:
Big Data: why the allure? why the marketing? how is innovation marketed? what is reality, what is hype?
Thesis: can Big Data solve our challenges in FS?
Financial services as a sector: wholesale problems (risk, forecasting, processing, compliance); retail problems (forecasting, processing, compliance)
FS data-intensive computing for: processing, alpha
FS data-intensive computing in two flavors: time, size
FS data-intensive: inter-machine (grids, Big Data?); intra-machine (GPUs, FPGAs)
FS data-intensive algos: graphs, linear time series and events, calculus
A second look at data-intensive computing: cloud economics, consumer successes, social media, digital exhaust, competition
Barriers to be overcome, and barriers created by the burden of data-constrained computing: legacy burden (see the constraints slide), data silos and governance, jailed semantics with relational DBs
One of the most promising technologies for dealing with this "big data" problem is the Apache Hadoop and MapReduce framework. However, current implementations of Hadoop lack the enterprise robustness demanded by financial services firms for wide-scale deployment of MapReduce applications.
Hadoop is a perfect fit for processing petabytes of data distributed over a large cluster of nodes. It was developed with a fault-tolerant design from the ground up, which makes it very resilient to hardware and software failures on nodes. The allocation of work to TaskTrackers is very simple: every TaskTracker has a number of available slots (such as "4 slots"), and every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability. If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.
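A small sketch of how that speculative-execution knob is set, using the old org.apache.hadoop.mapred API that was current in 2011; the job name is a placeholder.

```java
import org.apache.hadoop.mapred.JobConf;

public class SpeculativeExecutionSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SpeculativeExecutionSketch.class);
        conf.setJobName("raptor-style-batch-job"); // placeholder job name

        // With speculative execution enabled, a straggling task is re-run on
        // another TaskTracker and the first copy to finish wins.
        conf.setMapSpeculativeExecution(true);
        conf.setReduceSpeculativeExecution(true);

        // Equivalent configuration keys in mapred-site.xml (Hadoop 0.20/1.x):
        //   mapred.map.tasks.speculative.execution
        //   mapred.reduce.tasks.speculative.execution
    }
}
```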
Raptor sits within the Hadoop ecosystem as a specialized analytics platform; where Hive is a generalized analytics platform, Raptor is a transparent addition to the Hive system.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
Scribe is a server for aggregating log data streamed in real time from a large number of servers. It is designed to be scalable, extensible without client-side modification, and robust to failure of the network or any specific machine.
Oozie is an open-source workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable, and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig, and MapReduce). Oozie is a lot of things, but a workflow solution for off-Hadoop processing, or another query processing API a la Cascading, is not one of them.
Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities: imports individual tables or entire databases to files in HDFS; generates Java classes to allow you to interact with your imported data; provides the ability to import from SQL databases straight into your Hive data warehouse.
Hue is a browser-based desktop interface for interacting with Hadoop. It supports a file browser, job tracker interface, cluster health monitor, and more.
Raptor is an ensemble of frameworks and instruments that help segment data to a fine level, allowing spontaneous search on massive data sets. This is achieved with segmentation at two levels and an intelligent data policy manager. With adaptive data segmentation, partitioning, column-level hierarchical indexes, and bloom filters, Raptor allows ad hoc, spontaneous querying and searching with real-time and near-real-time analytics capabilities. Intelligent optimization algorithms switch execution of queries and searches between Raptor, HBase, and MapReduce. The policy manager allows optimized indexing and automatic sharding based on time and bucket sizes.
Level 1 segmentation: the Raptor digester framework, segmentation based on high-level parameters.
Level 2 segmentation: the Raptor block analyzer framework, fine-level segmentation.
ThorCache: an LRU/MRU-based spill-over cache system that facilitates the segmentation frameworks.
Level 1 segmentation is based on high-level parameters: partition columns, row hashes, business-specific data segmentation (by enhancing the generated digester code to meet business needs), dimensioning/cubes, and adaptive data transformation.
Data is loaded into Raptor in the following ways: direct load from Hive, using the implicit load function in HQL; via the Digester, an auto-generated code framework specific to the table; or via data load adaptors, custom data plug-ins that help capture data from various sources such as databases, internet data streams, log files, etc.
Table Partitions – this approach helps store data based on columns, segmenting data based on demographics, groups, domains, industries, etc.
Digester – this component helps crunch down incoming data/streams and feeds the processed records into consistently hashed storage buckets.
Storage Buckets – these are appendable storage files in HDFS; each bucket corresponds to a specific hash key.
Adaptive aggregation & transformation – calculates data aggregates and vital stats per column as data is loaded into Raptor, and transforms data into an easier-to-use format. This is used for query optimization during query execution.
Data Cubes & Dimensioning – based on user hints, specific dimensions and cubes are generated for every table.
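The notes above describe feeding records into consistently hashed storage buckets. A minimal sketch of one way such a bucket chooser could work; this is illustrative only, not Raptor's actual hashing code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent-hash bucket chooser: records with the same partition key always
// land in the same bucket file, and adding a bucket remaps only a small
// fraction of keys.
public class BucketRingSketch {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public void addBucket(String bucketFile) {
        // A few virtual nodes per bucket smooth out the distribution.
        for (int v = 0; v < 16; v++) {
            ring.put(hash(bucketFile + "#" + v), bucketFile);
        }
    }

    public String bucketFor(String partitionKey) {
        long h = hash(partitionKey);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] d = md.digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        BucketRingSketch ring = new BucketRingSketch();
        ring.addBucket("bucket-00");
        ring.addBucket("bucket-01");
        ring.addBucket("bucket-02");
        System.out.println(ring.bucketFor("customer-4711")); // stable assignment
    }
}
```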
Level 2 segmentation applies fine-tuned segmentation techniques: single/hierarchical indexing, column bloom filters, adaptive column indexing, and meta column information.
Indexing Engine – this component indexes the records of every data block into a B+ tree/Lucene index map, which is in turn cached in the LRU-based ViteCache (spill-over cache). It does adaptive indexing, where unused column indexes are evicted and the most frequently accessed columns get indexed.
Bloom Filter – a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. In our case, we use it to test whether the block/index map contains the partial search keys.
ThorCache – an LRU/MRU-based low-latency index store that holds the per-block B+ tree/Lucene indices (allows to-disk spill-over for effective memory management).
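A toy illustration of the LRU eviction behind such a spill-over cache, built on java.util.LinkedHashMap in access order. The real ThorCache additionally spills evicted indexes to disk, which this sketch only marks with a comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU cache in the spirit of ThorCache: least-recently-used block indexes
// are evicted when capacity is exceeded. A real spill-over cache would write
// evicted entries to disk rather than discard them.
public class LruIndexCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruIndexCacheSketch(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        boolean evict = size() > capacity;
        // if (evict) spillToDisk(eldest);  // spill-over would happen here
        return evict;
    }

    public static void main(String[] args) {
        LruIndexCacheSketch<Long, String> cache = new LruIndexCacheSketch<>(2);
        cache.put(1L, "block-1 index");
        cache.put(2L, "block-2 index");
        cache.get(1L);                  // touch block-1 so block-2 becomes eldest
        cache.put(3L, "block-3 index"); // evicts block-2
        System.out.println(cache.keySet()); // [1, 3]
    }
}
```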
Raptor Meta Store – an RDBMS-based database that stores all the metadata required by the various components in the system.
Quantum Processor – a very lightweight processing framework written within the Hadoop layer that returns results in a fraction of the time of traditional MapReduce. It is used for real-time queries, when the level 1 segmentation search returns positive/negative and the level 2 segmentation search returns positive.
Business-specific Code Generation Framework – this framework generates all the infrastructure components, such as the indexer, digester, table schema, dimensions, etc., based on a user-defined XML schema.
Policy Manager – this component helps organize data in a time-sorted way, automatically moving older bucketized data to sparsely indexed/managed buckets while retaining newer data in highly organized/indexed buckets, based on the defined policy. This allows for efficient resource utilization.
Intelligent packet compression: based on congestion on the network, Raptor switches to a resilient compression mode where packets are compressed using a CPU-efficient algorithm.
Scheduling of jobs is now done using NameNode operation maps, where the NameNode maintains information on all the operations (reads, writes, jobs) currently being performed on the cluster. This ensures an even distribution of all operations on the cluster.
All computation-intensive operations are scheduled on nodes that have the capabilities to execute such jobs on GPUs.
Adaptive rebalancing increases the replication factor of frequently used data, ensures even availability of data, and moves data closer to frequently accessed client sub-cluster nodes.
Interactive user console: provides an interface for the user to manage the cluster, tasks, indexes, shards, filters, etc. for each table, set up policies around data, and archive obsolete data.
Architecture: Raptor over Hadoop/HBase. Raptor partitions incoming transactions and executes classifiers over GPUs for each partition. The flagging cluster uses meta-classifiers to flag transactions, and the notification cluster sends out notifications. Raptor also implements phasing out of data and classifier pruning. The Hive interface is not used except for periodic fraud report generation.
How does it work? Divide data into subsets/buckets/partitions (where Raptor excels). Apply data mining techniques to generate classifiers in parallel. Combine the resultant base models to generate a meta-classifier. The base classifiers execute in parallel while the meta-classifier combines their results. Pruning removes redundant classifiers from the ensemble (very old data).
A large-scale data mining technique using a scalable black-box approach; a cost-reduction-based model is used. Learning algorithms used: C4.5, CART, Ripper, Bayes, ID3.
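A hypothetical sketch of the ensemble flow in these notes: base classifiers trained on separate partitions are combined by majority vote to flag a transaction. All names, features, and thresholds below are illustrative.

```java
import java.util.List;
import java.util.function.Function;

public class MetaClassifierSketch {

    // Each base classifier maps a transaction feature vector to fraud/not-fraud.
    interface BaseClassifier extends Function<double[], Boolean> {}

    static boolean flagTransaction(List<BaseClassifier> ensemble, double[] tx) {
        long fraudVotes = ensemble.stream().filter(c -> c.apply(tx)).count();
        // Majority vote across the ensemble; pruning would already have removed
        // redundant classifiers built from very old data.
        return fraudVotes * 2 > ensemble.size();
    }

    public static void main(String[] args) {
        // Toy base classifiers; real ones would come from C4.5, CART, Ripper, etc.
        List<BaseClassifier> ensemble = List.of(
                tx -> tx[0] > 10_000,            // large amount
                tx -> tx[1] < 60,                // seconds since last transaction
                tx -> tx[0] > 5_000 && tx[1] < 300);
        System.out.println(flagTransaction(ensemble, new double[]{12_000, 30}));
    }
}
```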
Smart customer care solution – allows CSR representatives to query terabytes of real-time business data and handle customer queries on complaints and events.
Financial fraud analytics – retrospective analysis of incoming financial logs; ranks entities based on suspicious behavior and generates events on suspected entry triggers.
Media usage log analytics – "targeted marketing": the analytics system adds short-term and long-term scores to marketing content based on media content usage statistics; this information helps target end users with the right marketing content.
Predictive trading – a distributed GPU-based event processing platform to significantly speed up algorithmic trading computations, namely market event parsing and market event matching against strategies. Strategies are distributed in disjoint clusters to enable highly parallelizable event matching; in each cluster, strategies are stored as contiguous blocks of memory to enable fast sequential access and improve memory locality.
Computation-intensive jobs using GPUs – just as in traditional parallel computing, distributed GPU processing scales better than processing on shared systems because it does not overwhelm shared system resources.