Big Data Analytics - Accelerated 
stream-horizon.com
Legacy ETL platforms & conventional Data Integration approach 
•Unable to meet latency & data throughput demands of Big Data integration challenges 
•Based on batch (rather than streaming) design paradigms 
•Insufficient level of parallelism of data processing 
•Require specific ETL platform skills 
•High TCO 
•Inefficient software delivery lifecycle 
•Require coding of ETL logic & staff with niche skills 
•High risk & unnecessarily complex release management 
•Vendor lock-in 
•Often require exotic hardware appliances 
•Complex environment management 
•Repository ownership (maintenance & database licenses)
StreamHorizon’s “adaptiveETL” platform - Performance 
•Next generation ETL Big Data processing platform 
•Delivering performance critical Big Data projects 
•Massively parallel data streaming engine 
•Backed by an In-Memory Data Grid (Coherence, Infinispan, Hazelcast, or any other) 
•ETL processes run in memory & interact with cache (In Memory Data Grid) 
•Unnecessary Staging (I/O expensive) ETL steps are eliminated 
•Quick Time to Market (measured in days)
StreamHorizon’s “adaptiveETL” platform - Effectiveness 
•Low TCO 
•Fully Configurable via XML 
•Requires no ETL platform specific knowledge 
•Shift from 'coding' to 'XML configuration' reduces IT skills required to deliver, manage, run & outsource projects 
•Eliminates ~90% of manual coding 
•Flexible & customizable - override any default behavior of the StreamHorizon platform with a custom Java, OS script or SQL implementation 
•No vendor lock-in (all custom-written code runs outside the StreamHorizon platform - no need to re-develop code when migrating your solution) 
•No ETL-tool-specific language 
•Out-of-the-box features such as Type 0, 1 & 2 and custom dimensions, with dynamic In-Memory cache formation transparent to the developer
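As a sketch of what "configuration over coding" means in practice, a pipeline definition driven purely by XML might look like the following. This is a hypothetical example only: every element and attribute name here is invented for illustration and is not the actual StreamHorizon configuration schema.

```xml
<!-- Hypothetical example: illustrates XML-driven ETL configuration.
     Element/attribute names are assumptions, not the StreamHorizon schema. -->
<etlPipeline name="tradeFeed">
  <source type="file" directory="/data/feeds" pattern="*.csv"/>
  <dimension name="currency" type="type2" cacheMode="inMemory"/>
  <!-- optional override point: a custom Java plugin instead of coded ETL -->
  <transformation override="com.example.CustomEnrichment"/>
  <target type="jdbc" table="FACT_TRADE" loadMode="bulk"/>
</etlPipeline>
```

The point of such a layout is that dimension handling (Type 0/1/2), caching and load mode are declared, not coded, with Java/SQL/script overrides reserved for genuinely custom behavior.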
AdaptiveETL enables: 
•Increased data velocity & reduced cost of delivery: 
•Quick Time to Market – deploy StreamHorizon & deliver fully functional Pilot of your Data integration project in a single week 
•Total Project Cost / Data Throughput ratio = 0.2 (20% of budget required in comparison with Market Leaders) 
•1 Hour Proof of Concept – download and test-run StreamHorizon’s demo Data Warehousing project 
•Big Data architectures supported: 
•Hadoop ecosystem - Fully integrated with Apache Hadoop ecosystem (Spark, Storm, Pig, Hive, HDFS etc.) 
•Conventional ETL deployments - Data processing throughput of 1 million records per second (single commodity server, single database table) 
•Big Data architectures with In Memory Data Grid acting as a Data Store (Coherence, Infinispan, Hazelcast etc.)
AdaptiveETL enables: 
•Achievable data processing throughput: 
•Conventional (RDBMS) ETL deployments - 1 million records per second (single database table, single commodity server, HDD Storage) 
•Hadoop ecosystem (StreamHorizon & Spark) - over 1 million records per second for every node of your cluster 
•File system (conventional & Hadoop HDFS) - 1 million records per second per server (or cluster node) utilizing HDD Storage 
•Supported Architectural paradigms: 
•Lambda Architecture - Hadoop real time & batch oriented data streaming/processing architecture 
•Data Streaming & Micro batch Architecture 
•Massively parallel conventional ETL Architecture 
•Batch oriented conventional ETL Architecture
AdaptiveETL enables: 
•Scalability & Compatibility: 
•Virtualizable & Clusterable 
•Horizontally & Vertically scalable 
•Highly Available (HA) 
•Running on Linux, Solaris, Windows, Compute Clouds (EC2 & any other) 
•Deployment Architectures: 
•Runs on Big Data clusters: Hadoop, HDFS, Kafka, Spark, Storm, Hive, Impala and more… 
•Runs as StreamHorizon Data Processing Cluster (ETL grid) 
•Runs on Compute Grid (alongside grid libraries like Quant Library or any other)
Targeted Program & Project profiles 
Greenfield project candidates: 
•Data Warehousing 
•Data Integration 
•Business Intelligence & Management Information 
•OLTP Systems (Online Transactional Processing) 
Brownfield project candidates: 
Quickly enable existing Data Warehouses & Databases which: 
•Struggle to keep up with loading of large volumes of data 
•Don’t satisfy SLA from query latency perspective 
Latency Profile: 
•Real Time (Low Latency - 0.666 microseconds per record - average) 
•Batch Oriented 
Delivery Profile: 
•<5 days Working Prototypes 
•Quick Time to Market Projects 
•Compliance & Legislation Critical Deliveries 
Skill Profile 
•Low-Medium Skilled IT workforce 
•Offshore based deliveries 
Data Volume Profile 
•VLDB (Very Large Databases) 
•Small-Medium Databases
Industries (not limited to…) 
Finance - Market Risk, Credit Risk, Foreign Exchange, Tick Data, Operations 
Telecom - Processing PM and FM data in real time (both radio and core) 
Insurance - Policy Pricing, Claim Profiling & Analysis 
Health – Activity Tracking, Care Cost Analysis, Treatment Profiling 
ISP - User Activity Analysis & Profiling, Log Data Mining, Behavioral Analysis
The Problem 
Complexity 
•Data Integration = a large number of interdependent & usually trivial units of ETL transformations 
Performance 
•Long loading time windows 
•Frequent SLA breaches 
•'Domino Effect' execution & dependencies (Waterfall paradigm) 
…the above are consequences of batch-oriented rather than 'Real Time' & 'Data Streaming' design paradigms 
Query Latency 
•Longer than expected query response times 
•Inadequate ad-hoc query capability
Our Solution (Example: RDBMS based Big Data Streaming Platform) 
Data Throughput of the StreamHorizon platform enables: 
•Indexing of Data Warehouses to an extent previously unimaginable (4+ indexes per single fact table) 
•An extensively indexed database delivers 50-80% of the load throughput of an index-free model, but greatly reduces query latency (an intentional sacrifice of load latency for query performance) 
•StreamHorizon fully supports OLAP integration (please refer to the FAQ page). OLAP delivers slice & dice and drill-down (data pivoting) capability via Excel or any other front-end tool. 
•No need for OLAP cubes or equivalent In-Memory solutions acting as 'query accelerators'. Such solutions depend on available memory and thereby limit the data volumes the system can handle. 
•Horizontal scaling of In-Memory software comes at a price (increased latency) as queried data is collated from multiple servers into a single result set 
•No need to purchase exotic or specialist hardware appliances 
•No need to purchase In-Memory/OLAP hardware & licences 
•Simple software stack: 
StreamHorizon + Vanilla Database 
VS. 
ETL Tool + Database (usually exotic) + OLAP/In-Memory solution (usually clustered)
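The intentional sacrifice of load latency for query performance described above can be sketched in a few lines of Python. This is a toy model, not StreamHorizon code: maintaining an index costs extra work on every insert, but turns a full-table scan into a logarithmic lookup.

```python
import bisect

class ToyFactTable:
    """Toy model: an optional sorted index trades load-time work for query speed."""
    def __init__(self, indexed: bool):
        self.rows = []
        self.index = [] if indexed else None   # sorted copy of the key column

    def load(self, key):
        self.rows.append(key)
        if self.index is not None:
            bisect.insort(self.index, key)     # extra work on every load

    def count_below(self, key):
        if self.index is not None:
            return bisect.bisect_left(self.index, key)   # O(log n) lookup
        return sum(1 for k in self.rows if k < key)      # full scan

plain, indexed = ToyFactTable(False), ToyFactTable(True)
for k in (5, 1, 9, 3, 7):
    plain.load(k)
    indexed.load(k)

# Same answer either way; only the cost profile differs.
assert plain.count_below(6) == indexed.count_below(6) == 3
```

The same logic scales: each extra index slows the load path a little and removes scan work from every query that can use it.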
Our Solution - (continued) 
•Time to market of project delivery - measured in days rather than months 
•IT skills required for development are reduced to basic IT knowledge 
•Manage data volumes typical only of Financial Exchanges, Telecom blue chips and ISPs 
•Desktops running the StreamHorizon platform have more data processing bandwidth than commodity servers running state-of-the-art (read: complex and expensive) ETL tools with dozens of CPUs 
•Generic & adaptable ETL platform 
•Simple to set up (XML configuration) 
•Platform geared to deliver projects of high complexity (in low-latency or batch-oriented fashion)
Architecture
High Level Architecture 
[Architecture diagram] The StreamHorizon Data Processing Platform sits between Data Sources and Data Targets (Middleware/Messaging, Files & Directories, Direct Streaming, Web Services, Databases and Hadoop on both sides). Core components: Data Processing Engine, Data Delivery Engine, Processing Coordinator, Custom Plugins Engine and In-Memory Data Grid. Data processing throughput of 1+ million records per second.
Component Architecture 
[Component diagram] Data Processing Engine (Data Cleansing, Data Quality, Data Transformation, Data Mining), Data Delivery Engine (File Writer, Hadoop Persistor, DB Bulk Loader, JDBC Loader, Message Generator), Processing Coordinator (Transaction Manager, Event Generator, Recovery Manager, Data Integrity Guard, Processing Metrics & Stats), Custom Plugins Engine (Custom Java, Custom SQL, Custom Scripts) and In-Memory Data Grid (Local Cache, Global Cache), connecting the same Data Sources and Data Targets (Middleware/Messaging, Files & Directories, Direct Streaming, Web Services, Databases, Hadoop).
Data Processing Flow (Generic Target)
Data Processing Flow (RDBMS Database as a Target)
Benchmarks
Benchmarks 
Benchmark 1 
•Reading: Buffered File Read (JVM Heap) | Writing: Off | DB Load: Off 
•Throughput: 2.9 million/sec 
•Deployment: Local Cluster (single server), 3 instances x 12 ETL threads per instance 
•Theoretical scenario (data is normally always persisted): shows StreamHorizon's data processing ability without external bottlenecks - DB persistence, bulk-file writing and feed-file reading are all switched off 
File to File 
•Reading: File (SAN) | Writing: Bulk File or any other file (feed) | DB Load: Off 
•Throughput: 1.86 million/sec 
•Deployment: Local Cluster (single server), 3 instances x 12 ETL threads per instance 
•Shows the ability to create an output feed (bulk file or any other file) by applying ETL logic to an input feed 
BULK Load 
•Reading: File (SAN) | Writing: Bulk File (fact table format) | DB Load: Bulk Load (Oracle - External Tables) 
•Throughput: 1.9 million/sec 
•Deployment: Single Instance (single server), 1 instance x 50 DB threads 
JDBC Load 
•Reading: File (SAN) | Writing: Off | DB Load: JDBC Load (Oracle) 
•Throughput: 1.1 million/sec 
•Deployment: Local Cluster (single server), 3 instances x 16 ETL threads
Benchmarks (setup & dependency) 
Benchmarks performed with: 
•200 million loaded records 
•Individual file size of 100K records 
•Each record : 
minimum 200 bytes / 20 attributes 
•Target schema is Data Mart 
•Kimball design methodology 
•Fact table contains: 
•6 Dimensions & 3 Measures 
•Single dimension cardinality up to 1200 
•Data generated by Quant Library (meaning ‘partially’ sorted) 
•Performing housekeeping tasks 
Housekeeping tasks performed by StreamHorizon: 
•Files moved from source to archive directory 
•Bulk files created and deleted during processing 
•Metrics gathered (degrades performance on average by approx 20%) 
•Files moved from source to error directory (corrupted data feeds)
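The housekeeping flow above (archive processed feeds, divert corrupted ones to an error directory) can be sketched in Python. This is illustrative only: the directory names and the corruption check are assumptions for the example, not StreamHorizon behaviour.

```python
import shutil
import tempfile
from pathlib import Path

def housekeep(source: Path, archive: Path, error: Path, is_corrupt) -> dict:
    """Move processed feed files to archive/ and corrupted feeds to error/."""
    archive.mkdir(parents=True, exist_ok=True)
    error.mkdir(parents=True, exist_ok=True)
    counts = {"archived": 0, "errored": 0}
    for feed in sorted(source.glob("*.csv")):
        if is_corrupt(feed):
            shutil.move(str(feed), str(error / feed.name))
            counts["errored"] += 1
        else:
            shutil.move(str(feed), str(archive / feed.name))
            counts["archived"] += 1
    return counts

# Usage: two well-formed feeds and one empty (corrupted) one.
root = Path(tempfile.mkdtemp())
(root / "src").mkdir()
for name, body in [("a.csv", "1,ok"), ("b.csv", "2,ok"), ("c.csv", "")]:
    (root / "src" / name).write_text(body)

result = housekeep(root / "src", root / "archive", root / "error",
                   is_corrupt=lambda f: f.read_text() == "")
assert result == {"archived": 2, "errored": 1}
```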
Cost - Effectiveness
Functional & Hardware Profile (StreamHorizon vs. Market Leaders) 
StreamHorizon: 
•Horizontal scalability (at no extra cost) 
•Vertical scalability 
•Clusterability, Recoverability & High Availability (at no extra cost) 
•Runs on commodity hardware* 
•Runs on Linux, Solaris, Windows and Compute Cloud (EC2) 
•Ability to run on personal hardware (laptops & workstations)* 
•Exotic database cluster licences: Not Required (vs. Often Required by Market Leaders) 
•Specialized Data Warehouse Appliances (exotic hardware): Not Required (vs. Often Required by Market Leaders) 
* - Reflects the efficiency of the StreamHorizon platform in hardware resource consumption compared to Market Leaders
Cost-Effectiveness Analysis - I 
Target throughput of 1 million records per second (StreamHorizon vs. Market Leaders): 
•Hardware cost: 1 unit vs. 3-20 units 
•Time to market (installation, setup & full production-ready deployment)*: 4 days vs. 40+ days 
•Throughput (records per hour)**: 3.69 billion (single engine) vs. 1.83 billion (3 engines) 
•Throughput (records per second)**: 1 million (single engine) vs. 510K (3 engines) 
•Requires human resources with specialist ETL vendor skills: No vs. Yes 
•Setup and administration solely via intuitive XML configuration: Yes vs. No 
•FTE headcount required to support the solution: 0.25 vs. 2 
* - Assuming the file data feeds to be processed by StreamHorizon (a project dependency) are available at the time of installation 
** - Please refer to the end of this document for a detailed description of the hardware environment
Cost-Effectiveness Analysis - II 
[Charts] Hardware cost ($) vs. throughput (250K to 1m records per second): Specialized Data Warehousing Appliances and Market Leading Vendors cost multiples of StreamHorizon at every throughput level. Cost of ownership (man-days per month): Market Leading Vendors require roughly double the monthly delivery effort of StreamHorizon.
Cost-Effectiveness Analysis - III 
[Charts] Development effort (man-days) and time to market (project inception to delivery): roughly 50 man-days for Market Leading Vendors vs. 25 for StreamHorizon. Overall cost-effectiveness ratio per project (unit), comparing total costs (licence & hardware), speed of delivery, solution complexity and maintainability.
Use Case study
Use Case Study 
Delivering Market Risk system for Tier 1 Bank 
•Increased the throughput of the system by a factor of 10 
•Reduced the code base from 10,000+ lines of code to 420 lines 
•Outsourcing model is now a realistic target (the delivered solution has almost no code base and is fully configurable via XML) 
Workforce Savings: 
•Reduced number of FTE & part-time staff engaged on the project (due to simplification) 
Hardware Savings: 
•$200K worth of recycled servers (4 in total; hardware no longer required) 
Software Licence Savings: 
•$400K of recycled software licences (no longer required due to stack simplification) 
Total 
•Negative project cost (due to savings from recycled hardware and software licences) 
•BAU / RTB budget reduced by 70% due to reduced implementation complexity & IT stack simplification
Use Case Study - continued 
•Single server acts as both StreamHorizon (ETL Server) and as a database server 
•A single vanilla database instance delivers query performance better than the previously utilized OLAP cube 
(MOLAP mode of operation). 
•By eliminating the OLAP engine from the software stack: 
oUser query latency was reduced 
oETL load latency was reduced by a factor of 10+ 
oThe number of concurrent users that can be supported increased 
•The Tier 1 Bank was able to run the complete Risk batch data processing on a single desktop (without breaking SLA).
ROI & Risk Management
Reduced Workforce demand 
Development 
•Reduced development headcount (due to simplicity of StreamHorizon platform) 
•Enables utilization of skills of your current IT team 
•No need to hire extra headcount with specialized (ETL platform specific) knowledge 
•Customizations and extensions can be coded in Java, Scripts and SQL 
•Supports agile development methods 
Infrastructure & Support 
•Install at one location – run everywhere 
•Not required - Dedicated teams of experts to setup and maintain ETL servers 
•Not required - Dedicated teams of experts to setup and maintain ETL platform Repository 
BAU / RTB / Cost of ownership 
•BAU budgets are typically 20% of the cost of ownership of a traditionally implemented Data Integration project 
•Simple to outsource due to simplicity of delivered solution
Environment Risk Management 
•Ability to run multiple versions of StreamHorizon simultaneously on same hardware with no interference or dependencies 
•Simply turn off old and turn on new StreamHorizon version on the same hardware 
•Seamless upgrade & rollback via a simple 'File Drop' 
•Backups taken by simple directory/file copy 
•Instant startup time measured in seconds 
•ESCROW Compliant
Planned platform extensions 
•Infraction - StreamHorizon 'In-Memory' Data Store designed to overcome the limitations of existing market-leading OLAP and In-Memory data stores 
•Feed streaming to support massive compute grids and eliminate data feed (file) persistence, and thereby I/O, within the Data Processing Lifecycle 
•StreamHorizon streaming is based on a highly efficient communication protocol (analogous to Google's Protocol Buffers)
Deployment Topologies
StreamHorizon Connectivity Map 
Hadoop ecosystem 
•Spark 
•Storm 
•Kafka 
•TCP Streams 
•Netty 
•Hive 
•Impala 
•Pig 
•Any Other via plugins… 
Messaging 
•JMS (HornetQ, ActiveMQ) 
•AMQP 
•Kafka 
•Any Other via plugins… 
Non-Hadoop file systems 
•Any file system excluding mainframe 
Relational Databases 
•ORACLE 
•MSSQL 
•DB2 
•Sybase 
•SybaseIQ 
•Teradata 
•MySQL 
•H2 
•HSQL 
•PostgreSQL 
•Derby 
•Informix 
•Any Other JDBC compliant… 
Non-Relational Data Targets 
•HDFS 
•MongoDB 
•Cassandra 
•HBase 
•Elastic Search 
•TokuMX 
•Apache CouchDB 
•Cloudata 
•Oracle NoSQL Database 
•Any Other via plugins… 
Acting as 
•Hadoop 2 Hadoop Data Streaming Platform 
•Conventional ETL Data Processing Platform 
•Hadoop 2 Non-Hadoop ETL bridge 
•Non-Hadoop 2 Hadoop ETL bridge
StreamHorizon OLAP Integration 
•Seamless integration with OLAP & In-Memory servers 
•Supported integration modes: 
•Real Time – Data uploaded to the OLAP server immediately after every successful completion of ETL or DB task by StreamHorizon 
•Batch Oriented – Data uploaded to the OLAP server in batches rather than immediately after processing of each data entity (file, message, SQL query) 
•OLAP integration is available in all StreamHorizon operational modes 
•JDBC 
•Bulk Load 
•Thrift 
•Hadoop 
•Messaging 
•Any other connectivity mode supported by StreamHorizon 
•More information is available on StreamHorizon FAQ page
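The difference between the two integration modes above can be sketched with a toy model. The class and its "upload" mechanics are illustrative assumptions, not the StreamHorizon API: real-time mode pushes to the OLAP server after every completed task, batch mode groups records into fewer, larger uploads.

```python
class OlapUploader:
    """Toy model of the two OLAP integration modes:
    'real-time' uploads after every completed ETL/DB task,
    'batch' groups records and uploads once per batch."""
    def __init__(self, mode: str = "batch", batch_size: int = 3):
        self.mode, self.batch_size = mode, batch_size
        self.buffer, self.uploads = [], []

    def on_task_complete(self, record):
        self.buffer.append(record)
        if self.mode == "real-time" or len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.uploads.append(list(self.buffer))   # one call to the OLAP server
            self.buffer.clear()

rt = OlapUploader(mode="real-time")
for r in range(5):
    rt.on_task_complete(r)
assert len(rt.uploads) == 5                        # one upload per task: lowest latency

batch = OlapUploader(mode="batch", batch_size=3)
for r in range(5):
    batch.on_task_complete(r)
batch.flush()                                      # end-of-batch drain
assert [len(u) for u in batch.uploads] == [3, 2]   # fewer, larger uploads
```

The trade-off is the usual one: real-time mode minimizes data latency in the OLAP layer, batch mode minimizes per-record upload overhead.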
Deployment – Biased
Configuration Options
ETL Streams – Complex & Integrated ETL Flows
Deployment – Vanilla & Grid 
[Diagram] StreamHorizon Deployment Topologies: 
•Single Instance (single-server deployment) - a single StreamHorizon instance with a massively parallel threadpool, running on a commodity server 
•Local Cluster (single-server deployment) - multiple StreamHorizon instances with parallel threadpools, running on a commodity server 
•Data Processing Grid (multi-server deployment) - Single Instance & Local Cluster StreamHorizon deployments running on commodity servers
Data Processing Grid - ETL cluster 
Data Processing Grid Deployment 
•Enables your organisation to build a Data Processing grid, that is, a cluster of ETL (Data Processing) servers (analogous to a Compute Grid for computation tasks) 
•Enables IT infrastructure to reduce the total number of ETL servers by an approximate factor of 4 
•Ability to seamlessly upgrade your Data Processing grid via your compute grid scheduler (DataSynapse or other) or any other management software (Altiris etc.) 
Compute Grid Deployment 
•StreamHorizon enables you to process data (execute ETL logic) on your compute grid servers as soon as the data is created (by a Quant Library, for example) 
•Helps avoid sending complex/inefficient data structures (like middleware messages, XML or FpML) over the network 
•Utilize StreamHorizon to process your data at compute grid nodes and persist it directly to your persistence store 
•Eliminates the need for expensive In-Memory Data Grid solutions
Data Processing Grid & XML 
Data Volumes Shipped via Network vs. XML (performance vs. readability) 
•Faced with challenging volumes, numerous architectural solutions gravitate away from XML (a self-descriptive, human-readable, but not storage-effective data format) 
•XML message size is unnecessarily large, dominated by metadata rather than the data itself (over 80%) 
•XML does, however, make the daily job easier for your IT staff and Business Analysts, as it is an intuitive and human-readable data format 
The Solution - Best of both worlds 
•StreamHorizon deployed at your compute grid nodes enables you to keep XML as the inter-process/inter-component message format - without burdening the network 
•StreamHorizon persists only the data directly to your persistence store (most commonly a database), achieving minimal network congestion and latency (as if you were not using XML as the cross-component communication format)
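The "over 80% metadata" point is easy to check on a toy message. The field names and values below are invented for illustration:

```python
# A small XML message: tags dominate the bytes on the wire.
xml_msg = ("<trade><id>42</id><notional>1000000</notional>"
           "<currency>USD</currency></trade>")
payload = "42" + "1000000" + "USD"         # the actual data being shipped
overhead = 1 - len(payload) / len(xml_msg)
assert overhead > 0.8                      # > 80% of the bytes are metadata

# The same record as a positional row, as a compact feed might ship it:
csv_row = "42,1000000,USD"
assert len(csv_row) < len(xml_msg) / 5
```

Persisting only the payload, as described above, is what keeps the network out of the XML tax.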
Compute Grid - Distributed Deployment (Example: Finance Industry) 
[Diagram] StreamHorizon Embedded into a Compute Grid: single StreamHorizon instances deployed on grid nodes alongside Quant Library instances; a Compute Grid Scheduler drives PnL, Credit Risk and Market Risk workloads over Market, Trade and Static data sources; StreamHorizon interacts directly with the Data Target (a consolidated Data Warehouse).
Compute Grid – Data Streaming (Example: Finance Industry) 
[Diagram] StreamHorizon Direct Streaming - Compute Grid Deployment: Quant Library instances on grid nodes stream results via StreamHorizon Direct Streaming clients to a central StreamHorizon instance, which persists them to the consolidated Data Warehouse.
Big Data Analytics - Accelerated
StreamHorizon & Big Data 
Integrates into your Data Processing Pipeline… 
•Seamlessly integrates at any point of your data processing pipeline 
•Implements generic input/output HDFS connectivity 
•Enables you to implement your own customised input/output HDFS connectivity 
•Ability to process data from heterogeneous sources like: 
•Storm 
•Kafka 
•TCP Streams 
•Netty 
•Local File System 
•Any Other… 
Accelerates your clients… 
•Reduces Network data congestion 
•Improves latency of Impala, Hive or any other Massively Parallel Processing SQL Query Engine. 
And more… 
•Portable Across Heterogeneous Hardware and Software Platforms 
•Portable from one platform to another 
•Impala 
•Hive 
•HBase 
•Any Other…
StreamHorizon - Flavours – Big Data Processing 
Storm - Reactive, Fast, Real Time Processing 
•Guaranteed data processing 
•Guarantees no data loss 
•Real-time processing 
•Horizontal scalability 
•Fault-tolerance 
•Stateless nodes 
•Open Source 
Kafka 
•Designed for processing real-time activity stream data (metrics, KPIs, collections, social media streams) 
•A distributed Publish-Subscribe messaging system for Big Data 
•Acts as Producer, Broker & Consumer of message topics 
•Persists messages (has the ability to rewind) 
•Initially developed by LinkedIn (now an Apache project) 
Hadoop – Big Batch Oriented Processing 
•Batch processing 
•Jobs run to completion 
•Stateful nodes 
•Scalable 
•Guarantees no data loss 
•Open Source 
Seamless integration across all of the above
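The Kafka property highlighted above (messages persist, so a consumer can rewind) can be illustrated with a toy append-only log. This is a sketch of the idea only, not the Kafka API:

```python
class TopicLog:
    """Toy append-only topic: messages persist after delivery,
    so any consumer can rewind and replay (unlike a classic queue)."""
    def __init__(self):
        self.messages = []
        self.offsets = {}                  # consumer id -> next offset to read

    def produce(self, msg):
        self.messages.append(msg)          # broker persists every message

    def consume(self, consumer: str):
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.messages)
        return self.messages[start:]

    def rewind(self, consumer: str, offset: int = 0):
        self.offsets[consumer] = offset    # replay from an earlier offset

log = TopicLog()
for m in ("a", "b", "c"):
    log.produce(m)

assert log.consume("etl") == ["a", "b", "c"]
assert log.consume("etl") == []                 # caught up
log.rewind("etl")
assert log.consume("etl") == ["a", "b", "c"]    # replay after rewind
```

Because the broker keeps the log rather than deleting delivered messages, an ETL consumer can reprocess a feed after a bug fix or an outage simply by resetting its offset.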
StreamHorizon – Big Data Processing Pipeline
Big Data & StreamHorizon – Data Persistence (Example: Finance Industry) 
[Diagram] Data Calculation, Data Processing & HDFS Persistence: StreamHorizon instances (daemons) run on each DataNode of the Big Data cluster alongside Quant Library data sources; data processing tasks are coordinated with the HDFS NameNode and results are persisted locally to HDFS.
Big Data & StreamHorizon – Data Retrieval (Example: Finance Industry) 
[Diagram] Network congestion, nominal vs. the StreamHorizon 'Accelerator Method': user groups query HDFS/Impala/Hive nodes; with the Accelerator Method, low-cardinality dimensional data is served separately from high-cardinality fact data, keeping network congestion nominal.
Streaming Data Aggregations – impact on Big Data Query Latency 
Impala & Hive query profile, out of the box vs. with Streaming Data Aggregations: 
•Query Latency: Impala Medium-Low / Hive High → Impala Low / Hive Low 
•Memory Footprint: Impala Medium-Low / Hive Medium-High → Impala Low / Hive Medium-Low 
•Processing Footprint: Impala High / Hive Nominal → Impala Medium-Low / Hive Nominal-Low 
•Space Consumption: Impala Medium / Hive High → Impala Low / Hive Low
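Why streaming aggregations cut query latency can be shown with a small sketch (illustrative Python, not platform code): aggregates are maintained while the data streams in, so a query reads a tiny pre-aggregated table instead of scanning raw facts at query time.

```python
from collections import defaultdict

facts = []                       # raw fact stream (what Impala/Hive would scan)
agg = defaultdict(float)         # streaming aggregate: key -> running sum

def ingest(key, measure):
    facts.append((key, measure))
    agg[key] += measure          # aggregated at load time, not at query time

for k, v in [("FX", 10.0), ("Rates", 5.0), ("FX", 2.5)]:
    ingest(k, v)

# The query hits the small aggregate table; no fact scan is needed.
assert agg["FX"] == 12.5
# A scan-based answer matches, but touches every fact row:
assert sum(v for k, v in facts if k == "FX") == agg["FX"]
```

The cost moves from query time to load time, which is exactly the latency/footprint shift the comparison above describes.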
Other Technical details 
•Pluggable Architecture - extend and customize StreamHorizon platform without recompilation 
•Integrates with RDBMS vendors via JDBC (Oracle, MSSQL, MySQL, SybaseIQ, Sybase, Teradata, kdb+ and others) 
•Seamless integration with major programming and script languages (Java, SQL, Shell, Python, Scala, Windows Batch etc.) 
•Utilizes leading industry standard software development practices (GIT, Maven, TDD, IMDG…) 
•Instant startup time measured in seconds
Commodity Server Deployment (details of StreamHorizon benchmark environments) 
Target throughput of 1 million records per second 
•Commodity Tier 2 SAN storage (no SSD; no local storage was used) 
•6 commodity AMD 2.1 GHz processors (4 cores each) 
•Files of 100K records each (minimum single-line size of 400 bytes) 
•35 attributes per line comprising 13 dimensions (Kimball Star Schema data model; the fact table has 18 attributes in total) 
•Tested with heavily indexed database tables tuned to respond to queries scanning up to 10 million fact table records within 1.4 seconds (without the use of aggregate tables, materialized views or equivalent facilities; standard (no RAC, non-Parallel Query) Enterprise Edition Oracle database instance) 
•Testing Query profile: 
oAggregate up to 10 million transactional records & return results 
oReturn up to 10,000 of transactional records without aggregation 
oAll queries execute against single table partition which contains 500 Million records
Desktop Hardware Deployment (details of StreamHorizon benchmark environments) 
Target throughput of 250K records per second 
•Target hardware: laptops or workstations (commodity personal hardware) 
•Local (non-SSD) hard disk storage used (single hard disk) 
•2 AMD (A8) processors (4 cores) 
•Files of 100K records each (minimum single-line size of 400 bytes) 
•35 attributes per line comprising 13 dimensions (Kimball Star Schema data model)
Q&A 
stream-horizon.com

 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BIDataWorks Summit
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)Marco Gralike
 
oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021ssuser8ccb5a
 
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptxCERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptxcamyla81
 
StreamHorizon and bigdata overview
StreamHorizon and bigdata overviewStreamHorizon and bigdata overview
StreamHorizon and bigdata overviewStreamHorizon
 

Similar to StreamHorizon overview (20)

Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Survey of Big Data Infrastructures
Survey of Big Data InfrastructuresSurvey of Big Data Infrastructures
Survey of Big Data Infrastructures
 
Apache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouseApache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouse
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in RakutenKafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)
 
oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021
 
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptxCERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
 
An AMIS overview of database 12c
An AMIS overview of database 12cAn AMIS overview of database 12c
An AMIS overview of database 12c
 
ETL (1).ppt
ETL (1).pptETL (1).ppt
ETL (1).ppt
 
What database
What databaseWhat database
What database
 
StreamHorizon and bigdata overview
StreamHorizon and bigdata overviewStreamHorizon and bigdata overview
StreamHorizon and bigdata overview
 

Recently uploaded

(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 

Recently uploaded (20)

(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 

StreamHorizon overview

  • 1. Big Data Analytics - Accelerated stream-horizon.com
  • 2. Legacy ETL platforms & conventional Data Integration approach
•Unable to meet latency & data throughput demands of Big Data integration challenges
•Based on batch (rather than streaming) design paradigms
•Insufficient level of parallelism of data processing
•Require specific ETL platform skills
•High TCO
•Inefficient software delivery lifecycle
•Require coding of ETL logic & staff with niche skills
•High risk & unnecessarily complex release management
•Vendor lock-in
•Often require exotic hardware appliances
•Complex environment management
•Repository ownership (maintenance & database licenses)
  • 3. StreamHorizon’s “adaptiveETL” platform - Performance
•Next generation ETL Big Data processing platform
•Delivers performance critical Big Data projects
•Massively parallel data streaming engine
•Backed by an In-Memory Data Grid (Coherence, Infinispan, Hazelcast or any other)
•ETL processes run in memory & interact with the cache (In-Memory Data Grid)
•Unnecessary Staging (I/O expensive) ETL steps are eliminated
•Quick Time to Market (measured in days)
  • 4. StreamHorizon’s “adaptiveETL” platform - Effectiveness
•Low TCO
•Fully configurable via XML
•Requires no ETL platform specific knowledge
•Shift from 'coding' to 'XML configuration' reduces the IT skills required to deliver, manage, run & outsource projects
•Eliminates 90% of manual coding
•Flexible & customizable – override any default behavior of the StreamHorizon platform with a custom Java, OS script or SQL implementation
•No vendor lock-in (all custom written code runs outside the StreamHorizon platform – no need to re-develop code when migrating your solution)
•No ETL tool specific language
•Out of the box features such as Type 0, 1 and 2 dimensions, custom dimensions & dynamic In-Memory Cache formation, all transparent to the developer
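The "override any default behavior with custom Java" and "no vendor lock-in" points above can be illustrated with a minimal sketch. The `RecordTransformer` interface and the field names below are invented for illustration, not the actual StreamHorizon plugin API; the point is that a plugin is plain Java, so it remains usable outside the platform:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plugin contract - the real StreamHorizon plugin API may differ.
interface RecordTransformer {
    Map<String, String> transform(Map<String, String> record);
}

// Example custom plugin: normalises a currency code field.
// Being plain Java with no platform imports, the class can be reused
// verbatim if the solution is ever migrated off the platform.
class CurrencyNormaliser implements RecordTransformer {
    @Override
    public Map<String, String> transform(Map<String, String> record) {
        Map<String, String> out = new HashMap<>(record);
        out.computeIfPresent("currency", (k, v) -> v.trim().toUpperCase());
        return out;
    }
}

public class PluginDemo {
    public static void main(String[] args) {
        Map<String, String> rec = new HashMap<>();
        rec.put("currency", " usd ");
        rec.put("amount", "100.50");
        Map<String, String> result = new CurrencyNormaliser().transform(rec);
        System.out.println(result.get("currency")); // prints USD
    }
}
```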
  • 5. AdaptiveETL enables:
•Increased data velocity & reduced cost of delivery:
•Quick Time to Market – deploy StreamHorizon & deliver a fully functional Pilot of your Data Integration project in a single week
•Total Project Cost / Data Throughput ratio = 0.2 (20% of the budget required in comparison with Market Leaders)
•1 Hour Proof of Concept – download and test-run StreamHorizon’s demo Data Warehousing project
•Big Data architectures supported:
•Hadoop ecosystem - fully integrated with the Apache Hadoop ecosystem (Spark, Storm, Pig, Hive, HDFS etc.)
•Conventional ETL deployments - data processing throughput of 1 million records per second (single commodity server, single database table)
•Big Data architectures with an In-Memory Data Grid acting as a Data Store (Coherence, Infinispan, Hazelcast etc.)
  • 6. AdaptiveETL enables:
•Achievable data processing throughput:
•Conventional (RDBMS) ETL deployments - 1 million records per second (single database table, single commodity server, HDD storage)
•Hadoop ecosystem (StreamHorizon & Spark) - over 1 million records per second for every node of your cluster
•File system (conventional & Hadoop HDFS) - 1 million records per second per server (or cluster node) utilizing HDD storage
•Supported architectural paradigms:
•Lambda Architecture - Hadoop real time & batch oriented data streaming/processing architecture
•Data Streaming & Micro-batch Architecture
•Massively parallel conventional ETL Architecture
•Batch oriented conventional ETL Architecture
  • 7. AdaptiveETL enables:
•Scalability & compatibility:
•Virtualizable & clusterable
•Horizontally & vertically scalable
•Highly Available (HA)
•Runs on Linux, Solaris, Windows & Compute Clouds (EC2 or any other)
•Deployment architectures:
•Runs on Big Data clusters: Hadoop, HDFS, Kafka, Spark, Storm, Hive, Impala and more…
•Runs as a StreamHorizon Data Processing Cluster (ETL grid)
•Runs on a Compute Grid (alongside grid libraries such as a Quant Library or any other)
  • 8. Targeted Program & Project profiles
Greenfield project candidates:
•Data Warehousing
•Data Integration
•Business Intelligence & Management Information
•OLTP Systems (Online Transactional Processing)
Brownfield project candidates - quickly enable existing Data Warehouses & Databases which:
•Struggle to keep up with loading of large volumes of data
•Don’t satisfy SLAs from a query latency perspective
Latency profile:
•Real Time (Low Latency - 0.666 microseconds per record, on average)
•Batch Oriented
Delivery profile:
•<5 days Working Prototypes
•Quick Time to Market Projects
•Compliance & Legislation Critical Deliveries
Skill profile:
•Low-Medium skilled IT workforce
•Offshore based deliveries
Data volume profile:
•VLDB (Very Large Databases)
•Small-Medium Databases
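The quoted real-time latency of 0.666 microseconds per record implies a sustained rate of roughly 1.5 million records per second, which is consistent with the per-node throughput figures quoted elsewhere in the deck. A quick arithmetic check (not a published StreamHorizon figure, just the reciprocal of the stated latency):

```java
public class LatencyCheck {
    public static void main(String[] args) {
        // Stated average per-record latency from the "Latency profile" slide.
        double latencySeconds = 0.666e-6; // 0.666 microseconds per record
        // Throughput is the reciprocal of per-record latency.
        double recordsPerSecond = 1.0 / latencySeconds;
        System.out.printf("%.0f records/sec%n", recordsPerSecond); // ~1.5 million records/sec
    }
}
```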
  • 9. Industries (not limited to…)
Finance - Market Risk, Credit Risk, Foreign Exchange, Tick Data, Operations
Telecom - Processing PM and FM data in real time (both radio and core)
Insurance - Policy Pricing, Claim Profiling & Analysis
Health - Activity Tracking, Care Cost Analysis, Treatment Profiling
ISP - User Activity Analysis & Profiling, Log Data Mining, Behavioral Analysis
  • 10. The Problem
Complexity
•Data Integration = a set of large (in number), interdependent & usually trivial units of ETL transformations
Performance
•Long loading time windows
•Frequent SLA breaches
•‘Domino Effect’ execution & dependencies (Waterfall paradigm)
The above are consequences of batch oriented rather than ‘Real Time’ & ‘Data Streaming’ design paradigms
Query Latency
•Longer than expected query response times
•Inadequate Ad-Hoc query capability
  • 11. Our Solution (Example: RDBMS based Big Data Streaming Platform)
The data throughput of the StreamHorizon platform enables:
•Indexing of Data Warehouses to an extent previously unimaginable (4+ indexes per single fact table)
•An extensively indexed database delivers 50-80% of the load throughput of a model without indexes, but reduces query latency (an intentional sacrifice of load latency for query performance)
•StreamHorizon fully supports OLAP integration (please refer to the FAQ page). OLAP delivers slice & dice and drill-down (data pivoting) capability via Excel or any other user front end tool.
•No need to utilize OLAP cubes or equivalents (In-Memory solutions) acting as 'query accelerators'. Such solutions are dependent on available memory and thereby impose a limit on the data volumes the system can handle.
•Horizontal scaling of In-Memory software comes at a price (increased latency) as queried data is collated from multiple servers into a single resultset
•No need to purchase exotic or specialist hardware appliances
•No need to purchase In-Memory/OLAP hardware & licences
•Simple software stack: StreamHorizon + Vanilla Database vs. ETL Tool + Database (usually exotic) + OLAP/In-Memory solution (usually clustered)
  • 12. Our Solution - (continued)
•Time to market of project delivery - measured in days rather than months
•IT skills required for development are reduced to basic IT knowledge
•Manage data volumes typical only of Financial Exchanges, Telecom blue chips and ISPs
•Desktops running the StreamHorizon platform have more data processing bandwidth than commodity servers running state of the art (read: complex and expensive) ETL tools with dozens of CPUs
•Generic & adaptable ETL platform
•Simple to set up (XML configuration)
•Platform geared to deliver projects of high complexity (in low latency or batch oriented fashion)
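The "simple to set up (XML configuration)" claim can be made concrete with a sketch of what such a pipeline definition might look like. Every element and attribute name below is invented for illustration; the actual StreamHorizon XML schema is not shown in this deck:

```xml
<!-- Hypothetical pipeline configuration; actual StreamHorizon schema differs -->
<etlPipeline name="trade-feed">
  <!-- Source: CSV feed files dropped into a directory -->
  <source type="file" directory="/data/incoming" pattern="*.csv"/>
  <!-- Type 2 slowly changing dimension, resolved via the in-memory cache -->
  <dimension name="counterparty" type="type2" cacheStrategy="inMemory"/>
  <!-- Target: bulk load into a fact table over multiple DB threads -->
  <target type="jdbc" url="jdbc:oracle:thin:@dbhost:1521/DWH"
          table="FACT_TRADES" loadMode="bulk" threads="16"/>
</etlPipeline>
```

The intent of such a declarative setup is that the Type 0/1/2 dimension handling and cache formation mentioned on slide 4 are driven entirely by configuration rather than hand-written code.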
  • 14. High Level Architecture (diagram)
Data Sources (Middleware Messaging, Files & Directories, Direct Streaming, Web Services, Databases, Hadoop) → StreamHorizon Data Processing Platform (Data Processing Engine, Data Delivery Engine, Processing Coordinator, Custom Plugins Engine, In-Memory Data Grid) → Data Targets (same connectivity options). Data processing throughput of 1+ million records per second.
  • 15. Component Architecture (diagram)
StreamHorizon components:
•Data Processing Engine - Data Cleansing, Data Quality, Data Transformation, Data Mining
•Data Delivery Engine - File Writer, Hadoop Persistor, DB Bulk Loader, JDBC Loader, Message Generator
•Processing Coordinator - Transaction Manager, Event Generator, Recovery Manager, Data Integrity Guard, Processing Metrics & Stats
•Custom Plugins Engine - Custom Java, Custom SQL, Custom Scripts
•In-Memory Data Grid - Local Cache, Global Cache
Data sources & targets: Middleware Messaging, Files & Directories, Direct Streaming, Web Services, Databases, Hadoop.
  • 16. Data Processing Flow (Generic Target)
  • 17. Data Processing Flow (RDBMS Database as a Target)
  • 19. Benchmarks
Benchmark 1 (no persistence):
•Reading: Buffered File Read (JVM Heap); Writing: Off; Database Load: Off
•Throughput: 2.9 million/sec
•Deployment: Local Cluster (single server), 3 instances x 12 ETL threads per instance
•A theoretical testing scenario (data is generally always persisted); shows StreamHorizon's data processing ability without external bottlenecks by switching off the DB (persistence), bulk file writing (I/O) and feed file reading (I/O)
File to File:
•Reading: Read File (SAN); Writing: Write Bulk File or any other file (feed); Database Load: Off
•Throughput: 1.86 million/sec
•Deployment: Local Cluster (single server), 3 instances x 12 ETL threads per instance
•Throughput for feed processing with a StreamHorizon Local Cluster running on a single physical server; shows the ability to create an output feed (bulk file or any other file) by applying ETL logic to an input file (feed)
BULK Load:
•Reading: Read File (SAN); Writing: Write Bulk File (fact table format); Database Load: Bulk Load
•Throughput: 1.9 million/sec
•Deployment: Single Instance (single server), 1 instance x 50 DB threads
•Throughput for DB Bulk Load (Oracle External Tables) with a single StreamHorizon instance running on a single physical server
JDBC Load:
•Reading: Read File (SAN); Writing: Off; Database Load: JDBC Load
•Throughput: 1.1 million/sec
•Deployment: Local Cluster (single server), 3 instances x 16 ETL threads
•Throughput for JDBC Load (Oracle) with a StreamHorizon Local Cluster running on a single physical server
  • 20. Benchmarks (setup & dependency)
Benchmarks performed with:
•200 million loaded records
•Individual file size of 100K records
•Each record: minimum 200 bytes / 20 attributes
•Target schema is a Data Mart (Kimball design methodology)
•Fact table contains 6 dimensions & 3 measures; single dimension cardinality up to 1200
•Data generated by a Quant Library (meaning ‘partially’ sorted)
Housekeeping tasks performed by StreamHorizon during the benchmark:
•Files moved from source to archive directory
•Bulk files created and deleted during processing
•Metrics gathered (degrades performance on average by approx 20%)
•Files moved from source to error directory (corrupted data feeds)
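For a sense of scale, the benchmark throughputs on the previous slide can be converted into wall-clock load times for the 200 million record data set described above. This is simple division of the stated figures, not an additional published result:

```java
public class BenchmarkTime {
    public static void main(String[] args) {
        long records = 200_000_000L;        // benchmark data set size
        double bulkThroughput = 1_900_000;  // records/sec, BULK Load benchmark
        double jdbcThroughput = 1_100_000;  // records/sec, JDBC Load benchmark
        // Elapsed time = records / throughput
        System.out.printf("Bulk load: ~%.0f s%n", records / bulkThroughput); // ~105 s
        System.out.printf("JDBC load: ~%.0f s%n", records / jdbcThroughput); // ~182 s
    }
}
```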
  • 22. Functional & Hardware Profile (StreamHorizon vs. Market Leaders)
•Horizontal scalability (at no extra cost)
•Vertical scalability
•Clusterability, Recoverability & High Availability (at no extra cost)
•Runs on commodity hardware*
•Exotic database cluster licences: not required (often required by Market Leaders)
•Specialized Data Warehouse Appliances (exotic hardware): not required (often required by Market Leaders)
•Runs on Linux, Solaris, Windows and Compute Cloud (EC2)
•Ability to run on personal hardware (laptops & workstations)*
* Implies the efficiency of the StreamHorizon platform in hardware resource consumption compared to Market Leaders
  • 23. Cost-Effectiveness Analysis - I (target throughput of 1 million records per second)
•Hardware cost: StreamHorizon 1 unit; Market Leaders 3-20 units
•Time to Market (installation, setup & full production ready deployment)*: StreamHorizon 4 days; Market Leaders 40+ days
•Throughput (records per hour)**: StreamHorizon 3.69 billion (single engine); Market Leaders 1.83 billion (3 engines)
•Throughput (records per second)**: StreamHorizon 1 million (single engine); Market Leaders 510K (3 engines)
•Requires human resources with specialist ETL vendor skills: StreamHorizon no; Market Leaders yes
•Setup and administration solely based on intuitive XML configuration: StreamHorizon yes; Market Leaders no
•FTE headcount required to support the solution: StreamHorizon 0.25; Market Leaders 2
* Assuming that the file data feeds to be processed by StreamHorizon (a project dependency) are available at the time of installation
** Please refer to the end of this document for a detailed description of the hardware environment
  • 24. Cost-Effectiveness Analysis - II (charts)
Charts compare hardware cost ($) against throughput (250K-1m records per second) for Specialized Data Warehousing Appliances, Market Leading Vendors and StreamHorizon, and cost of ownership (man-days of effort per month) for Market Leading Vendors vs. StreamHorizon.
  • 25. Cost-Effectiveness Analysis - III (charts)
Charts compare development effort (man-days) and Time to Market (time from project inception to project delivery) for Market Leading Vendors vs. StreamHorizon, with an overall cost-effectiveness ratio comparison based on: total costs (licence & hardware), speed of delivery, solution complexity & maintainability.
  • 27. Use Case Study
Delivering a Market Risk system for a Tier 1 Bank:
•Increased the throughput of the system by a factor of 10
•Reduced the code base from 10,000+ lines of code to 420 lines of code
•Outsourcing model is now a realistic target (the delivered solution has almost no code base and is fully configurable via XML)
Workforce savings:
•Reduced number of FTE & part-time staff engaged on the project (due to simplification)
Hardware savings:
•$200K of recycled (no longer required) servers (4 in total)
Software licence savings:
•$400K of recycled software licences (no longer required due to stack simplification)
Total:
•Negative project cost (due to savings achieved in recycled hardware and software licences)
•BAU / RTB budget reduced by 70% due to reduced implementation complexity & IT stack simplification
  • 28. Use Case Study - continued
•A single server acts as both the StreamHorizon (ETL) server and the database server
•A single vanilla database instance delivers query performance better than the previously utilized OLAP cube (MOLAP mode of operation)
•By eliminating the OLAP engine from the software stack:
 oUser query latency was reduced
 oETL load latency was reduced by a factor of 10+
 oThe number of concurrent users that can be supported is increased
•The Tier 1 Bank was able to run the complete Risk batch data processing on a single desktop (without breaking SLA)
  • 29. ROI & Risk Management
  • 30. Reduced Workforce Demand
Development
•Reduced development headcount (due to the simplicity of the StreamHorizon platform)
•Utilizes the skills of your current IT team
•No need to hire extra headcount with specialized (ETL-platform-specific) knowledge
•Customizations and extensions can be coded in Java, scripts and SQL
•Supports agile development methods
Infrastructure & Support
•Install at one location - run everywhere
•No dedicated teams of experts required to set up and maintain ETL servers
•No dedicated teams of experts required to set up and maintain an ETL platform repository
BAU / RTB / Cost of Ownership
•BAU budgets are typically 20% of the cost of ownership of a traditionally implemented data integration project
•Simple to outsource due to the simplicity of the delivered solution
  • 31. Environment Risk Management
•Run multiple versions of StreamHorizon simultaneously on the same hardware with no interference or dependencies
•Simply turn off the old and turn on the new StreamHorizon version on the same hardware
•Seamless upgrade & rollback via a simple 'file drop'
•Backups taken by simple directory/file copy
•Instant startup, measured in seconds
•ESCROW compliant
  • 32. Planned Platform Extensions
•Infraction - a StreamHorizon 'In-Memory' data store designed to overcome limitations of existing market-leading OLAP and in-memory data stores
•Feed streaming to support massive compute grids, eliminating data feed (file) persistence and thereby I/O within the data processing lifecycle
•StreamHorizon streaming is based on a highly efficient communication protocol (analogous to Google's Protocol Buffers)
  • 34. StreamHorizon Connectivity Map
Hadoop Ecosystem
•Spark •Storm •Kafka •TCP Streams •Netty •Hive •Impala •Pig •Any other via plugins…
Messaging
•JMS (HornetQ, ActiveMQ) •AMQP •Kafka •Any other via plugins…
Non-Hadoop File Systems
•Any file system excluding mainframe
Relational Databases
•ORACLE •MSSQL •DB2 •Sybase •SybaseIQ •Teradata •MySQL •H2 •HSQL •PostgreSQL •Derby •Informix •Any other JDBC compliant…
Non-Relational Data Targets
•HDFS •MongoDB •Cassandra •HBase •Elastic Search •TokuMX •Apache CouchDB •Cloudata •Oracle NoSQL Database •Any other via plugins…
Acting as
•Hadoop-to-Hadoop data streaming platform
•Conventional ETL data processing platform
•Hadoop-to-non-Hadoop ETL bridge
•Non-Hadoop-to-Hadoop ETL bridge
  • 35. StreamHorizon OLAP Integration
•Seamless integration with OLAP & in-memory servers
•Supported integration modes:
•Real-Time - data is uploaded to the OLAP server immediately after every successful completion of an ETL or DB task by StreamHorizon
•Batch-Oriented - data is uploaded to the OLAP server in batches rather than immediately after processing of each data entity (file, message, SQL query)
•OLAP integration is available in all StreamHorizon operational modes: •JDBC •Bulk Load •Thrift •Hadoop •Messaging •Any other connectivity mode supported by StreamHorizon
•More information is available on the StreamHorizon FAQ page
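The difference between the two integration modes above comes down to when the upload fires. A minimal Python sketch of the idea (class and method names are invented for illustration, not StreamHorizon's actual API):

```python
from typing import Callable, List

class OlapUploader:
    """Illustrative uploader supporting the two modes described above:
    real-time pushes each record as soon as its task completes;
    batch buffers records and pushes one upload per batch."""

    def __init__(self, upload: Callable[[List[dict]], None],
                 mode: str = "batch", batch_size: int = 3):
        self.upload = upload          # callback standing in for the OLAP load
        self.mode = mode
        self.batch_size = batch_size
        self.buffer: List[dict] = []

    def on_task_complete(self, record: dict) -> None:
        if self.mode == "real-time":
            self.upload([record])     # one upload per processed entity
        else:
            self.buffer.append(record)
            if len(self.buffer) >= self.batch_size:
                self.flush()

    def flush(self) -> None:
        # push any buffered records as a single batch
        if self.buffer:
            self.upload(self.buffer)
            self.buffer = []

uploads = []
u = OlapUploader(uploads.append, mode="batch", batch_size=3)
for i in range(7):
    u.on_task_complete({"id": i})
u.flush()                             # final partial batch
# uploads now holds 3 batches of sizes 3, 3 and 1
```

Batch mode trades upload latency for fewer round trips to the OLAP server, which is why it suits high-throughput loads.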
  • 38. ETL Streams – Complex & Integrated ETL Flows
  • 39. Deployment - Vanilla & Grid
[Diagram: StreamHorizon deployment topologies]
•Single Instance (single-server deployment): a single StreamHorizon instance with a massively parallel threadpool, running on a commodity server
•Local Cluster (single-server deployment): multiple StreamHorizon instances with parallel threadpools, running on a commodity server
•Data Processing Grid (multi-server deployment): Single Instance & Local Cluster deployments running on commodity servers
  • 40. Data Processing Grid - ETL Cluster
Data Processing Grid Deployment
•Build a Data Processing grid - a cluster of ETL (data processing) servers, analogous to a compute grid for computation tasks
•Reduces the total number of ETL servers by a factor of approximately 4
•Seamlessly upgrade your Data Processing grid via your compute grid scheduler (DataSynapse or other) or any other management software (Altiris etc.)
Compute Grid Deployment
•StreamHorizon enables you to process data (execute ETL logic) on your compute grid servers as soon as the data is created (by a quant library, for example)
•Avoids sending complex/inefficient data structures (middleware messages, XML or FpML) over the network
•Utilize StreamHorizon to process your data at compute grid nodes and persist it directly to your persistence store
•Eliminates the need for expensive In-Memory Data Grid solutions
  • 41. Data Processing Grid & XML
Data Volumes Shipped via Network vs. XML (performance vs. readability)
•Due to challenging data volumes, many architectures gravitate away from XML (a self-descriptive, human-readable, but not storage-efficient data format)
•XML message size is unnecessarily large, dominated by metadata rather than data itself (over 80%)
•XML does, however, make the daily job easier for your IT staff and business analysts, as it is an intuitive, human-readable data format
The Solution - Best of Both Worlds
•StreamHorizon deployed at your compute grid nodes lets you keep XML as the inter-process/inter-component message format - without burdening the network
•StreamHorizon persists only the data directly to your persistence store (most commonly a database), achieving minimal network congestion and latency (as if you were not using XML as the cross-component communication format)
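The metadata overhead is easy to demonstrate: encode the same record as self-describing XML and as a compact fixed-layout binary (in the spirit of Protocol Buffers, with the schema agreed out of band). The record and field names below are invented for illustration:

```python
import struct

# One trade record with three numeric fields
record = {"trade_id": 1001, "notional": 2500000.0, "rate": 0.0375}

# Self-describing XML: tags (metadata) dominate the payload
xml = ("<trade><tradeId>%d</tradeId><notional>%f</notional>"
       "<rate>%f</rate></trade>" % (record["trade_id"],
                                    record["notional"], record["rate"]))

# Compact fixed-layout binary: one 4-byte int + two 8-byte doubles,
# 20 bytes total, no per-message metadata
binary = struct.pack("<idd", record["trade_id"],
                     record["notional"], record["rate"])

ratio = len(xml) / len(binary)
# the XML form is several times larger than the 20-byte binary form
```

The gap widens further for wider records, which is why shipping XML across a grid network becomes the bottleneck long before the payload data itself does.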
  • 42. Compute Grid - Distributed Deployment (Example: Finance Industry)
[Diagram: StreamHorizon embedded into a compute grid. Data sources (market data, trade data, static data) feed a compute grid scheduler running PnL, Credit Risk and Market Risk quant library (QL) instances; a single StreamHorizon instance is deployed at each grid node and interacts directly with the data target (consolidated data warehouse)]
  • 43. Compute Grid - Data Streaming (Example: Finance Industry)
[Diagram: StreamHorizon direct streaming in a compute grid deployment. Quant library (QL) instances scheduled for PnL, Credit Risk and Market Risk act as StreamHorizon direct streaming clients, streaming results to a StreamHorizon instance which persists them to the consolidated data warehouse]
  • 44. Big Data Analytics - Accelerated
  • 45. StreamHorizon & Big Data
Integrates into your data processing pipeline…
•Seamlessly integrates at any point of your data processing pipeline
•Implements generic input/output HDFS connectivity
•Enables you to implement your own, customised input/output HDFS connectivity
•Processes data from heterogeneous sources such as: •Storm •Kafka •TCP Streams •Netty •Local File System •Any other…
Accelerates your clients…
•Reduces network data congestion
•Improves latency of Impala, Hive, HBase or any other Massively Parallel Processing SQL query engine
And more…
•Portable across heterogeneous hardware and software platforms
  • 46. StreamHorizon Flavours - Big Data Processing
Storm - reactive, fast, real-time processing
•Guaranteed data processing; no data loss
•Real-time processing
•Horizontal scalability
•Fault tolerance
•Stateless nodes
•Open source
Kafka
•Designed for processing real-time activity stream data (metrics, KPIs, collections, social media streams)
•A distributed publish-subscribe messaging system for Big Data
•Acts as producer, broker and consumer of message topics
•Persists messages (with the ability to rewind)
•Initially developed by LinkedIn (now owned by Apache)
Hadoop - big, batch-oriented processing
•Batch processing; jobs run to completion
•Stateful nodes
•Scalable
•Guarantees no data loss
•Open source
StreamHorizon integrates seamlessly with all three.
  • 47. StreamHorizon – Big Data Processing Pipeline
  • 48. Big Data & StreamHorizon - Data Persistence (Example: Finance Industry)
[Diagram: data calculation, data processing & HDFS persistence across a Big Data DataNode cluster. A StreamHorizon instance (daemon) runs on each DataNode alongside quant library (QL) data sources, persisting processed data to local HDFS; data processing tasks are coordinated via the HDFS NameNode]
  • 49. Big Data & StreamHorizon - Data Retrieval (Example: Finance Industry)
[Diagram: network congestion comparison for user-group queries routed via the NameNode to a cluster of HDFS/Impala/Hive nodes. The nominal approach ships dimensional and fact data together; the StreamHorizon 'Accelerator Method' ships low-cardinality dimensional data separately from high-cardinality fact data, keeping network congestion nominal]
  • 50. Streaming Data Aggregations - Impact on Big Data Query Latency

Out of the box:
          Query Latency   Memory Footprint   Processing Footprint   Space Consumption
Impala    Medium-Low      Medium-Low         High                   Medium
Hive      High            Medium-High        Nominal                High

With streaming data aggregations:
          Query Latency   Memory Footprint   Processing Footprint   Space Consumption
Impala    Low             Low                Medium-Low             Low
Hive      Low             Medium-Low         Nominal-Low            Low
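Streaming aggregation shifts work from query time to load time: running totals are maintained as records flow through the pipeline, so queries read a small pre-aggregated table instead of scanning raw fact rows. A minimal Python sketch of the idea (class, field and grain names are invented for illustration):

```python
from collections import defaultdict

class StreamingAggregator:
    """Maintain running aggregates while records stream through,
    so a query becomes a lookup rather than a fact-table scan."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def ingest(self, record):
        key = (record["book"], record["date"])   # aggregation grain
        self.totals[key] += record["pnl"]
        self.counts[key] += 1

    def query(self, book, date):
        # O(1) lookup instead of aggregating millions of fact records
        return self.totals[(book, date)]

agg = StreamingAggregator()
for pnl in [10.0, -4.0, 6.0]:
    agg.ingest({"book": "rates", "date": "2014-06-30", "pnl": pnl})
total = agg.query("rates", "2014-06-30")
```

This is why both query latency and the query-time processing footprint drop in the table above: the aggregation cost is paid incrementally during ingestion.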
  • 51. Other Technical Details
•Pluggable architecture - extend and customize the StreamHorizon platform without recompilation
•Integrates with RDBMS vendors via JDBC (Oracle, MSSQL, MySQL, SybaseIQ, Sybase, Teradata, kdb+ and others)
•Seamless integration with major programming and scripting languages (Java, SQL, Shell, Python, Scala, Windows Batch etc.)
•Follows industry-standard software development practices (GIT, Maven, TDD, IMDG…)
•Instant startup, measured in seconds
  • 52. Commodity Server Deployment (details of StreamHorizon benchmark environments)
Target throughput: 1 million records per second
•Commodity Tier 2 SAN storage (no SSD, no local storage used)
•Commodity 2.1 GHz AMD processors - 6 processors (4 cores each)
•Files of 100K records (minimum single-line size of 400 bytes)
•35 attributes per line comprising 13 dimensions (Kimball star schema data model; the fact table has 18 attributes in total)
•Tested with heavily indexed database tables tuned to answer queries scanning up to 10 million fact-table records within 1.4 seconds (without aggregate tables, materialized views or equivalent facilities; standard non-RAC, non-Parallel-Query Oracle Enterprise Edition database instance)
•Testing query profile:
o Aggregate up to 10 million transactional records & return results
o Return up to 10,000 transactional records without aggregation
o All queries execute against a single table partition containing 500 million records
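The star-schema load benchmarked above hinges on swapping each line's natural dimension values for integer surrogate keys held in an in-memory lookup (the role the in-memory data grid plays in the platform). A minimal Python sketch of that step, with invented column names, far fewer than the 13 dimensions used in the benchmark, and none of the platform's actual key-management logic:

```python
class DimensionCache:
    """In-memory lookup: natural dimension value -> surrogate key."""

    def __init__(self):
        self.keys = {}

    def key_for(self, natural):
        if natural not in self.keys:
            self.keys[natural] = len(self.keys) + 1   # next surrogate key
        return self.keys[natural]

def load_fact(line, dims):
    # split a delimited input line and resolve dimension keys in memory,
    # emitting a fact row ready for bulk insert
    book, ccy, amount = line.split(",")
    return (dims["book"].key_for(book),
            dims["ccy"].key_for(ccy),
            float(amount))

dims = {"book": DimensionCache(), "ccy": DimensionCache()}
facts = [load_fact(l, dims) for l in
         ["rates,USD,100.5", "credit,EUR,20.0", "rates,USD,7.25"]]
# facts -> [(1, 1, 100.5), (2, 2, 20.0), (1, 1, 7.25)]
```

Because the lookup stays in memory, no staging table or per-record database round trip is needed, which is what makes throughput in the millions of records per second plausible on commodity hardware.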
  • 53. Desktop Hardware Deployment (details of StreamHorizon benchmark environments)
Target throughput: 250K records per second
•Target hardware: laptops or workstations (commodity personal hardware)
•Local (non-SSD) storage on a single hard disk
•2 AMD A8 processors (4 cores)
•Files of 100K records (minimum single-line size of 400 bytes)
•35 attributes per line comprising 13 dimensions (Kimball star schema data model)