This document discusses enabling real-time analytics in the enterprise. It begins with an overview of the challenges of real-time analytics due to non-integrated systems, varied data types and volumes, and data management complexity. A case study on real-time quality analytics in automotive is presented, highlighting the need to analyze varied data sources quickly to address issues. The Hortonworks/Attunity solution is then introduced using Attunity Replicate to integrate data from various sources in real-time into Hortonworks Data Platform for analysis. A brief demonstration of data streaming from a database into Kafka and then Hortonworks Data Platform is shown.
Enabling Real-Time Analytics with Hortonworks and Attunity
1. Enabling the Real-Time
Analytical Enterprise
Michael Ger
General Manager, Manufacturing & Automotive
Hortonworks
Jordan Martz
Director, Technology Solutions
Attunity
Chris Gambino
Solutions Engineer
Hortonworks
NiFi becomes the central ingest mechanism for all data coming into the environment. NiFi can be instantiated as its own web server listening for PUT/POST requests from different sources, or it can pull from sources such as S3, SFTP, HTTPS, RDBMS, and more. Once the data is inside NiFi, we can do simple event processing, launch parsers that PerkinElmer has already created, and perform data enrichment before landing the data in a Kafka queue. NOTE: throughout the data lifecycle within NiFi and within HDP, we track lineage, such as where the data came from and how it was manipulated.
Placing the data in a Kafka queue allows other engines, such as Spark, to easily pick up the data and begin working with it. Spark will be used for the heavy-lifting ETL processing, along with the SQL and machine learning capabilities it provides. Running Spark on Hadoop allows Spark to adhere to security controls (file-, folder-, and column-level security). Writing the data out in the ORC file format keeps it flexible for use with other options such as Hive.
When the data has been processed through Spark and stored via the HiveContext as ORC files (or simply left in Spark), we can then expose it to external tools such as Spotfire for analysts to work with.
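The simple event processing and enrichment NiFi performs before landing records in Kafka might look like the following pure-Python sketch. The field names, the unit-conversion rule, and the source URI are all hypothetical, chosen only to illustrate parsing, enrichment, and the lineage stamping mentioned above:

```python
import json
from datetime import datetime, timezone

def enrich(raw: str, source: str) -> dict:
    """Hypothetical enrichment step: parse a raw event, normalize
    units, and stamp lineage metadata before the record is handed
    to a Kafka producer."""
    event = json.loads(raw)
    # Normalize a temperature reading from Fahrenheit to Celsius
    # (illustrative rule; real parsers would be vendor-specific).
    if "temp_f" in event:
        event["temp_c"] = round((event.pop("temp_f") - 32) * 5 / 9, 2)
    # Lineage: record where the data came from and when it was processed.
    event["_lineage"] = {
        "source": source,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
    return event

record = enrich('{"device": "sensor-1", "temp_f": 98.6}', source="sftp://lab-01")
print(record["temp_c"])  # → 37.0
```

In a real flow this logic would live in NiFi processors (or a custom processor), with provenance events providing the lineage automatically.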
What the industry cares about:
Hadoop has moved out of test
Enterprise Use Case
Closer to Production
Business impact
Enterprise + Real-Time vs. Sqoop + Batch
Attunity + Replicate
High-performance connectivity to Hadoop through native APIs for data ingest and publication
Automated schema generation in HCatalog
Drag & drop configuration with Click-2-Replicate design
High-speed data load options:
Full reload with overwrite
Insert only appends
Change Data Capture (CDC)
In-memory data filtering and transformation
Monitoring dashboard with web-based metrics, alerts and log file management
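The three load options listed above can be sketched in a few lines, using an in-memory dict keyed by primary key to stand in for the target table. These functions are illustrative semantics only, not Replicate APIs:

```python
def full_reload(target: dict, snapshot: list) -> None:
    """Full reload with overwrite: drop the target and load the snapshot."""
    target.clear()
    target.update({row["id"]: row for row in snapshot})

def append_only(target: dict, rows: list) -> None:
    """Insert-only appends: add new rows, never update or delete."""
    for row in rows:
        target.setdefault(row["id"], row)

def apply_cdc(target: dict, changes: list) -> None:
    """Change Data Capture: replay inserts/updates/deletes in source order."""
    for ch in changes:
        if ch["op"] == "delete":
            target.pop(ch["id"], None)
        else:  # insert or update
            target[ch["id"]] = {"id": ch["id"], **ch["data"]}

table = {}
full_reload(table, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
apply_cdc(table, [{"op": "update", "id": 2, "data": {"v": "b2"}},
                  {"op": "delete", "id": 1}])
print(table)  # → {2: {'id': 2, 'v': 'b2'}}
```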
Another advantage of Replicate is agentless data replication for mainstream database systems. Recently a customer needed to ingest data from 4,500 applications across hundreds of databases into Hadoop. With Replicate they were able to do this without installing an agent on each source system, because Replicate extracts source logs remotely in an optimized manner and processes the data in memory on the Replicate server. This also simplifies product maintenance, since it does not require maintaining and upgrading software agents on each source or target system.
ATTU/CDC
Automated data ingest
Incremental updates with Change Data Capture (CDC)
Broad support for many enterprise data sources
HDP/HDF
Rapid deployments of HUGE data lakes
Continuous data refresh for RELEVANT analytics
COMPLETE datasets across databases, DWs and mainframes
HDF and HDP form the Connected Data Platform
Data in Motion (connected, real-time, tracked) and Data at Rest (massive scale analysis, retention, security)
Modern Data Applications are built on the Connected Data Platform
Metron for example
Customer built applications
One of the reasons several large technology companies trust and rely on Attunity for their own solutions is the robust CDC capability that Replicate provides.
There are several options that are built into the product that provide flexible and optimized ways to implement change data capture.
In addition to applying transactions in real time and in order, Replicate can handle varying volumes of changes on the source systems by applying the changes in optimized batches to improve throughput and reduce latency.
To provide high-speed data loads into data warehouse appliances, Replicate integrates with native data warehouse loaders for fast ingestion into the target, after which changes are merged in the target. It does not rely on suboptimal ODBC for loading data into warehouse systems.
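One way such batch optimization can work, sketched under the assumption that each change carries a primary key and arrives in source order: coalesce the batch so each key appears once, and only the net effect per row is merged into the target. This is an illustrative technique, not Replicate's actual implementation:

```python
def coalesce_changes(changes: list) -> list:
    """Collapse an ordered batch of changes so each key appears at
    most once, keeping only the net effect per row."""
    net = {}
    for ch in changes:
        prev = net.get(ch["key"])
        if ch["op"] == "delete" and prev and prev["op"] == "insert":
            # An insert followed by a delete within the batch cancels out.
            del net[ch["key"]]
        elif prev and prev["op"] == "insert" and ch["op"] == "update":
            # An update after an insert in the same batch is still a net insert.
            net[ch["key"]] = {"op": "insert", "key": ch["key"], "data": ch["data"]}
        else:
            net[ch["key"]] = ch
    return list(net.values())

batch = [
    {"op": "insert", "key": 7, "data": {"v": 1}},
    {"op": "update", "key": 7, "data": {"v": 2}},
    {"op": "insert", "key": 8, "data": {"v": 5}},
    {"op": "delete", "key": 8},
]
net_batch = coalesce_changes(batch)
print(net_batch)  # → [{'op': 'insert', 'key': 7, 'data': {'v': 2}}]
```

Merging the coalesced batch means far fewer statements against the target than replaying every change row by row.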
And recently, Attunity added support for writing changes in a message-encoded format that can be published to Kafka message brokers as well.
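Published to Kafka, a captured change might be encoded as a JSON envelope along these lines. The field names here are hypothetical, chosen to illustrate the idea, and are not Attunity's actual wire format:

```python
import json

def encode_change(table: str, op: str, key: dict, data: dict, lsn: int) -> bytes:
    """Encode a captured change as a JSON Kafka message value
    (illustrative envelope, not Attunity's wire format)."""
    envelope = {
        "table": table,    # source table the change came from
        "operation": op,   # insert / update / delete
        "key": key,        # primary-key columns of the changed row
        "data": data,      # column values after the change
        "lsn": lsn,        # log sequence number, preserves source order
    }
    return json.dumps(envelope, sort_keys=True).encode("utf-8")

msg = encode_change("orders", "update", {"order_id": 42},
                    {"status": "shipped"}, lsn=1001)
print(msg.decode())
```

Keying the Kafka message on the primary-key columns keeps all changes for a given row in one partition, preserving their order for downstream consumers.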
Data transport and integration
Log data
Database changes
Sensors and device data
Monitoring streams
Call data records
Stock ticker data
Real-time stream processing
Monitoring
Asynchronous applications
Fraud and security