Big Data Hadoop Streaming ETL template for Kafka-Filter-HDFS

Abstract:

To make critical business decisions in real time, many businesses today rely on a variety of data, which arrives in large volumes. Variety and volume together make big data applications complex operations. Big data applications require businesses to combine transactional data with structured, semi-structured, and unstructured data for deep and holistic insights.

Time is also of the essence: to derive the most valuable insights and drive key decisions, large amounts of data have to be continuously ingested into Hadoop data lakes as well as other destinations. As a result, data ingestion is the first challenge a business must overcome before it can embark on data analysis.

With its various Application Templates for ingestion, DataTorrent allows users to:

  • Ingest vast amounts of data with enterprise-grade operability and performance guarantees provided by the underlying Apache Apex framework. Those guarantees include fault tolerance, linear scalability, high throughput, low latency, and end-to-end exactly-once processing.

  • Quickly launch template applications to ingest raw data, with an easy, iterative way to add business logic and processing steps such as parse, dedupe, filter, transform, and enrich to ingestion pipelines.

  • Visualize metrics on throughput, latency, and application data in real time throughout execution.
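
The end-to-end exactly-once guarantee in the list above depends on the sink not applying the same output twice when upstream data is replayed after a failure. The following is a minimal sketch of that idea only, assuming each batch of output carries a monotonically increasing window id and the sink remembers the last id it committed. `IdempotentSink` and `write_window` are hypothetical names for illustration, not the Apache Apex API.

```python
from typing import List


class IdempotentSink:
    """Hypothetical sink that applies each window's output at most once."""

    def __init__(self) -> None:
        self.committed_window = -1   # highest window id already written
        self.lines: List[str] = []   # in-memory stand-in for the external store

    def write_window(self, window_id: int, records: List[str]) -> bool:
        """Write one window's records; return False for a replayed duplicate."""
        if window_id <= self.committed_window:
            return False             # this window was applied before the failure
        self.lines.extend(records)
        self.committed_window = window_id
        return True


if __name__ == "__main__":
    sink = IdempotentSink()
    sink.write_window(0, ["a", "b"])
    sink.write_window(1, ["c"])
    # Simulated recovery: upstream replays windows 0 and 1, then continues with 2.
    for window_id, records in [(0, ["a", "b"]), (1, ["c"]), (2, ["d"])]:
        sink.write_window(window_id, records)
    print(sink.lines)
```

In a real sink the data write and the update of the committed window id would have to happen atomically; the sketch ignores that and keeps both in memory.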

This talk includes a demo of a streaming ETL application built with App Templates. The application extracts data from Kafka with the Kafka operator, transforms it with the Transform operator, filters it, and loads it into HDFS.
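
The extract, transform, filter, and load stages described above can be sketched as a chain of generators. This is an illustration of the data flow only, not the app template's code: the Kafka topic is a list of byte strings, HDFS is a list of output lines, and the record fields (`price`, `quantity`, `total`) and the `min_total` threshold are assumptions made up for the example.

```python
import json
from typing import Callable, Iterable, Iterator, List


def extract(messages: Iterable[bytes]) -> Iterator[dict]:
    """Stand-in for the Kafka input step: decode each raw message as JSON."""
    for raw in messages:
        yield json.loads(raw.decode("utf-8"))


def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Stand-in for the Transform step: derive a new field from existing ones."""
    for rec in records:
        out = dict(rec)
        out["total"] = rec["price"] * rec["quantity"]
        yield out


def filter_records(records: Iterable[dict],
                   predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Stand-in for the Filter step: pass on only records that match."""
    for rec in records:
        if predicate(rec):
            yield rec


def load(records: Iterable[dict], sink: List[str]) -> int:
    """Stand-in for the HDFS output step: append one JSON line per record."""
    count = 0
    for rec in records:
        sink.append(json.dumps(rec, sort_keys=True))
        count += 1
    return count


def run_pipeline(messages: Iterable[bytes], sink: List[str],
                 min_total: float = 100.0) -> int:
    """Wire the four stages together and return the number of lines written."""
    records = transform(extract(messages))
    kept = filter_records(records, lambda r: r["total"] >= min_total)
    return load(kept, sink)


if __name__ == "__main__":
    topic = [
        b'{"id": 1, "price": 20.0, "quantity": 10}',
        b'{"id": 2, "price": 5.0, "quantity": 3}',
    ]
    hdfs_lines: List[str] = []
    written = run_pipeline(topic, hdfs_lines)
    print(written, hdfs_lines)
```

Because each stage is a generator, records flow through one at a time, which loosely mirrors how a streaming pipeline processes data continuously instead of in one batch.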

Presenters:

Mohit Jotwani is a Product Manager at DataTorrent with more than 10 years of experience and expertise in Big Data solutions.

Deepak Narkhede is a Software Engineer at DataTorrent. He has previously worked on storage systems.

Have more questions? Connect with us! https://www.datatorrent.com/contact/

Published in: Technology

  1. Big Data Hadoop Streaming ETL template for Kafka-Filter-HDFS
     Deepak Narkhede (deepak@datatorrent.com), Mohit Jotwani (mohit@datatorrent.com)
  2. Agenda
     • DataTorrent - Vision
     • About Apache Apex
     • App templates
     • Kafka to HDFS App Template
     • Live demo
     • Roadmap
  3. DataTorrent Vision - Productize Big Data
     • Big Data is neither Productized nor Operationalized
     • Total Cost of Ownership (TCO) includes
        ○ Time to Develop + Time to Launch + Cost of ongoing Operations
     • Provide a Product to ...
        ○ Build Applications Rapidly with Simple Interfaces, Pre-Built Apps, Code Reuse & Debuggability
        ○ Support Dev, Test, Prod cycle to Launch Apps quickly
        ○ Manage and Visualize Applications for Operability
  4. Next Gen Big Data Applications
     Diagram labels: Browser; Web Server; Kafka Input (logs); Decompress, Parse, Filter; Dimensions Aggregate; Kafka; Logs; Kafka
     • Variety of sources - IoT, Kafka, files, social media etc.
     • Variety of sinks - Kafka, files, databases etc. (supports low latency real time visualizations as well)
     • Unbounded and continuous data streams
     • Batch support, batch processed as stream
     • In-memory processing with temporal window boundaries
     • Stateful operations: Aggregation, Rules etc. → Analytics
  5. Big Data Ecosystem: Where DataTorrent fits
     Diagram labels: Data Sources (Sensor Data, Social Media, Web Servers, App Servers, Click Streams); Oper1, Oper2, Oper3; Hadoop (YARN + HDFS); Real-time analytics & Visualizations; Real-time Data Visualization
  6. DataTorrent Architecture
     Diagram labels: Solutions for Business Problems; Ingestion & Data Prep; ETL Pipelines; Ease of Use Tools; Real-Time Data Visualization; Management & Monitoring; GUI Application Assembly; Application Templates; Apex-Malhar Operator Library; Big Data Infrastructure; Hadoop 2.x – YARN + HDFS – On Prem & Cloud; Core; High-level API; Transformation; ML & Score; SQL; Analytics; FileSync; Dev Framework; Batch Support; Apache Apex Core; Kafka, HDFS, HDFS, HDFS, JDBC, HDFS, JDBC, Kafka
  7. App Templates – Recurring patterns
     • Building apps, such as Ingestion & Transform apps, for common patterns in customer use cases

     Use Case Pattern | Sources | Processors | Sinks
     Data Synchronization, Staging Data for Analytics | HDFS, Kafka, JDBC, S3 | → | HDFS, S3
     Enriching Data before Staging | HDFS, JDBC, Kafka | Parser → Deduper → Enricher → Formatter | HDFS, Cassandra
     Merge & Transform Data Streams | Kafka, JDBC, File | Stream Merge → Transform → Filter → Enricher | HDFS
     Machine Scoring | Kafka | H2O or Custom | HDFS
  8. AppHub – App Template Repository
     • Central repository for big data application templates
     • Tested and published by DataTorrent
     • Accessible via dtManage on DataTorrent RTS and direct app download from website
     • Provides version updates via dtManage
  9. App Templates advantages
     • Ease of use
        ✓ Quickly import and launch app template applications
        ✓ Easily add business logic by adding custom operators
     • Time to market and TCO
        ✓ Reduces time to production drastically
        ✓ Reduces cost of operations in production
     • Real-time Visualizations
        ✓ Real-time visualizations of operational metrics such as throughput, latency etc.
        ✓ Real-time visualizations of application data such as number of files processed, amount of data transferred etc.
  10. Brief about Kafka
     • Distributed Messaging System.
     • Data Partitioning Capability.
     • Fast Reads and Writes.
     • Basic Terminology
        ○ Topic
        ○ Producer
        ○ Consumer
        ○ Broker
  11. App Template Demo
     • Look at: https://www.datatorrent.com/apphub/
     • Ready to use, customizable applications for big data ingestion use-cases.
     • Source: https://github.com/DataTorrent/app-templates (Apache 2.0)
  12. Kafka to HDFS Filter app-template
  14. Roadmap
     • Visualizations – widgets on app data
        ○ Metrics such as size of data moved, lines per file, number of errors etc.
        ○ Custom user defined metrics using Apex auto-metrics
     • Schema enablement
     • Cloud Integrations
        ○ Amazon EMR, Microsoft Azure
     • Upcoming app templates
        ○ FTP → HDFS
        ○ SFTP → HDFS
        ○ Kinesis → S3
        ○ Kinesis → Redshift
        ○ Kafka → JSON parse → filter → transform → HDFS
        ○ Kafka → CSV parse → filter → transform → HDFS
  15. Questions
     • Send feedback to: https://groups.google.com/forum/#!forum/dt-users
     • Email to: dt-users@googlegroups.com
  16. Resources
     • Apache Apex - http://apex.apache.org/
     • Subscribe to forums
        ○ Apex - http://apex.apache.org/community.html
        ○ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
     • Download - https://datatorrent.com/download/
     • Twitter
        ○ @ApacheApex; Follow - https://twitter.com/apacheapex
        ○ @DataTorrent; Follow - https://twitter.com/datatorrent
     • Meetups - http://meetup.com/topics/apache-apex
     • Webinars - https://datatorrent.com/webinars/
     • Videos - https://youtube.com/user/DataTorrent
     • Slides - http://slideshare.net/DataTorrent/presentations
     • Startup Accelerator – Free full featured enterprise product
        ○ https://datatorrent.com/product/startup-accelerator/
     • Big Data Application Templates Hub – https://datatorrent.com/apphub
  17. We are hiring! jobs@datatorrent.com
     • Developers/Architects
     • QA Automation Developers
     • Information Developers
     • Build and Release
     • Community Leaders
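
The Kafka terms on the "Brief about Kafka" slide (topic, producer, consumer, broker, partitioning) can be made concrete with a toy in-memory model. This is a sketch for intuition, not Kafka client code: the `Topic` class and its methods are hypothetical, the object itself plays the role of the broker's storage, and `zlib.crc32` is a stand-in hash (Kafka's own default partitioner uses a different hash function).

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, str]  # (key, value)


class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[Record]] = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        """Map a key to a partition; equal keys always map to the same one."""
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def produce(self, key: str, value: str) -> Tuple[int, int]:
        """Producer side: append a record and return (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def consume(self, partition: int, offset: int,
                max_records: int = 10) -> List[Record]:
        """Consumer side: read records from a partition starting at an offset."""
        return self.partitions[partition][offset:offset + max_records]


if __name__ == "__main__":
    logs = Topic("logs", num_partitions=3)
    partition, first = logs.produce("user-1", "login")
    logs.produce("user-1", "click")
    # Records with the same key share a partition, so their order is preserved.
    print(logs.consume(partition, first))
```

The consumer tracks its own offset and asks for records from that position, which is why a restarted consumer can resume where it left off.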
