Deep Dive of Kafka to HDFS/Hadoop Ingestion App Template

Presenter: Chaitanya Chebolu, Committer for Apache Apex and Software Engineer at DataTorrent.

Abstract: This webinar is part of a series of presentations from DataTorrent on various application templates in its Application Hub that comes bundled with RTS. This presentation will cover Kafka to HDFS for ingestion of data into Hadoop. This pattern is being used in various production use cases. We will cover various features and corner cases that need to be handled to productize, operationalize, and scale this application template. We will also cover use cases where you can take this template and add your custom logic to meet your product requirements.

If you'd like to learn more about Apache Apex and DataTorrent, feel free to schedule a 30-minute consultation with our team here: https://www.datatorrent.com/30-minutes-consult/

Published in: Technology

Deep Dive of Kafka to HDFS/Hadoop Ingestion App Template

  1. Data Ingestion - Kafka to HDFS
       • Chaitanya Chebolu: Committer, Apache Apex; Engineer, DataTorrent
       • Dec 5, 2016
       • © 2016 DataTorrent
  2. Agenda
       • Introduction to Apache Apex (architecture, application, native Hadoop integration)
       • What is data ingestion
       • Brief overview of Kafka
       • Kafka to HDFS app
       • App templates
       • Kafka to HDFS demo
  3. Apache Apex
       • Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
       • Hadoop native (Hadoop >= 2.2)
           ○ No separate service to manage stream processing
           ○ Streaming engine built into Application Master and containers
       • Process streaming or batch big data
       • High throughput and low latency
       • Library of commonly needed business logic
       • Write any custom business logic in your application
  4. Apex Architecture
  5. An Apex Application is a DAG (Directed Acyclic Graph)
       • A DAG is composed of vertices (operators) and edges (streams).
       • A stream is a sequence of data tuples that connects operators at end-points called ports.
       • An operator takes one or more input streams, performs computations, and emits one or more output streams.
           ○ Each operator is either the user's business logic or a built-in operator from our open source library.
           ○ An operator may have multiple instances that run in parallel.
  6. Typical application example
  7. Apex - Native Hadoop Integration
       • YARN is the resource manager
       • HDFS is used for storing any persistent state
  8. What is Data Ingestion?
       • Data ingestion: the process of obtaining, importing, and analyzing data for later use or storage in a database
       • Big data ingestion:
           ○ Discovering the data sources
           ○ Importing the data
           ○ Processing data to produce intermediate data
           ○ Sending data out to durable data stores
  9. Brief about Kafka
       • Distributed messaging system
       • Data partitioning capability
       • Fast reads and writes
       • Basic terminology
           ○ Topic
           ○ Producer
           ○ Consumer
           ○ Broker
  10. Kafka to HDFS App
       • Diagram labels: Kafka, HDFS
       • Consuming data from Kafka
       • Writing the processed data to HDFS
  11. App Templates
       • Ready-to-use, customizable applications for big data ingestion use cases
       • Look at: https://www.datatorrent.com/apphub/
       • Source: https://github.com/DataTorrent/app-templates (Apache 2.0)
  12. Kafka to HDFS Demo
  13. Kafka to HDFS App Template
       • Import and launch: https://www.youtube.com/watch?v=d0RSeazfjN8
       • Add custom logic: https://www.youtube.com/watch?v=UKIgcYPNepI
  14. Resources
       • http://apex.apache.org/
       • Learn more: http://apex.apache.org/docs.html
       • Subscribe: http://apex.apache.org/community.html
       • Download: http://apex.apache.org/downloads.html
       • Follow @ApacheApex: https://twitter.com/apacheapex
       • Meetups: http://www.meetup.com/pro/apacheapex/
       • More examples: https://github.com/DataTorrent/examples
       • Slideshare: http://www.slideshare.net/ApacheApex/presentations
       • https://www.youtube.com/results?search_query=apache+apex
       • Free Enterprise License for Startups: https://www.datatorrent.com/product/startup-accelerator/
  16. Upcoming events...
       • Wednesday, December 7, 2016 at 7:30pm IST: ETL using RTS
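
The slides describe an Apex application as a DAG of operators joined by streams, and the Kafka to HDFS app as consuming messages and writing processed data out. Below is a minimal sketch of that shape in plain Python. It is not the Apex API or the template's source: `Dag`, `RollingFileWriter`, `run_demo`, and the operator names are hypothetical, an in-memory list stands in for a Kafka topic, and a local temp directory stands in for HDFS. The size-based file rolling is an assumption about one concern such a writer typically handles, not something the slides specify.

```python
import os
import tempfile
from collections import defaultdict


class Dag:
    """Toy DAG: operators are vertices, streams are edges between them."""

    def __init__(self):
        self.operators = {}               # name -> callable(tuple) -> list of output tuples
        self.streams = defaultdict(list)  # upstream operator name -> downstream names

    def add_operator(self, name, fn):
        self.operators[name] = fn

    def add_stream(self, upstream, downstream):
        self.streams[upstream].append(downstream)

    def emit(self, name, tup):
        """Run one tuple through operator `name`, then forward each output downstream."""
        for out in self.operators[name](tup):
            for downstream in self.streams[name]:
                self.emit(downstream, out)


class RollingFileWriter:
    """Sink operator (HDFS stand-in): appends lines and rolls to a new part file by size."""

    def __init__(self, directory, max_bytes):
        self.directory = directory
        self.max_bytes = max_bytes
        self.part = 0      # index of the part file currently being written
        self.written = 0   # bytes written to the current part file

    def __call__(self, line):
        data = (line + "\n").encode("utf-8")
        # Roll before writing if this record would push a non-empty file past the limit.
        if self.written > 0 and self.written + len(data) > self.max_bytes:
            self.part += 1
            self.written = 0
        path = os.path.join(self.directory, f"part-{self.part:05d}")
        with open(path, "ab") as f:
            f.write(data)
        self.written += len(data)
        return []  # a sink emits nothing downstream


def run_demo(messages, max_bytes=32):
    """Wire input -> custom logic -> file output and push `messages` through."""
    out_dir = tempfile.mkdtemp(prefix="ingest-demo-")
    dag = Dag()
    dag.add_operator("kafkaInput", lambda msg: [msg])  # pass-through source stand-in
    # Custom logic: drop blank messages, normalize the rest to upper case.
    dag.add_operator("customLogic",
                     lambda msg: [msg.strip().upper()] if msg.strip() else [])
    dag.add_operator("fileOutput", RollingFileWriter(out_dir, max_bytes))
    dag.add_stream("kafkaInput", "customLogic")
    dag.add_stream("customLogic", "fileOutput")
    for msg in messages:
        dag.emit("kafkaInput", msg)
    return out_dir
```

Calling `run_demo(["alpha", " ", "beta"])` returns the directory that holds the part files. The `customLogic` operator marks the place where, in this toy model, user-specific processing would be slotted in between the input and output operators.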
