What is data Ingestion? Data ingestion is the process of obtaining, importing, and processing data for later use or storage in a database. This process often involves altering individual files by editing their content and/or formatting them to fit into a larger document. An effective data ingestion methodology begins by validating the individual files, then prioritizes the sources for optimum processing, and finally validates the results. When numerous data sources exist in diverse formats (the sources may number in the hundreds and the formats in the dozens), maintaining reasonable speed and efficiency can become a major challenge. To that end, several vendors offer programs tailored to the task of data ingestion in specific applications or environments. Data Ingestion Frameworks Amazon Kinesis – real-time processing of streaming data at massive scale. Apache Chukwa – data collection system. Apache Flume – service to manage large amount of log data. Apache Kafka – distributed publish-subscribe messaging system. Apache Sqoop – tool to transfer data between Hadoop and a structured datastore. Cloudera Morphlines – framework that help ETL to Solr, HBase and HDFS. Facebook Scribe – streamed log data aggregator. Fluentd – tool to collect events and logs. Google Photon – geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency. Heka – open source stream processing software system. HIHO – framework for connecting disparate data sources with Hadoop. Kestrel – distributed message queue system. LinkedIn Databus – stream of change capture events for a database. LinkedIn Kamikaze – utility package for compressing sorted integer arrays. LinkedIn White Elephant – log aggregator and dashboard. Logstash – a tool for managing events and logs. Netflix Suro – log agregattor like Storm and Samza based on Chukwa. Pinterest Secor – is a service implementing Kafka log persistance. Linkedin Gobblin – linkedin’s universal data ingestion framework.
Flume A Flume agent is a Java virtual machine (JVM) process that hosts the components through which events flow. Each agent contains at the minimum a source, a channel and a sink. An agent can also run multiple sets of channels and sinks through a flow multiplexer that either replicates or selectively routes an event. Agents can cascade to form a multihop tiered collection topology until the final datastore is reached. A Flume event is a unit of data flow that contains a payload and an optional set of string attributes. An event is transmitted from its point of origination, normally called a client, to the source of an agent. When the source receives the event, it sends it to a channel that is a transient store for events within the agent. The associated sink can then remove the event from the channel and deliver it to the next agent or the event’s final destination. Flume is distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data stream in situ. Flume and Chukwa share similar goals and features. However, there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in Zookeeper. In contrast, Chukwa distributes this information more broadly among its services. Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machine are responsible for deciding what data to send. Resources : Flume Tutorial How to Refine and Visualize Server Log Data How To Refine and Visualize Sentiment Data
Chukwa Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop distributed filesystem (HDFS) andMapReduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a ﬂexible and powerful toolkit for displaying monitoring and analyzing results, in order to make the best use of this collected data. Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS stil does not support appending to existing files. Chukwa is a Hadoop subproject that bridges that gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Some of the durability features include agent-side replying of data to recover from errors. Resources: Chukwa quick start guide
Sqoop Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates. Sqoop provides a pluggable mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems
Resources: Import from Microsoft SQL Server into the Hortonworks Sandbox using Sqoop Kafka Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high throughput persistent messaging that’s scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize IO performance and mirroring to improve availability, scalability and to optimize performance in multiple-cluster scenarios. Fast A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Scalable Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers Durable Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact. Distributed by Design Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees. Resources: Kafka Tutorial Storm and Kafka Simulating and Transporting Rea-Time Event Storm Hadoop is ideal for batch-mode processing over massive data sets, but it doesn’t support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose, event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts and each processing node is called a bolt. Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms. How Storm Works A storm cluster has three sets of nodes: Nimbus node (master node, similar to the Hadoop JobTracker): Uploads computations for execution Distributes code across the cluster Launches workers across the cluster Monitors computation and reallocates workers as needed ZooKeeper nodes – coordinates the Storm cluster Supervisor nodes – communicates with Nimbus through Zookeeper, starts and stops workers according to signals from Nimbus Five key abstractions help to understand how Storm processes data: Tuples– an ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7) Streams – an unbounded sequence of tuples. Spouts –sources of streams in a computation (e.g. a Twitter API) Bolts – process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases. Topologies – the overall calculation, represented visually as a network of spouts and bolts (as in the following diagram) Storm users define topologies for how to process the data when it comes streaming in from the spout. When the data comes in, it is processed and the results are passed into Hadoop. Resources: Ingesting and processing Real-time events with Apache Storm Process real-time big data with Twitter Storm Real time Data Ingestion in HBase & Hive using Storm Bolt
Elasticsearch Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements. Here are a few sample use-cases that Elasticsearch could be used for: You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them. You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you. You run a price alerting platform which allows price-savvy customers to specify a rule like “I am interested in buying a specific electronic gadget and I want to be notified if the price of gadget falls below $X from any vendor within the next month”. In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found. You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data. Resources: Exploring Elasticserach Logstash + Elasticsearch + Kibana Spring XD Spring XD is a unified, distributed, and extensible service for data ingestion, real time analytics, batch processing, and data export. The Spring XD project is an open source Apache 2 License licenced project whose goal is to tackle big data complexity. Much of the complexity in building real-world big data applications is related to integrating many disparate systems into one cohesive solution across a range of use-cases. Common use-cases encountered in creating a comprehensive big data solution are High throughput distributed data ingestion from a variety of input sources into big data store such as HDFS or Splunk Real-time analytics at ingestion time, e.g. gathering metrics and counting values. Workflow management via batch jobs. The jobs combine interactions with standard enterprise systems (e.g. RDBMS) as well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive or HBase). High throughput data export, e.g. from HDFS to a RDBMS or NoSQL database. Resources: SpringXD Guide SpringXD Quick Start HDP + SpringXD
Solr Apache Solr is a powerful search server, which supports REST like API. Solr is powered by Lucene which enables powerful matching capabilities like phrases, wildcards, joins, grouping and many more across various data types. It is highly optimized for high traffic using Apache Zookeeper. Apache Solr comes with a wide set of features and we have listed a subset of high impact features. Advanced Full-Text search capabilities. Standards based on Open Interfaces – XML, JSON and Http. Highly scalable and fault tolerant. Supports both Schema and Schemaless configuration. Faceted Search and Filtering. Support major languages like English, German, Chinese, Japanese, French and many more Rich Document Parsing. Resources: Solr for beginners Solr Tutorial HDP + Solr Additional Resources: Distcp Guide Real time Data Ingestion in HBase & Hive Elasticsearch Vs Solr
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows
1. Data Flow Challenges in an Enterprise
2. Introduction to Apache NiFi
3. Core Features
5. Demo – Simple Lambda Architecture
6. Use Cases
7. Q & A
Data Flow Challenges in an Enterprise
Connected Enterprises in a Distributed World
Evolution of DataProjects inanEnterprise
working directly on
(mostly web) using
a RDBMS at
trying to do BI
failed miserably in
version of truth
EDW, ETL, BI, OLAP,
Issues in Scalability,
started showing up
at totally different
Emergence of MPP
Self-service BI tools
Time analytics /
Capture & Analysis,
App App App
Caches & Derived Store
Poll For Changes
App App App
Log Aggregator /
H A D O O P
Data Guard /
Golden Gate /
App App App
H A D O O P
App App App
NoSQL / Key-Value
NoSQL / Key-Value
Andwe endupin a “Giant Hairball” Arch…
App App App
Caches & Derived Store
Poll For Changes
Data Guard /
Golden Gate /
H A D O O P
Log Aggregator /
App App App
NoSQL / Key-Value
Log Aggregator / Flume
- Purpose built ( Not designed with Universal
- Induces lot of complexity in project architecture
- Hard to extend
Common Data Integration Problems
Size and Velocity
Messages in Streaming manner
Tiny to small files in micro batches time.
Small files in mini batches.
Medium to large files in batches.
CSV, TSV, PSV, TEXT, JSON, XML, XLS, XLSX, PDF, BMP, PRN
Avro, Protocol Buffer, Parquet, RC, ORC, Sequence File
Zip, GZIP, TAR, LZIP, 7z, RAR
File Share, FTP, REST, HTTP, TCP, UDP
Once, Every day, every hour, every minute, every second,
Push / Pull / Poll
Asynchronous Operation Challenges
Fast edge consumers + slow processors = everything breaks
Process Message A first, all others can take a backseat.
Data should be secure – not just at rest, but in motion too.
Can you route a copy of this to our NoSQL store as well after
converting it to JSON.
Ability to run from failure (checkpoint / rerun / replay)
Merge small files to large files for Hadoop
Break large files into smaller manageable chunks for NoSQL
What is Apache NiFi, it’s History, and some terminology.
Whatis Apache NiFi
NiFi (short for “Niagara Files”) is a powerful enterprise grade dataflow tool that
can collect, route enrich, transform and Process data in a scalable manner.
NiFi is based on the concepts of flow-based programming (FBP). FBP is a
programming paradigm that defines applications as networks of "black box"
processes, which exchange data across predefined connections by message
passing, where the connections are specified externally to the processes.
Single combined platform for
Simple event processing
Transport and delivery
Designed to accommodate highly diverse and
• It has Visual command and control interface which allows you to define and
manipulate data flows in real-time and with great agility.
Developed by the National Security Agency (NSA) for over 8 years
Open sourced in Nov 2014 (Apache). Major contributors were ex-NSA
who formed a company named Onyara. Lead – Joe Witt.
Become Apache Top Level Project in July 2015
In August 2015, Hortonworks acquired Onyara
In September 2015, Hortonworks released HDF 1.0 powered by NiFi.
Current version is HDF 1.2
HDF has got solid backing of Hortonworks.
FlowFile (Information Packet)
Unit of data (each object) moving through the system
Content + Attributes (key/value pairs)
Processor (Black Box)
Performs the work, can access FlowFiles
Currently there are 135 different processors
Connection (Bounded Buffer)
Links between processors
Queues that can be dynamically prioritized
Process Group (Subnet)
Set of processors and their connections
Receive data via input ports, send data via output ports
Flow Controller (Scheduler)
Maintains the knowledge of how processes are connected and
manages the threads and their allocation.
<< UUID, Name, Size, In time, Attribute Map
Types (Events, Files,
Objects, Messages etc.),
Formats (JSON, XML,
AVRO, Text, Proprietary
etc.), Size (B to GBs)
Transformation (enrich, filter,
convert, split, aggregate,
custom), Mediation (push /
ApacheNiFi is not
• NiFi is not a distributed computation Engine
• An engine to do CEP (Complex event processing)
• A computational framework to do distributed Joins or Rolling Window Aggregations the way
• Hence it’s not based on Map Reduce / Spark or any other framework.
• NiFi doesn’t have any dependency on any big data tool like Hadoop or zookeeper etc. All it needs is Java.
• It’s not a full fledge ETL tool like Informatica / Pentaho / Talend / SSIS as of now. But it will be - eventually.
• It’s not a long term Data storage tool. It only holds data temporarily for re-run / data provenance purposes.
• It’s not a document indexer. It’s indexing capabilities are only to help in troubleshooting / debugging.
What are the core features and benefits of Apache NiFi?
• Even at very high scale, delivery is guaranteed
• Persistent Write Ahead Log (Flow File Repository) and Data Partitioning
(Content Repository) ensures this. They are together designed in a way that
• Very high transaction rates
• Effective load spreading
• Copy-on-write scheme (for every change in data)
• Supports buffering of all queued data.
• Ability to back-pressure (Even if there is no load balancing, nodes can say “Back-Off”
and other nodes in the pipeline pick up the slack.
• When backpressure is applied to a connection, it will cause the processor that is the
source of the connection to stop being scheduled to run until the queue clears out.
However, data will still queue up in that processor's incoming connections.
• NiFi allows the setting of one or more prioritization schemes for how data is
retrieved from a queue.
• Oldest First, Newest first, Largest first, Smallest First, or custom scheme
• The default is oldest first
• NiFi by design is Highly Extensible.
• One can write custom:
• These extensions are bundles in something called as NAR Files (NiFi
Visual Interface forCommand andControl
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processor & connections
• View attributes and content at given points in time
(before and after each processor) !!!
• Records, indexes, and makes events available for
Single data-source agnostic collection platform
Intuitive, real-time visual user interface with drag-and-drop capabilities
Powerful Data security capabilities from source to storage
Highly granular data sharing policies
Ability to react in real time by leveraging bi-directional data flows and prioritized data feeds
Extremely scalable, extensible platform
High level architecture (single machine), Primary components
Host NiFi's HTTP-based command and
Real Brain. Provide and manage threads.
Runs within JVM. Processor / Controller
Service / Reporting Service / U I/
State of about a given FlowFile which is
presently active in the flow. WAL.
Actual content bytes of a given FlowFile.
Blocks of data in FS. More than 1 FS
All provenance event data is stored.
Saved on FS. Indexed / Searchable.
Building Ingestion and Delivery layers in IoT Solutions
Ingestion tier in Lambda Architecture (for feeding both speed and batch layers)
Ingestion tier in Data Lake Architectures
Cross Geography Data Replication in a secure manner
Integrating on premise system to on cloud system (Hybrid Cloud Architecture)
Simplifying existing Big Data architectures which are currently using Flume, Kafka, Logstash, Scribe etc. or
Developing Edge nodes for Trade repositories.
Enterprise Data Integration platform
And many more…