DATA INGESTION
Hadoop
INTRODUCTION
 Definition
 Data ingestion is the process of obtaining and importing data
for immediate use or storage in a database. To ingest
something is to "take something in or absorb something."
 Data can be streamed in real time or ingested in batches.
 When data is ingested in real time, each data item is
imported as it is emitted by the source.
 When data is ingested in batches, data items are
imported in discrete chunks at periodic intervals of time.
 Note
 An effective data ingestion process begins by prioritizing data
sources, validating individual files and routing data items to the
correct destination.
INTRODUCTION CONTD..
 When numerous big data sources exist in
diverse formats (the sources may often number in the
hundreds and the formats in the dozens), it can be
challenging for businesses to ingest data at a reasonable
speed and process it efficiently in order to maintain a
competitive advantage.
 To that end, vendors offer software programs that are
tailored to specific computing environments or
software applications.
 When data ingestion is automated, the software used to
carry out the process may also include data
preparation features to structure and organize data so it
can be analyzed on the fly or at a later time by business
intelligence (BI) and business analytics (BA) programs.
BIG DATA INGESTION PATTERNS
 A common pattern that a lot of companies use to
populate a Hadoop-based data lake is to get data
from pre-existing relational databases and data
warehouses.
 When planning to ingest data into the data lake, one
of the key considerations is to determine how to
organize data and enable consumers to access the
data.
 Hive and Impala provide a data infrastructure on top
of Hadoop – commonly referred to as SQL on
Hadoop – that provides structure to the data and the
ability to query the data using a SQL-like language.
KEY ASPECTS TO CONSIDER
 Before you start to populate, say, Hive
databases/schemas and tables with data, there are
two key aspects to consider:
 Which data storage format to use when storing
data? (HDFS supports a number of file formats,
such as SequenceFile, RCFile, ORCFile,
Avro, Parquet, and others.)
 What are the optimal compression options for
files stored on HDFS? (Examples include gzip,
LZO, Snappy, and others.)
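For illustration, both choices are typically made in the table DDL. A minimal sketch of a Hive table stored as ORC with Snappy compression follows; the table and column names are hypothetical, and running the DDL requires a live Hive installation, so the invocation is shown commented out.

```shell
# Sketch: Hive DDL selecting the ORC storage format with Snappy compression.
# Table and column names are hypothetical examples.
cat > create_orders.hql <<'EOF'
CREATE TABLE orders (
  order_id BIGINT,
  amount   DOUBLE,
  ts       STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
EOF

# Run against a live Hive installation:
# hive -f create_orders.hql
```

The same table could instead be declared with `STORED AS PARQUET` or another format; the storage format and compression codec are independent knobs.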
HADOOP DATA INGESTION
 Today, most data is generated and stored
outside Hadoop, e.g. in relational databases
and plain files. Data ingestion is therefore
the first step in utilizing the power of Hadoop.
Various utilities have been developed to
move data into Hadoop.
BATCH DATA INGESTION
 The File System Shell includes various shell-like
commands,
including copyFromLocal and copyToLocal, that
directly interact with HDFS as well as other
file systems that Hadoop supports. Most of the
commands in the File System Shell behave like their
corresponding Unix commands. When the data
files are ready in the local file system, the shell is a
great tool to ingest data into HDFS in batch. In
order to stream data into Hadoop for real-time
analytics, however, we need more advanced
tools, e.g. Apache Flume and Apache
Chukwa.
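A sketch of the batch pattern with the File System Shell; the HDFS paths are hypothetical, and the hadoop commands assume a running cluster, so they are shown commented out.

```shell
# Sketch of batch ingestion with the File System Shell.
# HDFS paths are hypothetical; the hadoop commands require a running
# cluster, so they are commented out here.
printf '2024-01-01,widget,42\n' > sales.csv        # sample local data file

# hadoop fs -mkdir -p /data/landing/sales           # create a landing directory
# hadoop fs -copyFromLocal sales.csv /data/landing/sales/
# hadoop fs -ls /data/landing/sales                 # verify the file arrived
# hadoop fs -copyToLocal /data/landing/sales/sales.csv roundtrip.csv
```

copyFromLocal moves data into HDFS in one batch; copyToLocal retrieves it, e.g. for local inspection.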
STREAMING DATA INGESTION
 Apache Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of log data
into HDFS.
 It has a simple and flexible architecture based on streaming data flows;
it is robust and fault tolerant, with tunable reliability mechanisms and
many failover and recovery mechanisms.
 It uses a simple extensible data model that allows for online analytic
application.
 Flume employs the familiar producer-consumer model. Source is the
entity through which data enters into Flume. Sources either actively poll
for data or passively wait for data to be delivered to them. On the other
hand, Sink is the entity that delivers the data to the destination. Flume
has many built-in sources (e.g. log4j and syslogs) and sinks (e.g. HDFS
and HBase). Channel is the conduit between the Source and the Sink.
Sources ingest events into the channel and the sinks drain the channel.
Channels allow decoupling of the ingestion rate from the drain rate. When data
is generated faster than the destination can handle, the channel
size increases.
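The source–channel–sink pipeline above can be sketched as a minimal single-agent Flume configuration; the hostnames, port, and HDFS path below are hypothetical examples, and launching the agent requires a Flume installation plus a reachable HDFS, so the launch command is commented out.

```shell
# Minimal single-agent Flume configuration: a netcat source feeding an HDFS
# sink through a memory channel. Hostname, port, and HDFS path are examples.
cat > flume.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: listens on a TCP port and turns each received line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer decoupling ingestion rate from drain rate
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: drains the channel into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.channel = c1
EOF

# Launch the agent:
# flume-ng agent --conf conf --conf-file flume.conf --name a1
```

If events arrive faster than the HDFS sink can write them, they accumulate in the memory channel up to its configured capacity.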
STREAMING DATA INGESTION
 Apache Chukwa is devoted to large-scale log collection and
analysis, built on top of the MapReduce framework. Beyond data
ingestion, Chukwa also includes a flexible and powerful toolkit for
displaying, monitoring, and analyzing results. Unlike Flume,
Chukwa is not a continuous stream processing system but a
mini-batch system.
 Apache Kafka and Apache Storm may also be used to ingest
streaming data into Hadoop although they are mainly designed to
solve different problems. Kafka is a distributed publish-subscribe
messaging system. It is designed to provide high throughput
persistent messaging that’s scalable and allows for parallel data
loads into Hadoop. Storm is a distributed realtime computation
system for use cases such as realtime analytics, online machine
learning, continuous computation, etc.
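As a sketch of the Kafka side, a topic can be created and fed with Kafka's standard console tools; the broker address, topic name, and log path below are hypothetical, and the commands require a running broker, so they are shown commented out.

```shell
# Sketch: publish log lines to Kafka with the console tools.
# Broker address, topic name, and log path are hypothetical; the kafka
# commands require a running broker, so they are commented out here.
printf 'INFO app started\nWARN low disk\n' > app.log   # sample log data

# kafka-topics.sh --create --topic app-logs --partitions 3 \
#   --replication-factor 1 --bootstrap-server localhost:9092
# kafka-console-producer.sh --topic app-logs \
#   --bootstrap-server localhost:9092 < app.log
```

A downstream consumer (e.g. a Storm topology or a Hadoop loader) would then read from the app-logs topic in parallel.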
STRUCTURED DATA INGESTION
 Apache Sqoop is a tool designed to efficiently
transfer data between Hadoop and relational
databases. We can use Sqoop to import data from a
relational database table into HDFS. The import
process is performed in parallel and thus generates
multiple files in the format of delimited text, Avro, or
SequenceFile. In addition, Sqoop generates a Java
class that encapsulates one row of the imported
table, which can be used in subsequent MapReduce
processing of the data. Moreover, Sqoop can export
the data (e.g. the results of MapReduce processing)
back to the relational database for consumption by
external applications or users.
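As a sketch, a Sqoop import can be captured in an options file (one option per line); the connection string, username, and table names below are hypothetical, and running the tool requires Sqoop on a Hadoop cluster, so the invocations are shown commented out.

```shell
# Sketch: a Sqoop import expressed as an options file (one option per line).
# Connection string, username, and table names are hypothetical.
cat > import-orders.txt <<'EOF'
import
--connect
jdbc:mysql://dbhost/sales
--username
ingest
--table
orders
--target-dir
/data/orders
--as-avrodatafile
--num-mappers
4
EOF

# sqoop --options-file import-orders.txt

# Exporting processed results back to the database follows the same shape:
# sqoop export --connect jdbc:mysql://dbhost/sales --username ingest \
#   --table order_summary --export-dir /data/order_summary
```

--num-mappers controls the degree of parallelism of the import, and hence the number of output files written to the target directory.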
DATA INGESTION TOOLS
 Apache Hive
 Apache Flume
 Apache NiFi
 Apache Sqoop
 Apache Kafka
APACHE FLUME
 A service for streaming logs into Hadoop
 Flume lets Hadoop users ingest high-volume streaming data into HDFS
for storage.
 Specifically, Flume allows users to:
 Stream data
 Ingest streaming data from multiple sources into Hadoop for storage and analysis
 Insulate systems
 Buffer the storage platform against transient spikes, when the rate of incoming data exceeds
the rate at which it can be written to the destination
 Guarantee data delivery
 Flume NG uses channel-based transactions to guarantee reliable message delivery.
When a message moves from one agent to another, two transactions are started, one
on the agent that delivers the event and the other on the agent that receives the event.
This ensures guaranteed delivery semantics.
 Scale horizontally
 To ingest new data streams and additional volume as needed
APACHE FLUME
 Enterprises use Flume’s powerful streaming
capabilities to land data from high-throughput
streams in the Hadoop Distributed File System
(HDFS). Typical sources of these streams are
application logs, sensor and machine data,
geo-location data and social media. These different
types of data can be landed in Hadoop for future
analysis using interactive queries in Apache
Hive, or they can feed business dashboards
that are served ongoing data by Apache HBase.
EXAMPLE OF FLUME
 Flume is used to log manufacturing
operations. When one run of product comes
off the line, it generates a log file about that
run. Even if this occurs hundreds or
thousands of times per day, the large volume
of log file data can stream through Flume into a
tool for same-day analysis with Apache
Storm, or months or years of production runs
can be stored in HDFS and analyzed by a
quality assurance engineer using Apache
Hive.
FLUME ILLUSTRATION
HOW FLUME WORKS
 Flume’s high-level architecture is built on a
streamlined codebase that is easy to use and
extend. The project is highly reliable, without
the risk of data loss. Flume also supports
dynamic reconfiguration without the need for
a restart, which reduces downtime for its
agents.
COMPONENTS OF FLUME
 Event
 A singular unit of data that is transported by Flume (typically a single log entry)
 Source
 The entity through which data enters into Flume. Sources either actively poll for data or
passively wait for data to be delivered to them. A variety of sources allow data to be
collected, such as log4j logs and syslogs.
 Sink
 The entity that delivers the data to the destination. A variety of sinks allow data to be
streamed to a range of destinations. One example is the HDFS sink that writes events to
HDFS.
 Channel
 The conduit between the Source and the Sink. Sources ingest events into the channel and
the sinks drain the channel.
 Agent
 Any physical Java virtual machine running Flume. It is a collection of sources, sinks and
channels.
 Client
 The entity that produces and transmits the Event to the Source operating within the Agent.
COMPONENT INTERACTION
 A flow in Flume starts from the Client.
 The Client transmits the Event to a Source operating within the Agent.
 The Source receiving this Event then delivers it to one or
more Channels.
 One or more Sinks operating within the same Agent drain
these Channels.
 Channels decouple the ingestion rate from drain rate using the familiar
producer-consumer model of data exchange.
 When spikes in client-side activity cause data to be generated faster
than the provisioned destination capacity can handle,
the Channel size increases. This allows sources to continue normal
operation for the duration of the spike.
 The Sink of one Agent can be chained to the Source of another Agent.
This chaining enables the creation of complex data flow topologies.
 Note
 Flume’s distributed architecture requires no central coordination
point. Each agent runs independently of others with no inherent single point
of failure, and Flume can easily scale horizontally.
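The agent chaining described above can be sketched as two agents wired sink-to-source over Avro; the hostnames, ports, log path, and HDFS path below are hypothetical, and starting the agents requires Flume on each host, so the launch commands are commented out.

```shell
# Two chained Flume agents: agent1's Avro sink forwards events to agent2's
# Avro source on another host. Hostnames, ports, and paths are examples.
cat > chained-flume.conf <<'EOF'
# --- agent1: collects locally, forwards over Avro ---
agent1.sources  = r1
agent1.channels = c1
agent1.sinks    = k1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /var/log/app.log
agent1.sources.r1.channels = c1
agent1.channels.c1.type = memory
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1

# --- agent2: receives over Avro, writes to HDFS ---
agent2.sources  = r1
agent2.channels = c1
agent2.sinks    = k1
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
agent2.sources.r1.channels = c1
agent2.channels.c1.type = memory
agent2.sinks.k1.type = hdfs
agent2.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/chained
agent2.sinks.k1.channel = c1
EOF

# Each agent is started on its own host, selected by --name:
# flume-ng agent --conf conf --conf-file chained-flume.conf --name agent1
# flume-ng agent --conf conf --conf-file chained-flume.conf --name agent2
```

Fan-in topologies (many agent1-style collectors feeding one agent2-style aggregator) are built the same way, by pointing multiple Avro sinks at one Avro source.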
APACHE NIFI
 Apache NiFi is a secure integrated platform for real-time data
collection, simple event processing, transport and delivery from
source to storage. It is useful for moving distributed data to and
from your Hadoop cluster. NiFi has extensive distributed processing
capability to help reduce processing cost and get real-time
insights from many different data sources across many large
systems, and it can help aggregate that data into a single place or
many different places.
 NiFi lets users get the most value from their data. Specifically
NiFi allows users to:
 Stream data from multiple sources
 Collect high volumes of data in real time
 Guarantee delivery of data
 Scale horizontally across many machines
HOW NIFI WORKS
 NiFi’s high-level architecture is focused on delivering a streamlined
interface that is easy to use and easy to set up.
 Basic Terminology
 Processor: Processors in NiFi are what make the data move. Processors
can help generate data, run commands, move data, convert data, and much
more. NiFi’s architecture and feature set are designed to be extended through
these processors. They are at the very core of NiFi’s functionality.
 Processing Group: When data flows get very complex, it can be very useful
to group together parts which perform certain functions. NiFi
abstracts this concept and calls these processing groups.
 FlowFile: A FlowFile in NiFi represents a single piece of data. It is made
up of two parts: attributes and content. Attributes are key-value pairs that
give the data context. Typically there are 3 attributes
present on all FlowFiles: uuid, filename, and path.
 Connections and Relationships: NiFi allows users to simply drag and drop
connections between processors, which control how the data will flow. Each
connection is assigned to different types of relationships for the
FlowFiles (such as successful processing, or a failure to process).
WORKING
 A FlowFile can originate from a processor in
NiFi. Processors can also receive FlowFiles
and transmit them to many other
processors. These processors can then drop
the data in the FlowFile into various places
depending on the function of the processor.
WHAT YOU NEED
 Oracle VirtualBox virtual machine (VM).
 ODBC driver that matches the version of Excel
you are using (32-bit or 64-bit).
 Power View feature in Excel 2013 to visualize
the server log data.
 Power View is currently only available in Microsoft
Office Professional Plus and Microsoft Office 365
Professional Plus.
 Hortonworks DataFlow (HDF) installed on the
Sandbox; you’ll need to download the latest
HDF release.
HOW NIFI LOOKS LIKE
IMPORT FLOW IN NIFI
THE FLOW LOOKS LIKE THIS
VERIFYING THE IMPORT

More Related Content

What's hot

Data warehousing
Data warehousingData warehousing
Data warehousing
Shruti Dalela
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
Avkash Chauhan
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure
Antonios Chatzipavlis
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Sql server basics
Sql server basicsSql server basics
Sql server basics
VishalJharwade
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
Sulman Ahmed
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
hktripathy
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
Fabio Fumarola
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Query Optimization in SQL Server
Query Optimization in SQL ServerQuery Optimization in SQL Server
Query Optimization in SQL Server
Rajesh Gunasundaram
 
Oltp vs olap
Oltp vs olapOltp vs olap
Oltp vs olap
Mr. Fmhyudin
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
Fabio Fumarola
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
Mohit Saini
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
Leandro Totino Pereira
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
Alex Tumanoff
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
janani thirupathi
 
Hive
HiveHive
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
Simplilearn
 

What's hot (20)

Data warehousing
Data warehousingData warehousing
Data warehousing
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Sql server basics
Sql server basicsSql server basics
Sql server basics
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Query Optimization in SQL Server
Query Optimization in SQL ServerQuery Optimization in SQL Server
Query Optimization in SQL Server
 
Oltp vs olap
Oltp vs olapOltp vs olap
Oltp vs olap
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
 
Hive
HiveHive
Hive
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 

Similar to Data ingestion

Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
Khalid Imran
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.
Muthu Natarajan
 
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
Kowndinya Mannepalli
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
Ameet Paranjape
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
Supratim Ray
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
Harikrishnan K
 
Xavient - DiP
Xavient - DiPXavient - DiP
Xavient - DiP
Neeraj Sabharwal
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
Jonathan Bloom
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
Khushboo Kumari
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptx
infinix8
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
Imviplav
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
Ajay Ohri
 
Bigdata
BigdataBigdata
Bigdata
sweetysweety8
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdf
avenkatram
 
Data Engineering on GCP
Data Engineering on GCPData Engineering on GCP
Data Engineering on GCP
BlibBlobb
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
Naveen Korakoppa
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
Douglas Bernardini
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
Asis Mohanty
 

Similar to Data ingestion (20)

Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.
 
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Xavient - DiP
Xavient - DiPXavient - DiP
Xavient - DiP
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptx
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Bigdata
BigdataBigdata
Bigdata
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdf
 
Data Engineering on GCP
Data Engineering on GCPData Engineering on GCP
Data Engineering on GCP
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 

Recently uploaded

Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
Jean Carlos Nunes Paixão
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
adhitya5119
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
tarandeep35
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
amberjdewit93
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
Priyankaranawat4
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
Dr. Shivangi Singh Parihar
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
ArianaBusciglio
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
Scholarhat
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
WaniBasim
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
National Information Standards Organization (NISO)
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
Celine George
 
Assignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docxAssignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docx
ArianaBusciglio
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
Krisztián Száraz
 

Recently uploaded (20)

Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
 
Assignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docxAssignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docx
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
 

Data ingestion

Hive databases/schemas and tables, the two key aspects to consider are:
 Which data storage formats to use when storing data? (HDFS supports a number of file formats, such as SequenceFile, RCFile, ORCFile, Avro, Parquet, and others.)
 What are the optimal compression options for files stored on HDFS? (Examples include gzip, LZO, Snappy and others.)
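As a sketch of these two choices, a Hive table can declare its storage format and compression at creation time. The database, table, and column names below are hypothetical:

```sql
-- Hypothetical table: ORC storage format with Snappy compression
CREATE TABLE sales.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2),
  order_ts    TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```

ORC with Snappy is a common pairing for query-heavy workloads; Avro or delimited text may suit ingestion-heavy or interchange use cases better.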
HADOOP DATA INGESTION
 Today, most data is generated and stored outside Hadoop, e.g. in relational databases and plain files. Data ingestion is therefore the first step in utilizing the power of Hadoop, and various utilities have been developed to move data into Hadoop.
BATCH DATA INGESTION
 The File System Shell includes various shell-like commands, including copyFromLocal and copyToLocal, that directly interact with HDFS as well as the other file systems Hadoop supports.
 Most of the commands in the File System Shell behave like the corresponding Unix commands.
 When the data files are ready in the local file system, the shell is a great tool for ingesting data into HDFS in batch.
 In order to stream data into Hadoop for real-time analytics, however, we need more advanced tools, e.g. Apache Flume and Apache Chukwa.
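For example, a batch ingest of a local log file might look like the following (illustrative only: the paths and file names are hypothetical, and a running HDFS is assumed):

```
# Create a landing directory in HDFS (hypothetical path)
hadoop fs -mkdir -p /data/raw/logs

# Copy a local file into HDFS in batch
hadoop fs -copyFromLocal logs/2017-01-01.log /data/raw/logs/

# Verify the upload, then pull a file back to the local file system
hadoop fs -ls /data/raw/logs
hadoop fs -copyToLocal /data/raw/logs/2017-01-01.log /tmp/
```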
STREAMING DATA INGESTION
 Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS.
 It has a simple and flexible architecture based on streaming data flows; it is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
 It uses a simple extensible data model that allows for online analytic applications.
 Flume employs the familiar producer-consumer model. A Source is the entity through which data enters Flume; sources either actively poll for data or passively wait for data to be delivered to them. A Sink, on the other hand, is the entity that delivers the data to the destination. Flume has many built-in sources (e.g. log4j and syslog) and sinks (e.g. HDFS and HBase). A Channel is the conduit between the Source and the Sink: sources ingest events into the channel and sinks drain the channel. Channels allow the ingestion rate to be decoupled from the drain rate; when data is generated faster than the destination can handle, the channel size increases.
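The producer-consumer decoupling can be sketched with a bounded queue standing in for the channel. This is a simplified analogy of our own, not Flume's actual implementation:

```python
import queue
import threading

class Channel:
    """Conduit between Source and Sink; decouples ingest rate from drain rate."""
    def __init__(self, capacity=100):
        self._q = queue.Queue(maxsize=capacity)

    def put(self, event):
        self._q.put(event)        # blocks when the channel is full

    def take(self):
        return self._q.get()

def source(channel, events):
    # Source ingests events into the channel
    for e in events:
        channel.put(e)

def sink(channel, destination, n):
    # Sink drains the channel into the destination
    for _ in range(n):
        destination.append(channel.take())

channel = Channel(capacity=10)   # small channel to force back-pressure
destination = []
events = [f"log-{i}" for i in range(25)]

t_src = threading.Thread(target=source, args=(channel, events))
t_snk = threading.Thread(target=sink, args=(channel, destination, len(events)))
t_src.start(); t_snk.start()
t_src.join(); t_snk.join()

print(destination[:3])  # → ['log-0', 'log-1', 'log-2']
```

Because the channel is bounded, a fast producer simply blocks until the consumer catches up, mirroring how a Flume channel absorbs spikes without losing events.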
STREAMING DATA INGESTION
 Apache Chukwa is devoted to large-scale log collection and analysis, built on top of the MapReduce framework. Beyond data ingestion, Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results. Unlike Flume, Chukwa is not a continuous stream processing system but a mini-batch system.
 Apache Kafka and Apache Storm may also be used to ingest streaming data into Hadoop, although they are mainly designed to solve different problems. Kafka is a distributed publish-subscribe messaging system; it is designed to provide high-throughput persistent messaging that is scalable and allows for parallel data loads into Hadoop. Storm is a distributed realtime computation system for use cases such as realtime analytics, online machine learning, and continuous computation.
STRUCTURED DATA INGESTION
 Apache Sqoop is a tool designed to efficiently transfer data between Hadoop and relational databases. We can use Sqoop to import data from a relational database table into HDFS. The import is performed in parallel and thus generates multiple files, in the format of delimited text, Avro, or SequenceFile. Sqoop also generates a Java class that encapsulates one row of the imported table, which can be used in subsequent MapReduce processing of the data. Moreover, Sqoop can export data (e.g. the results of MapReduce processing) back to the relational database for consumption by external applications or users.
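An import and a matching export might be invoked as follows (illustrative only: the host, database, user, and table names are hypothetical, and a reachable database plus a Hadoop cluster are assumed):

```
# Import a table into HDFS as Avro, using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --as-avrodatafile \
  -m 4

# Export processed results back to the relational database
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table order_stats \
  --export-dir /data/results/order_stats
```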
DATA INGESTION TOOLS
 Apache Hive
 Apache Flume
 Apache NiFi
 Apache Sqoop
 Apache Kafka
APACHE FLUME
 A service for streaming logs into Hadoop. Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically, Flume allows users to:
 Stream data
 Ingest streaming data from multiple sources into Hadoop for storage and analysis
 Insulate systems
 Buffer the storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination
 Guarantee data delivery
 Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started: one on the agent that delivers the event and one on the agent that receives the event. This ensures guaranteed-delivery semantics.
 Scale horizontally
 Ingest new data streams and additional volume as needed
APACHE FLUME
 Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data, and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive, or they can feed business dashboards served ongoing data by Apache HBase.
EXAMPLE OF FLUME
 Flume can be used to log manufacturing operations. When one run of product comes off the line, it generates a log file about that run. Even if this occurs hundreds or thousands of times per day, the large volume of log file data can stream through Flume into a tool for same-day analysis with Apache Storm, or months or years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive.
HOW FLUME WORKS
 Flume’s high-level architecture is built on a streamlined codebase that is easy to use and extend. The project is highly reliable, without the risk of data loss. Flume also supports dynamic reconfiguration without the need for a restart, which reduces downtime for its agents.
COMPONENTS OF FLUME
 Event
 A singular unit of data that is transported by Flume (typically a single log entry)
 Source
 The entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
 Sink
 The entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink, which writes events to HDFS.
 Channel
 The conduit between the Source and the Sink. Sources ingest events into the channel and sinks drain the channel.
 Agent
 Any physical Java virtual machine running Flume. It is a collection of sources, sinks, and channels.
 Client
 The entity that produces and transmits the Event to the Source operating within the Agent.
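These components are wired together in an agent's properties file. The sketch below is illustrative: the agent name, port, and HDFS path are our own choices, defining a netcat source, a memory channel, and an HDFS sink:

```
# Name the components of agent "a1" (hypothetical agent name)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory, decoupling ingest rate from drain rate
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such an agent would then be started with `flume-ng agent --conf-file <file> --name a1`.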
COMPONENT INTERACTION
 A flow in Flume starts from the Client.
 The Client transmits the Event to a Source operating within the Agent.
 The Source receiving this Event then delivers it to one or more Channels.
 One or more Sinks operating within the same Agent drain these Channels.
 Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange.
 When spikes in client-side activity cause data to be generated faster than the provisioned destination capacity can handle, the Channel size increases. This allows sources to continue normal operation for the duration of the spike.
 The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies.
 Note
 Flume’s distributed architecture requires no central coordination point. Each agent runs independently of the others with no inherent single point of failure, so Flume can easily scale horizontally.
APACHE NIFI
 Apache NiFi is a secure, integrated platform for real-time data collection, simple event processing, and transport and delivery from source to storage. It is useful for moving distributed data to and from your Hadoop cluster. NiFi has extensive distributed processing capability to help reduce processing cost, get real-time insights from many different data sources across many large systems, and aggregate that data into a single place or many different places.
 NiFi lets users get the most value from their data. Specifically, NiFi allows users to:
 Stream data from multiple sources
 Collect high volumes of data in real time
 Guarantee delivery of data
 Scale horizontally across many machines
HOW NIFI WORKS
 NiFi’s high-level architecture is focused on delivering a streamlined interface that is easy to use and easy to set up.
 Basic Terminology
 Processor: Processors in NiFi are what make the data move. Processors can generate data, run commands, move data, convert data, and much more. NiFi’s architecture and feature set are designed to be extended through these processors; they are at the very core of NiFi’s functionality.
 Process Group: When data flows get very complex, it can be very useful to group together the parts that perform certain functions. NiFi abstracts this concept and calls these groupings process groups.
 FlowFile: A FlowFile in NiFi represents a single piece of data. It is made up of two parts: attributes and content. Attributes are key-value pairs that give the data context. Typically three attributes are present on all FlowFiles: uuid, filename, and path.
 Connections and Relationships: NiFi allows users to simply drag and drop connections between processors, which control how the data flows. Each connection is assigned to a relationship type for the FlowFiles it carries (such as successful processing, or a failure to process).
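As a rough illustration (our own model, not NiFi's Java API), a FlowFile can be thought of as content plus an attribute map carrying those three standard keys:

```python
import uuid

class FlowFile:
    """Sketch of a NiFi FlowFile: content plus key-value attributes."""
    def __init__(self, content, filename, path):
        self.content = content
        # NiFi puts these three attributes on every FlowFile
        self.attributes = {
            "uuid": str(uuid.uuid4()),
            "filename": filename,
            "path": path,
        }

# Hypothetical log line wrapped as a FlowFile
ff = FlowFile(b"GET /index.html 200", filename="access.log", path="./")
print(sorted(ff.attributes))  # → ['filename', 'path', 'uuid']
```

Processors would read and rewrite the attribute map as the FlowFile moves through the flow, while the content itself is often passed along untouched.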
WORKING
 A FlowFile can originate from a processor in NiFi. Processors can also receive FlowFiles and transmit them to many other processors. These processors can then drop the data in the FlowFile into various places, depending on the function of the processor.
WHAT YOU NEED
 Oracle VirtualBox virtual machine (VM)
 An ODBC driver that matches the version of Excel you are using (32-bit or 64-bit)
 The Power View feature in Excel 2013 to visualize the server log data. Power View is currently only available in Microsoft Office Professional Plus and Microsoft Office 365 Professional Plus.
 Hortonworks DataFlow (HDF) installed on the Sandbox, so you’ll need to download the latest HDF release
THE FLOW LOOKS LIKE THIS