DATA INGESTION
Hadoop
INTRODUCTION
 Definition
 Data ingestion is the process of obtaining and importing data
for immediate use or storage in a database. To ingest
something is to "take something in or absorb something."
 Data can be streamed in real time or ingested in batches.
 When data is ingested in real time, each data item is
imported as it is emitted by the source.
 When data is ingested in batches, data items are
imported in discrete chunks at periodic intervals of time.
 Note
 An effective data ingestion process begins by prioritizing data
sources, validating individual files and routing data items to the
correct destination.
INTRODUCTION CONTD..
 When numerous big data sources exist in
diverse formats (the sources may often number in the
hundreds and the formats in the dozens), it can be
challenging for businesses to ingest data at a reasonable
speed and process it efficiently in order to maintain a
competitive advantage.
 To that end, vendors offer software programs that are
tailored to specific computing environments or
software applications.
 When data ingestion is automated, the software used to
carry out the process may also include data
preparation features to structure and organize data so it
can be analyzed on the fly or at a later time by business
intelligence (BI) and business analytics (BA) programs.
BIG DATA INGESTION PATTERNS
 A common pattern that a lot of companies use to
populate a Hadoop-based data lake is to get data
from pre-existing relational databases and data
warehouses.
 When planning to ingest data into the data lake, one
of the key considerations is to determine how to
organize data and enable consumers to access the
data.
 Hive and Impala provide a data infrastructure on top
of Hadoop – commonly referred to as SQL on
Hadoop – that provides structure to the data and the
ability to query the data using a SQL-like language.
KEY ASPECTS TO CONSIDER
 Before you start to populate, say, Hive
databases/schemas and tables with data, there are
two key aspects to consider:
 Which data storage format to use when storing
data? (HDFS supports a number of file formats,
such as SequenceFile, RCFile, ORCFile,
Avro, Parquet, and others.)
 What are the optimal compression options for
files stored on HDFS? (Examples include gzip,
LZO, Snappy, and others.)
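For illustration, both choices are typically made in the table DDL. A minimal sketch of a Hive table stored as ORC with Snappy compression follows; the table and column names are hypothetical, and running the DDL requires a live Hive installation, so the invocation is shown commented out.

```shell
# Sketch: Hive DDL selecting the ORC storage format with Snappy compression.
# Table and column names are hypothetical examples.
cat > create_orders.hql <<'EOF'
CREATE TABLE orders (
  order_id BIGINT,
  amount   DOUBLE,
  ts       STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
EOF

# Run against a live Hive installation:
# hive -f create_orders.hql
```

The same table could instead be declared with `STORED AS PARQUET` or another format; the storage format and compression codec are independent knobs.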
HADOOP DATA INGESTION
 Today, most data is generated and stored
outside Hadoop, e.g. in relational databases
and plain files. Data ingestion is therefore
the first step in utilizing the power of Hadoop.
Various utilities have been developed to
move data into Hadoop.
BATCH DATA INGESTION
 The File System Shell includes various shell-like
commands,
including copyFromLocal and copyToLocal, that
directly interact with HDFS as well as other
file systems that Hadoop supports. Most of the
commands in the File System Shell behave like their
corresponding Unix commands. When the data
files are ready in the local file system, the shell is a
great tool to ingest data into HDFS in batch. In
order to stream data into Hadoop for real-time
analytics, however, we need more advanced
tools, e.g. Apache Flume and Apache
Chukwa.
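A sketch of the batch pattern with the File System Shell; the HDFS paths are hypothetical, and the hadoop commands assume a running cluster, so they are shown commented out.

```shell
# Sketch of batch ingestion with the File System Shell.
# HDFS paths are hypothetical; the hadoop commands require a running
# cluster, so they are commented out here.
printf '2024-01-01,widget,42\n' > sales.csv        # sample local data file

# hadoop fs -mkdir -p /data/landing/sales           # create a landing directory
# hadoop fs -copyFromLocal sales.csv /data/landing/sales/
# hadoop fs -ls /data/landing/sales                 # verify the file arrived
# hadoop fs -copyToLocal /data/landing/sales/sales.csv roundtrip.csv
```

copyFromLocal moves data into HDFS in one batch; copyToLocal retrieves it, e.g. for local inspection.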
STREAMING DATA INGESTION
 Apache Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of log data
into HDFS.
 It has a simple and flexible architecture based on streaming data flows;
it is robust and fault tolerant, with tunable reliability mechanisms and
many failover and recovery mechanisms.
 It uses a simple extensible data model that allows for online analytic
application.
 Flume employs the familiar producer-consumer model. Source is the
entity through which data enters into Flume. Sources either actively poll
for data or passively wait for data to be delivered to them. On the other
hand, Sink is the entity that delivers the data to the destination. Flume
has many built-in sources (e.g. log4j and syslogs) and sinks (e.g. HDFS
and HBase). Channel is the conduit between the Source and the Sink.
Sources ingest events into the channel and the sinks drain the channel.
Channels allow decoupling of the ingestion rate from the drain rate. When data
is generated faster than the destination can handle, the channel
size increases.
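The source–channel–sink pipeline above can be sketched as a minimal single-agent Flume configuration; the hostnames, port, and HDFS path below are hypothetical examples, and launching the agent requires a Flume installation plus a reachable HDFS, so the launch command is commented out.

```shell
# Minimal single-agent Flume configuration: a netcat source feeding an HDFS
# sink through a memory channel. Hostname, port, and HDFS path are examples.
cat > flume.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: listens on a TCP port and turns each received line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer decoupling ingestion rate from drain rate
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: drains the channel into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.channel = c1
EOF

# Launch the agent:
# flume-ng agent --conf conf --conf-file flume.conf --name a1
```

If events arrive faster than the HDFS sink can write them, they accumulate in the memory channel up to its configured capacity.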
STREAMING DATA INGESTION
 Apache Chukwa is devoted to large-scale log collection and
analysis, built on top of the MapReduce framework. Beyond data
ingestion, Chukwa also includes a flexible and powerful toolkit for
displaying, monitoring, and analyzing results. Unlike Flume,
Chukwa is not a continuous stream processing system but a
mini-batch system.
 Apache Kafka and Apache Storm may also be used to ingest
streaming data into Hadoop although they are mainly designed to
solve different problems. Kafka is a distributed publish-subscribe
messaging system. It is designed to provide high throughput
persistent messaging that’s scalable and allows for parallel data
loads into Hadoop. Storm is a distributed realtime computation
system for use cases such as realtime analytics, online machine
learning, continuous computation, etc.
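As a sketch of the Kafka side, a topic can be created and fed with Kafka's standard console tools; the broker address, topic name, and log path below are hypothetical, and the commands require a running broker, so they are shown commented out.

```shell
# Sketch: publish log lines to Kafka with the console tools.
# Broker address, topic name, and log path are hypothetical; the kafka
# commands require a running broker, so they are commented out here.
printf 'INFO app started\nWARN low disk\n' > app.log   # sample log data

# kafka-topics.sh --create --topic app-logs --partitions 3 \
#   --replication-factor 1 --bootstrap-server localhost:9092
# kafka-console-producer.sh --topic app-logs \
#   --bootstrap-server localhost:9092 < app.log
```

A downstream consumer (e.g. a Storm topology or a Hadoop loader) would then read from the app-logs topic in parallel.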
STRUCTURED DATA INGESTION
 Apache Sqoop is a tool designed to efficiently
transfer data between Hadoop and relational
databases. We can use Sqoop to import data from a
relational database table into HDFS. The import
process is performed in parallel and thus generates
multiple files in the format of delimited text, Avro, or
SequenceFile. In addition, Sqoop generates a Java
class that encapsulates one row of the imported
table, which can be used in subsequent MapReduce
processing of the data. Moreover, Sqoop can export
the data (e.g. the results of MapReduce processing)
back to the relational database for consumption by
external applications or users.
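As a sketch, a Sqoop import can be captured in an options file (one option per line); the connection string, username, and table names below are hypothetical, and running the tool requires Sqoop on a Hadoop cluster, so the invocations are shown commented out.

```shell
# Sketch: a Sqoop import expressed as an options file (one option per line).
# Connection string, username, and table names are hypothetical.
cat > import-orders.txt <<'EOF'
import
--connect
jdbc:mysql://dbhost/sales
--username
ingest
--table
orders
--target-dir
/data/orders
--as-avrodatafile
--num-mappers
4
EOF

# sqoop --options-file import-orders.txt

# Exporting processed results back to the database follows the same shape:
# sqoop export --connect jdbc:mysql://dbhost/sales --username ingest \
#   --table order_summary --export-dir /data/order_summary
```

--num-mappers controls the degree of parallelism of the import, and hence the number of output files written to the target directory.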
DATA INGESTION TOOLS
 Apache Hive
 Apache Flume
 Apache NiFi
 Apache Sqoop
 Apache Kafka
APACHE FLUME
 A service for streaming logs into Hadoop
 Flume lets Hadoop users ingest high-volume streaming data into HDFS
for storage.
 Specifically, Flume allows users to:
 Stream data
 Ingest streaming data from multiple sources into Hadoop for storage and analysis
 Insulate systems
 Buffer the storage platform against transient spikes, when the rate of incoming data exceeds
the rate at which it can be written to the destination
 Guarantee data delivery
 Flume NG uses channel-based transactions to guarantee reliable message delivery.
When a message moves from one agent to another, two transactions are started, one
on the agent that delivers the event and the other on the agent that receives the event.
This ensures guaranteed delivery semantics.
 Scale horizontally
 To ingest new data streams and additional volume as needed
APACHE FLUME
 Enterprises use Flume’s powerful streaming
capabilities to land data from high-throughput
streams in the Hadoop Distributed File System
(HDFS). Typical sources of these streams are
application logs, sensor and machine data,
geo-location data and social media. These different
types of data can be landed in Hadoop for future
analysis using interactive queries in Apache
Hive, or they can feed business dashboards
that are served ongoing data by Apache HBase.
EXAMPLE OF FLUME
 Flume is used to log manufacturing
operations. When one run of product comes
off the line, it generates a log file about that
run. Even if this occurs hundreds or
thousands of times per day, the large volume
of log file data can stream through Flume into a
tool for same-day analysis with Apache
Storm, or months or years of production runs
can be stored in HDFS and analyzed by a
quality assurance engineer using Apache
Hive.
FLUME ILLUSTRATION
HOW FLUME WORKS
 Flume’s high-level architecture is built on a
streamlined codebase that is easy to use and
extend. The project is highly reliable, without
the risk of data loss. Flume also supports
dynamic reconfiguration without the need for
a restart, which reduces downtime for its
agents.
COMPONENTS OF FLUME
 Event
 A singular unit of data that is transported by Flume (typically a single log entry)
 Source
 The entity through which data enters into Flume. Sources either actively poll for data or
passively wait for data to be delivered to them. A variety of sources allow data to be
collected, such as log4j logs and syslogs.
 Sink
 The entity that delivers the data to the destination. A variety of sinks allow data to be
streamed to a range of destinations. One example is the HDFS sink that writes events to
HDFS.
 Channel
 The conduit between the Source and the Sink. Sources ingest events into the channel and
the sinks drain the channel.
 Agent
 Any physical Java virtual machine running Flume. It is a collection of sources, sinks and
channels.
 Client
 The entity that produces and transmits the Event to the Source operating within the Agent.
COMPONENT INTERACTION
 A flow in Flume starts from the Client.
 The Client transmits the Event to a Source operating within the Agent.
 The Source receiving this Event then delivers it to one or
more Channels.
 One or more Sinks operating within the same Agent drain
these Channels.
 Channels decouple the ingestion rate from drain rate using the familiar
producer-consumer model of data exchange.
 When spikes in client-side activity cause data to be generated faster
than the provisioned destination capacity can handle,
the Channel size increases. This allows sources to continue normal
operation for the duration of the spike.
 The Sink of one Agent can be chained to the Source of another Agent.
This chaining enables the creation of complex data flow topologies.
 Note
 Flume’s distributed architecture requires no central coordination
point. Each agent runs independently of others with no inherent single point
of failure, and Flume can easily scale horizontally.
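The agent chaining described above can be sketched as two agents wired sink-to-source over Avro; the hostnames, ports, log path, and HDFS path below are hypothetical, and starting the agents requires Flume on each host, so the launch commands are commented out.

```shell
# Two chained Flume agents: agent1's Avro sink forwards events to agent2's
# Avro source on another host. Hostnames, ports, and paths are examples.
cat > chained-flume.conf <<'EOF'
# --- agent1: collects locally, forwards over Avro ---
agent1.sources  = r1
agent1.channels = c1
agent1.sinks    = k1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /var/log/app.log
agent1.sources.r1.channels = c1
agent1.channels.c1.type = memory
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1

# --- agent2: receives over Avro, writes to HDFS ---
agent2.sources  = r1
agent2.channels = c1
agent2.sinks    = k1
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
agent2.sources.r1.channels = c1
agent2.channels.c1.type = memory
agent2.sinks.k1.type = hdfs
agent2.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/chained
agent2.sinks.k1.channel = c1
EOF

# Each agent is started on its own host, selected by --name:
# flume-ng agent --conf conf --conf-file chained-flume.conf --name agent1
# flume-ng agent --conf conf --conf-file chained-flume.conf --name agent2
```

Fan-in topologies (many agent1-style collectors feeding one agent2-style aggregator) are built the same way, by pointing multiple Avro sinks at one Avro source.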
APACHE NIFI
 Apache NiFi is a secure integrated platform for real-time data
collection, simple event processing, transport and delivery from
source to storage. It is useful for moving distributed data to and
from your Hadoop cluster. NiFi has extensive distributed processing
capability to help reduce processing cost and get real-time
insights from many different data sources across many large
systems, and it can help aggregate that data into a single place or
many different places.
 NiFi lets users get the most value from their data. Specifically
NiFi allows users to:
 Stream data from multiple sources
 Collect high volumes of data in real time
 Guarantee delivery of data
 Scale horizontally across many machines
HOW NIFI WORKS
 NiFi’s high-level architecture is focused on delivering a streamlined
interface that is easy to use and easy to set up.
 Basic Terminology
 Processor: Processors in NiFi are what make the data move. Processors
can help generate data, run commands, move data, convert data, and much
more. NiFi’s architecture and feature set are designed to be extended through
these processors. They are at the very core of NiFi’s functionality.
 Processing Group: When data flows get very complex, it can be very useful
to group together parts which perform certain functions. NiFi
abstracts this concept and calls these processing groups.
 FlowFile: A FlowFile in NiFi represents a single piece of data. It is made
up of two parts: attributes and content. Attributes are key-value pairs that
give the data context. Typically there are 3 attributes
present on all FlowFiles: uuid, filename, and path.
 Connections and Relationships: NiFi allows users to simply drag and drop
connections between processors, which control how the data will flow. Each
connection is assigned to different types of relationships for the
FlowFiles (such as successful processing, or a failure to process).
WORKING
 A FlowFile can originate from a processor in
NiFi. Processors can also receive FlowFiles
and transmit them to many other
processors. These processors can then drop
the data in the FlowFile into various places
depending on the function of the processor.
WHAT YOU NEED
 Oracle VirtualBox virtual machine (VM).
 ODBC driver that matches the version of Excel
you are using (32-bit or 64-bit).
 Power View feature in Excel 2013 to visualize
the server log data.
 Power View is currently only available in Microsoft
Office Professional Plus and Microsoft Office 365
Professional Plus.
 Hortonworks DataFlow (HDF) installed on the
Sandbox; you’ll need to download the latest
HDF release.
HOW NIFI LOOKS LIKE
IMPORT FLOW IN NIFI
THE FLOW LOOKS LIKE THIS
VERIFYING THE IMPORT

More Related Content

What's hot

Data warehousing
Data warehousingData warehousing
Data warehousing
Shruti Dalela
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
Avkash Chauhan
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure
Antonios Chatzipavlis
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Sql server basics
Sql server basicsSql server basics
Sql server basics
VishalJharwade
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
Sulman Ahmed
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
hktripathy
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
Fabio Fumarola
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Query Optimization in SQL Server
Query Optimization in SQL ServerQuery Optimization in SQL Server
Query Optimization in SQL Server
Rajesh Gunasundaram
 
Oltp vs olap
Oltp vs olapOltp vs olap
Oltp vs olap
Mr. Fmhyudin
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
Fabio Fumarola
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
Mohit Saini
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
Leandro Totino Pereira
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
Alex Tumanoff
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
janani thirupathi
 
Hive
HiveHive
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
Simplilearn
 

What's hot (20)

Data warehousing
Data warehousingData warehousing
Data warehousing
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Sql server basics
Sql server basicsSql server basics
Sql server basics
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Query Optimization in SQL Server
Query Optimization in SQL ServerQuery Optimization in SQL Server
Query Optimization in SQL Server
 
Oltp vs olap
Oltp vs olapOltp vs olap
Oltp vs olap
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
 
Hive
HiveHive
Hive
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 

Similar to Data ingestion

Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
Khalid Imran
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.
Muthu Natarajan
 
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
Kowndinya Mannepalli
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
Ameet Paranjape
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
Supratim Ray
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
Harikrishnan K
 
Xavient - DiP
Xavient - DiPXavient - DiP
Xavient - DiP
Neeraj Sabharwal
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
Jonathan Bloom
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
Khushboo Kumari
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptx
infinix8
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
Imviplav
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
Ajay Ohri
 
Bigdata
BigdataBigdata
Bigdata
sweetysweety8
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdf
avenkatram
 
Data Engineering on GCP
Data Engineering on GCPData Engineering on GCP
Data Engineering on GCP
BlibBlobb
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
Naveen Korakoppa
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
Douglas Bernardini
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
Asis Mohanty
 

Similar to Data ingestion (20)

Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.
 
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Xavient - DiP
Xavient - DiPXavient - DiP
Xavient - DiP
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptx
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Bigdata
BigdataBigdata
Bigdata
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdf
 
Data Engineering on GCP
Data Engineering on GCPData Engineering on GCP
Data Engineering on GCP
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 

Recently uploaded

Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
Jean Carlos Nunes Paixão
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
adhitya5119
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
tarandeep35
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
amberjdewit93
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
Priyankaranawat4
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
Dr. Shivangi Singh Parihar
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
ArianaBusciglio
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
Scholarhat
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
WaniBasim
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
National Information Standards Organization (NISO)
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
Celine George
 
Assignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docxAssignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docx
ArianaBusciglio
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
Krisztián Száraz
 

Recently uploaded (20)

Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
 
Assignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docxAssignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docx
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
 

Data ingestion

Hive databases/schemas and tables, the two key aspects to consider are:
 Which data storage formats to use when storing data? (HDFS supports a number of file formats, such as SequenceFile, RCFile, ORCFile, Avro, Parquet, and others.)
 What are the optimal compression options for files stored on HDFS? (Examples include gzip, LZO, Snappy and others.)
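As a sketch of these two choices, a Hive table can declare its storage format and compression at creation time. The database, table, and column names below are hypothetical:

```sql
-- Hypothetical table: ORC storage format with Snappy compression
CREATE TABLE sales.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2),
  order_ts    TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```

ORC with Snappy is a common pairing for query-heavy workloads; Avro or delimited text may suit ingestion-heavy or interchange use cases better.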
HADOOP DATA INGESTION
 Today, most data is generated and stored outside Hadoop, e.g. in relational databases and plain files. Data ingestion is therefore the first step in utilizing the power of Hadoop, and various utilities have been developed to move data into Hadoop.
BATCH DATA INGESTION
 The File System Shell includes various shell-like commands, including copyFromLocal and copyToLocal, that directly interact with HDFS as well as the other file systems Hadoop supports.
 Most of the commands in the File System Shell behave like the corresponding Unix commands.
 When the data files are ready in the local file system, the shell is a great tool for ingesting data into HDFS in batch.
 In order to stream data into Hadoop for real-time analytics, however, we need more advanced tools, e.g. Apache Flume and Apache Chukwa.
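For example, a batch ingest of a local log file might look like the following (illustrative only: the paths and file names are hypothetical, and a running HDFS is assumed):

```
# Create a landing directory in HDFS (hypothetical path)
hadoop fs -mkdir -p /data/raw/logs

# Copy a local file into HDFS in batch
hadoop fs -copyFromLocal logs/2017-01-01.log /data/raw/logs/

# Verify the upload, then pull a file back to the local file system
hadoop fs -ls /data/raw/logs
hadoop fs -copyToLocal /data/raw/logs/2017-01-01.log /tmp/
```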
STREAMING DATA INGESTION
 Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS.
 It has a simple and flexible architecture based on streaming data flows; it is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
 It uses a simple extensible data model that allows for online analytic applications.
 Flume employs the familiar producer-consumer model. A Source is the entity through which data enters Flume; sources either actively poll for data or passively wait for data to be delivered to them. A Sink, on the other hand, is the entity that delivers the data to the destination. Flume has many built-in sources (e.g. log4j and syslog) and sinks (e.g. HDFS and HBase). A Channel is the conduit between the Source and the Sink: sources ingest events into the channel and sinks drain the channel. Channels allow the ingestion rate to be decoupled from the drain rate; when data is generated faster than the destination can handle, the channel size increases.
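The producer-consumer decoupling can be sketched with a bounded queue standing in for the channel. This is a simplified analogy of our own, not Flume's actual implementation:

```python
import queue
import threading

class Channel:
    """Conduit between Source and Sink; decouples ingest rate from drain rate."""
    def __init__(self, capacity=100):
        self._q = queue.Queue(maxsize=capacity)

    def put(self, event):
        self._q.put(event)        # blocks when the channel is full

    def take(self):
        return self._q.get()

def source(channel, events):
    # Source ingests events into the channel
    for e in events:
        channel.put(e)

def sink(channel, destination, n):
    # Sink drains the channel into the destination
    for _ in range(n):
        destination.append(channel.take())

channel = Channel(capacity=10)   # small channel to force back-pressure
destination = []
events = [f"log-{i}" for i in range(25)]

t_src = threading.Thread(target=source, args=(channel, events))
t_snk = threading.Thread(target=sink, args=(channel, destination, len(events)))
t_src.start(); t_snk.start()
t_src.join(); t_snk.join()

print(destination[:3])  # → ['log-0', 'log-1', 'log-2']
```

Because the channel is bounded, a fast producer simply blocks until the consumer catches up, mirroring how a Flume channel absorbs spikes without losing events.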
STREAMING DATA INGESTION
 Apache Chukwa is devoted to large-scale log collection and analysis, built on top of the MapReduce framework. Beyond data ingestion, Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results. Unlike Flume, Chukwa is not a continuous stream processing system but a mini-batch system.
 Apache Kafka and Apache Storm may also be used to ingest streaming data into Hadoop, although they are mainly designed to solve different problems. Kafka is a distributed publish-subscribe messaging system; it is designed to provide high-throughput persistent messaging that is scalable and allows for parallel data loads into Hadoop. Storm is a distributed realtime computation system for use cases such as realtime analytics, online machine learning, and continuous computation.
STRUCTURED DATA INGESTION
 Apache Sqoop is a tool designed to efficiently transfer data between Hadoop and relational databases. We can use Sqoop to import data from a relational database table into HDFS. The import is performed in parallel and thus generates multiple files, in the format of delimited text, Avro, or SequenceFile. Sqoop also generates a Java class that encapsulates one row of the imported table, which can be used in subsequent MapReduce processing of the data. Moreover, Sqoop can export data (e.g. the results of MapReduce processing) back to the relational database for consumption by external applications or users.
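An import and a matching export might be invoked as follows (illustrative only: the host, database, user, and table names are hypothetical, and a reachable database plus a Hadoop cluster are assumed):

```
# Import a table into HDFS as Avro, using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --as-avrodatafile \
  -m 4

# Export processed results back to the relational database
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table order_stats \
  --export-dir /data/results/order_stats
```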
DATA INGESTION TOOLS
 Apache Hive
 Apache Flume
 Apache NiFi
 Apache Sqoop
 Apache Kafka
APACHE FLUME
 A service for streaming logs into Hadoop. Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically, Flume allows users to:
 Stream data
 Ingest streaming data from multiple sources into Hadoop for storage and analysis
 Insulate systems
 Buffer the storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination
 Guarantee data delivery
 Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started: one on the agent that delivers the event and one on the agent that receives the event. This ensures guaranteed-delivery semantics.
 Scale horizontally
 Ingest new data streams and additional volume as needed
APACHE FLUME
 Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data, and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive, or they can feed business dashboards served ongoing data by Apache HBase.
EXAMPLE OF FLUME
 Flume can be used to log manufacturing operations. When one run of product comes off the line, it generates a log file about that run. Even if this occurs hundreds or thousands of times per day, the large volume of log file data can stream through Flume into a tool for same-day analysis with Apache Storm, or months or years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive.
HOW FLUME WORKS
 Flume’s high-level architecture is built on a streamlined codebase that is easy to use and extend. The project is highly reliable, without the risk of data loss. Flume also supports dynamic reconfiguration without the need for a restart, which reduces downtime for its agents.
COMPONENTS OF FLUME
 Event
 A singular unit of data that is transported by Flume (typically a single log entry)
 Source
 The entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
 Sink
 The entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink, which writes events to HDFS.
 Channel
 The conduit between the Source and the Sink. Sources ingest events into the channel and sinks drain the channel.
 Agent
 Any physical Java virtual machine running Flume. It is a collection of sources, sinks, and channels.
 Client
 The entity that produces and transmits the Event to the Source operating within the Agent.
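These components are wired together in an agent's properties file. The sketch below is illustrative: the agent name, port, and HDFS path are our own choices, defining a netcat source, a memory channel, and an HDFS sink:

```
# Name the components of agent "a1" (hypothetical agent name)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory, decoupling ingest rate from drain rate
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such an agent would then be started with `flume-ng agent --conf-file <file> --name a1`.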
COMPONENT INTERACTION
 A flow in Flume starts from the Client.
 The Client transmits the Event to a Source operating within the Agent.
 The Source receiving this Event then delivers it to one or more Channels.
 One or more Sinks operating within the same Agent drain these Channels.
 Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange.
 When spikes in client-side activity cause data to be generated faster than the provisioned destination capacity can handle, the Channel size increases. This allows sources to continue normal operation for the duration of the spike.
 The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies.
 Note
 Flume’s distributed architecture requires no central coordination point. Each agent runs independently of the others with no inherent single point of failure, so Flume can easily scale horizontally.
APACHE NIFI
 Apache NiFi is a secure, integrated platform for real-time data collection, simple event processing, and transport and delivery from source to storage. It is useful for moving distributed data to and from your Hadoop cluster. NiFi has extensive distributed processing capability to help reduce processing cost, get real-time insights from many different data sources across many large systems, and aggregate that data into a single place or many different places.
 NiFi lets users get the most value from their data. Specifically, NiFi allows users to:
 Stream data from multiple sources
 Collect high volumes of data in real time
 Guarantee delivery of data
 Scale horizontally across many machines
HOW NIFI WORKS
 NiFi’s high-level architecture is focused on delivering a streamlined interface that is easy to use and easy to set up.
 Basic Terminology
 Processor: Processors in NiFi are what make the data move. Processors can generate data, run commands, move data, convert data, and much more. NiFi’s architecture and feature set are designed to be extended through these processors; they are at the very core of NiFi’s functionality.
 Process Group: When data flows get very complex, it can be very useful to group together the parts that perform certain functions. NiFi abstracts this concept and calls these groupings process groups.
 FlowFile: A FlowFile in NiFi represents a single piece of data. It is made up of two parts: attributes and content. Attributes are key-value pairs that give the data context. Typically three attributes are present on all FlowFiles: uuid, filename, and path.
 Connections and Relationships: NiFi allows users to simply drag and drop connections between processors, which control how the data flows. Each connection is assigned to a relationship type for the FlowFiles it carries (such as successful processing, or a failure to process).
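As a rough illustration (our own model, not NiFi's Java API), a FlowFile can be thought of as content plus an attribute map carrying those three standard keys:

```python
import uuid

class FlowFile:
    """Sketch of a NiFi FlowFile: content plus key-value attributes."""
    def __init__(self, content, filename, path):
        self.content = content
        # NiFi puts these three attributes on every FlowFile
        self.attributes = {
            "uuid": str(uuid.uuid4()),
            "filename": filename,
            "path": path,
        }

# Hypothetical log line wrapped as a FlowFile
ff = FlowFile(b"GET /index.html 200", filename="access.log", path="./")
print(sorted(ff.attributes))  # → ['filename', 'path', 'uuid']
```

Processors would read and rewrite the attribute map as the FlowFile moves through the flow, while the content itself is often passed along untouched.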
WORKING
 A FlowFile can originate from a processor in NiFi. Processors can also receive FlowFiles and transmit them to many other processors. These processors can then drop the data in the FlowFile into various places, depending on the function of the processor.
WHAT YOU NEED
 Oracle VirtualBox virtual machine (VM)
 An ODBC driver that matches the version of Excel you are using (32-bit or 64-bit)
 The Power View feature in Excel 2013 to visualize the server log data. Power View is currently only available in Microsoft Office Professional Plus and Microsoft Office 365 Professional Plus.
 Hortonworks DataFlow (HDF) installed on the Sandbox, so you’ll need to download the latest HDF release
THE FLOW LOOKS LIKE THIS