To get data into Hadoop, you'll typically follow these steps:
• Prepare Your Data: Make sure your data is cleaned, free of noise, properly formatted, and ready for ingestion. Data quality and preparation are important for efficient processing.
• Choose Storage Format: Select an appropriate file format for storing your data. Formats like Avro, Parquet, and ORC are common choices because of their compression and query-performance benefits.
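As a concrete illustration of these columnar formats, here is a minimal Python sketch that writes a tiny dataset to Snappy-compressed Parquet using the pyarrow library; the column names and output file name are invented for the example.

    # A minimal sketch of writing data in a columnar format (Parquet)
    # with pyarrow; the columns and file name are illustrative.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small in-memory table from plain Python lists.
    table = pa.table({
        "user_id": [1, 2, 3],
        "event":   ["click", "view", "click"],
    })

    # Write Snappy-compressed Parquet, a common default because it
    # balances compression ratio against query performance.
    pq.write_table(table, "events.parquet", compression="snappy")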
Use HDFS:
You can use HDFS to store your data. Here's how you can do it:
• HDFS Command Line: You can use Hadoop's command-line tools, such as hdfs dfs, to copy files from your local system to HDFS (a sketch follows this list).
• Hadoop Ecosystem Tools: Tools like Apache NiFi, Sqoop, and Flume specialize in data ingestion, making the process easier and more automated.
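Here is a hedged sketch of the command-line route, driven from Python so that all the examples in this section stay in one language; the local and HDFS paths are placeholders, and a configured Hadoop client is assumed to be on the PATH.

    # A minimal sketch of copying a local file into HDFS by invoking
    # the standard hdfs dfs CLI; both paths are placeholders.
    import subprocess

    local_file = "events.parquet"        # local source (illustrative)
    hdfs_dir = "/user/hadoop/ingest/"    # HDFS target (illustrative)

    # Create the target directory; -p avoids an error if it exists.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)

    # Copy the local file into HDFS; -f overwrites an existing copy.
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)

    # List the directory to confirm the upload succeeded.
    subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)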
APACHE KAFKA
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation, written in Java and Scala.
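As an illustration, the sketch below publishes a few events to a topic and reads them back using the third-party kafka-python client; the broker address and topic name are assumptions, not part of the original text.

    # A minimal sketch of producing and consuming Kafka events with
    # the kafka-python client; broker and topic are assumptions.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for i in range(3):
        # Values are raw bytes; real pipelines often serialize JSON or Avro.
        producer.send("page-views", value=("view-%d" % i).encode("utf-8"))
    producer.flush()  # block until every queued message is sent

    consumer = KafkaConsumer(
        "page-views",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",  # start from the beginning of the topic
        consumer_timeout_ms=5000,      # give up after 5 s with no new records
    )
    for record in consumer:
        print(record.value)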
HADOOP SQOOP
• Apache Sqoop, as its name (a contraction of "SQL-to-Hadoop") suggests, is a big data tool for transferring data between Hadoop and relational database servers. It is used to transfer data from an RDBMS (relational database management system) such as MySQL or Oracle to HDFS (Hadoop Distributed File System).
• Sqoop has two main functions: importing and exporting. Importing transfers structured data into HDFS; exporting moves that data from Hadoop to external databases in the cloud or on premises. A typical import is sketched below.
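For concreteness, here is a hedged sketch of such an import, launched from Python; the JDBC URL, credentials, table name, and HDFS target directory are all placeholders.

    # A minimal sketch of a Sqoop import from MySQL into HDFS; every
    # connection detail below is a placeholder, not a real endpoint.
    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",  # source RDBMS (illustrative)
        "--username", "etl_user",
        "--password", "secret",                 # prefer --password-file in practice
        "--table", "orders",                    # table to pull into Hadoop
        "--target-dir", "/user/hadoop/orders",  # HDFS destination
        "--num-mappers", "4",                   # parallel map tasks for the transfer
    ], check=True)

The reverse direction uses sqoop export, with an --export-dir pointing at the HDFS data to push back into the database.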
HADOOP STREAMING
• MapReduce is a processing technique and programming model for distributed computing, based on Java. The micro-blogging site Twitter receives nearly 500 million tweets a day, roughly 6,000 tweets per second, and its use case illustrates MapReduce well: the tweets are the input, and MapReduce performs actions such as tokenizing, filtering, counting, and aggregating counters.
• Hadoop Streaming allows us to write MapReduce jobs in many languages (such as Python, Perl, Ruby, or C++) and run the scripts as the mapper and reducer. It thus enables someone with no knowledge of Java to write MapReduce jobs in a language of their own choice, as in the sketch below.
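As a concrete example, the Twitter-style tokenize-and-count job can be written as two small Python scripts; the script names and paths are illustrative, and the exact location of the streaming jar varies by installation.

    # mapper.py - a minimal Hadoop Streaming mapper: tokenize each
    # input line (e.g. a tweet) and emit "word<TAB>1" per token.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

    # reducer.py - a minimal Hadoop Streaming reducer: the framework
    # delivers input sorted by key, so consecutive lines with the
    # same word can be summed into one aggregate count.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 1
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

A job like this would be launched through the streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -input /tweets -output /counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py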
APACHE FLUME:
Apache Flume is a part of the Hadoop ecosystem and is mainly used for real-time ingestion of data from different web applications into storage such as HDFS or HBase.
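Flume agents are defined in properties files rather than in code; to stay in Python, the hedged sketch below simply writes out a minimal agent definition, with every name and path invented for the example.

    # A minimal sketch that writes a Flume agent configuration to
    # disk; Flume itself is configured via properties files, and all
    # names and paths below are illustrative.
    FLUME_CONF = """\
    agent1.sources  = web-logs
    agent1.channels = mem-channel
    agent1.sinks    = hdfs-sink

    # Tail an application log file as the event source.
    agent1.sources.web-logs.type = exec
    agent1.sources.web-logs.command = tail -F /var/log/app/access.log
    agent1.sources.web-logs.channels = mem-channel

    # Buffer events in memory between source and sink.
    agent1.channels.mem-channel.type = memory
    agent1.channels.mem-channel.capacity = 10000

    # Deliver the buffered events into HDFS.
    agent1.sinks.hdfs-sink.type = hdfs
    agent1.sinks.hdfs-sink.hdfs.path = /flume/events/%Y-%m-%d
    agent1.sinks.hdfs-sink.channel = mem-channel
    """

    with open("agent1.conf", "w") as f:
        f.write(FLUME_CONF)

The agent would then be started with something like: flume-ng agent --name agent1 --conf-file agent1.conf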
APACHE NIFI:
Apache NiFi is an integrated data-logistics platform for automating the movement of data between disparate systems. It provides real-time control that makes it easy to manage the movement of data between any source and any destination.
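As a small illustration of that control surface, NiFi exposes an HTTP REST API; the sketch below polls the flow-status endpoint with the requests library, assuming an unsecured local instance on port 8080 (secured clusters require authentication).

    # A minimal sketch of checking NiFi's overall flow status via its
    # REST API; host, port, and anonymous access are assumptions.
    import requests

    resp = requests.get("http://localhost:8080/nifi-api/flow/status")
    resp.raise_for_status()

    status = resp.json()["controllerStatus"]
    print("active threads:", status["activeThreadCount"])
    print("queued:", status["queued"])  # flowfiles waiting in queues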