2. What is Big Data?
• Data that is very large in size and still growing exponentially with time is called Big Data.
3. Why do we need Big Data?
• For applications with a limited amount of data, we normally use a relational database such as PostgreSQL, Oracle, or MySQL.
• But what about large-scale applications like Facebook, Google, and YouTube?
• Their data is so large and complex that no traditional data management system can store and process it.
• Facebook alone generates 500+ TB of data per day.
4. Where does Big Data come from?
• Social media
• Stock markets
• E-commerce sites
• Airplanes (flight and sensor data)
5. What is the need for storing such a huge amount of data?
• The main reason for storing data is analysis.
• More stored data enables more accurate analyses.
6. Example
• When we search for anything on e-commerce websites (eBay, Amazon), we get recommendations for products related to our search.
• Similarly, why does Facebook store our images and videos?
• a) Global marketing
• b) Targeted marketing
7. Categories of Big Data
• Structured Data: data with a fixed, predefined schema.
• Example: relational database management systems.
• Unstructured Data: data with no predefined format.
• Example: text files, images, videos, email, customer-service interactions, web pages, PDF files, presentations, social media data, etc.
9. What is Hadoop?
• Apache Hadoop is an open-source Java framework used to store, analyze, and process Big Data in a distributed environment across clusters of computers.
• It is used by Facebook, Yahoo, YouTube, Twitter, LinkedIn, and many more.
• It is inspired by Google's MapReduce algorithm for running distributed applications.
• Hadoop is written in the Java programming language, with some native code in C and some utilities as shell scripts.
11. Architecture of Hadoop
• Hadoop follows a master-slave architecture.
• Its two core components are:
• 1. HDFS (data storage)
• 2. MapReduce (analysis and processing)
13. Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System is a distributed file system used to store very large amounts of data.
• HDFS follows a master-slave architecture: there is one master machine (the NameNode) and multiple slave machines (the DataNodes).
• The data you give to Hadoop is stored across these machines in the cluster.
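As a minimal sketch of how a client writes a file to HDFS through the Java API (the NameNode address and file path are illustrative assumptions, not values from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // The client asks the NameNode (master) where to write; the bytes
        // themselves are streamed to DataNodes (slaves) in the cluster.
        Path file = new Path("/user/demo/hello.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS!");
        }
        fs.close();
    }
}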
15. Various components of HDFS
• Blocks: HDFS splits each file into fixed-size blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x), each replicated across several DataNodes (replication factor 3 by default).
• Name Node: the master; it stores the file-system metadata, i.e. the namespace and the mapping of blocks to DataNodes.
• Data Node: a slave; it stores the actual data blocks and serves read/write requests from clients.
• Secondary Name Node: periodically merges the NameNode's edit log into its namespace image; it is a checkpointing helper, not a standby NameNode.
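Building on the write sketch above, the FileSystem API can also show which DataNodes hold each block of a file; a small hedged example (the path is again illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/hello.txt"); // illustrative path

        // The NameNode answers metadata queries: for each block of the file,
        // it reports the offset, length, and the DataNodes holding a replica.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}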
16. MapReduce
• MapReduce is a framework and processing technique with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
• MapReduce programs are written in Java by default, but higher-level tools such as Apache Pig can generate MapReduce jobs for us.
• The MapReduce algorithm consists of two important tasks:
• 1) Map
• 2) Reduce
17. Map
• The Mapper takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs).
18. Reduce
• The Reducer takes the output of a Map as input and combines those data tuples into a smaller set of tuples.
• The Reduce job is always performed after the Map job, as in the word-count sketch below.
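A minimal word-count sketch of these two tasks using the Hadoop MapReduce Java API (class names and the tokenizing logic are illustrative choices, not from the slides):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: break each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: runs after the Map phase; sums the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}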
21. JobTracker process (MapReduce v1)
• The JobTracker receives job requests from the client.
• The JobTracker talks to the NameNode to determine the location of the data.
• The JobTracker finds the best TaskTracker nodes to execute the tasks.
• The JobTracker submits the work to the chosen TaskTracker nodes.
• The TaskTracker nodes are monitored; if they stop sending heartbeat signals, their work is rescheduled on a different TaskTracker.
• When the work is completed, the JobTracker updates its status and reports the overall job status back to the client.
• The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.
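The client side of this flow is a small driver program that configures and submits the job; a hedged sketch continuing the WordCount example above (input/output paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Submitting the job hands it to the cluster scheduler (the JobTracker
        // in MapReduce v1), which consults the NameNode for data locations and
        // assigns map/reduce tasks to TaskTracker nodes.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // illustrative
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // illustrative

        // Blocks until the job completes and reports overall success or failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}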
22. TaskTracker process
• The JobTracker submits work to the TaskTracker nodes.
• TaskTrackers run the tasks and report the status of each task back to the JobTracker.
• A TaskTracker's job is to follow the JobTracker's orders and update it periodically with its progress.
• TaskTrackers stay in constant communication with the JobTracker through heartbeat signals.
• TaskTracker failure is not considered fatal: when a TaskTracker becomes unresponsive, the JobTracker assigns its tasks to another node.
24. Sqoop
• Sqoop is an open-source tool provided by Apache.
• It is used to import data from relational databases into Hadoop and to export it back.
25. Why is Sqoop used?
• For Hadoop developers, the real work starts after data is loaded into HDFS.
• Data residing in an RDBMS needs to be transferred to HDFS, and results may need to be transferred back to the RDBMS.
• Sqoop uses the MapReduce framework to import and export the data, which provides parallelism as well as fault tolerance.
• Sqoop makes developers' lives easier by providing a command-line interface.
• Developers just need to provide basic information such as the source, the destination, and the database authentication details in the sqoop command, as in the sketch below.
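A minimal sketch of such a command (the JDBC URL, credentials, table, and target directory are illustrative assumptions):

sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username demo_user \
  --password demo_pass \
  --table customers \
  --target-dir /user/demo/customers \
  --num-mappers 4

Each of the four parallel mappers imports a slice of the table into HDFS; sqoop export works symmetrically, reading files from HDFS and writing rows back into an RDBMS table.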