1- Introduction.
2- What is big data?
3- The characteristics of big data.
4- Handling data.
5- Building a successful big data management cycle.
6- Big data applications.
7- History of Hadoop.
8- The core of Apache Hadoop.
9- Workflow and data movement.
10- Apache Hadoop ecosystem.
2
INTRODUCTION
"Big data" became more than just a technical term for scientists,
engineers, and other technologists.
The term entered the mainstream on a myriad of fronts,
becoming a household word in news, business, health care, and
people's personal lives.
The term even became synonymous with intelligence gathering and
spycraft.
3
These days, the rate of data generation has increased, and so have the
sources that generate such data. Thus, the data has become huge: "big
data".
Traditional data was generated mainly by employees; now, in the era of
massive data, it comes from:
- Employees.
- Users.
- Machines.
All of these constantly generate large amounts of data of different types.
4
What is big data?
Big data is a broad term for data sets so large or complex that
traditional data processing applications are inadequate.
5
Example
Every day, we create 2.5 quintillion bytes of data — so much that
90% of the data in the world today was created in the last two years alone.
- This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals, to name a
few. This data is big data.
6
Big data is not a single technology but a combination of old and new
technologies that helps companies gain actionable insight.
Big data is the capability to manage a huge volume of data:
- at the right speed
- within the right time frame to allow real-time analysis and reaction.
Why is big data important?
7
QUESTIONS
1- What are examples of machines that generate big data?
2- What are examples of employees that generate big data?
3- What are examples of users that generate big data?
8
Examples of machines that generate big data
1- GPS data:
GPS data records the exact position of a device at a
specific moment in time. GPS events can
be transformed easily into position and
movement information.
Example: vehicles on a road network rely on
accurate and sophisticated
processing of GPS information.
9
2- Sensor data:
The availability of low-cost, intelligent
sensors, coupled with the latest 3G and 4G
wireless technology, has driven a dramatic
increase in the volume of sensor data, along with the
need to extract operational intelligence from that data in
real time.
Examples include industrial automation plants and smart metering.
10
Examples of employees that generate big data
11
Examples of users that generate big data
12
THE CHARACTERISTICS OF BIG DATA
13
Big Data is typically broken down by three
characteristics:
✓ Volume: How much data
✓ Velocity: How fast that data is processed
✓ Variety: The various types of data
Big Data
14
The Variety
Big data combines all kinds of data:
- structured data.
- unstructured data.
- semi-structured data.
This kind of data management requires that companies
leverage both their structured and unstructured data.
15
The characteristics of big data
16
[Diagram: the Variety characteristic of big data, spanning structured data
(enterprise systems such as CRM and ERP, data warehouses, databases),
semi-structured data (XML, EDI), and unstructured data (e-mail, audio/video
streams, analog data, GPS tracking information).]
17
Looking at semi-structured data
Semi-structured data is a kind of data that
- falls between structured and unstructured data.
- does not necessarily conform to a fixed schema.
- but may be self-describing and may have simple label/value pairs.
18
Looking at semi-structured data
For example, label/value pairs might include:
<family>=Jones, <mother>=Jane, and
<daughter>=Sarah.
Examples of semi-structured data include:
EDI, SWIFT, and XML.
You can think of them as a sort of payload for processing complex
events.
19
Traditional data & Big data
Traditional Data
- Documents.
- Finances.
- Stock records.
- Personal files.
Big Data
- Photographs.
- Audio & video.
- 3D models.
- Simulations.
- Location data.
20
[Diagram: the Velocity and Volume dimensions of big data. Velocity ranges
from batch (monthly, weekly, daily, hourly) through near real time up to
real time; Volume ranges from megabytes through gigabytes and terabytes
to petabytes and beyond.]
21
The benefit gained from the ability to process large amounts of
information is the main attraction of big data analytics.
This volume presents the most immediate challenge to
conventional IT structures.
It calls for scalable storage and a distributed approach to querying.
The volume
22
The Velocity
- It's not just the velocity of the incoming data:
it's possible to stream fast-moving data into bulk storage
for later batch processing.
- The velocity of big data, coupled with its variety,
causes a move toward real-time observations,
allowing better decision making or quicker action.
23
- The importance lies in the speed of the feedback loop, taking data
from input through to decision.
- A commercial from IBM makes the point that you wouldn't cross the
road if all you had was a five-minute-old snapshot of traffic locations.
There are times when you simply won't be able to wait for a report to
run or a Hadoop job to complete.
Example
24
Product categories for handling streaming data divide into:
1- Established proprietary products, such as:
- IBM's InfoSphere Streams.
2- Less-polished, still-emergent open source frameworks originating in the web industry:
- Twitter's Storm and Yahoo! S4.
Categories for handling streaming data
“Velocity ”
25
Practical examples on big data
Example 1
Example 2
Example 3
- These are good web sites for appreciating how much data is generated
in the world.
26
Different approaches to handling data exist
based on whether
- It is data in motion.
- It is data at rest.
Different approaches to handling data
27
Here's a quick example of each:
- Data at rest would be used by a business analyst to better understand
customers' current buying patterns based on all aspects of the customer
relationship, including sales, social media data, and customer service
interactions.
- Data in motion would be used if a company is able to analyze the quality
of its products during the manufacturing process to avoid costly errors.
28
Managing Big data
With big data, it is now possible to virtualize data, so it can be:
- stored efficiently, utilizing cloud-based storage.
- managed more cost-effectively.
Improvements in network speed and
reliability have removed other physical limitations to managing massive
amounts of data at an acceptable pace.
29
Building a Successful Big Data Management
The cycle of big data management begins with:
- capture.
- organize.
- integrate.
- analyze.
- act.
30
- Data must first be captured,
- then organized and integrated.
- After this phase is successfully implemented,
- the data can be analyzed based on the problem being addressed.
Finally, management takes action based on the outcome of that analysis.
Building a Successful
Big Data
Management
31
The importance of Big data
in
our world & our future
Big data provides a competitive advantage for organizations.
- It helps in making decisions, thus increasing efficiency,
raising profit, and reducing loss.
- Its benefits extend to fields including energy, education, health, and
huge scientific projects like the Human Genome Project
(the study of the entire genetic material of human beings).
32
- Healthcare.
- Manufacturing.
- Management.
- Traffic management.
Big data applications
Some of the emerging applications are in areas such as:
They rely on huge volumes, velocities, and varieties of data to transform
the behavior of a market.
33
In healthcare, a big data application might be able to monitor
premature infants to determine when the data indicates that
intervention is needed.
In manufacturing, a big data application can be used to prevent a
machine from shutting down during a production run.
Example 1
Example 2
34
Let’s summarize some benefits of Big data
Some of the benefits of big data are:
- Increased storage capacity = scalable.
- Increased processing power = real-time.
- Availability of data = fault tolerant.
- Lower cost = commodity hardware.
35
Hadoop was created by Doug Cutting
and Mike Cafarella in 2005. Cutting,
who was working at Yahoo! at the time,
named it after his son's toy elephant. It
was originally developed to support
distribution for the Nutch search engine
project.
History of Hadoop
36
Hadoop is designed to process huge amounts of structured and
unstructured data (terabytes to petabytes) and is implemented
on racks of commodity servers as a Hadoop cluster.
Hadoop is designed to parallelize data processing across computing
nodes to speed computations and hide latency.
37
Apache Hadoop is a set of algorithms:
- An open source software framework written in Java.
- Distributed storage.
- Distributed processing.
- Built from commodity hardware.
- Files are replicated to handle hardware failure.
- Failures are detected and recovered from.
What is Hadoop?
38
Some of Hadoop users
- Facebook.
- IBM.
- Google.
- Yahoo!.
- New York Times.
- Amazon/A9.
- And there are others
39
The Core of Apache Hadoop
At its core, Hadoop has two primary components:
1- A storage part:
- the Hadoop Distributed File System (HDFS).
# can support petabytes of data.
2- A processing part:
- MapReduce.
# computes results in batch.
40
HDFS:
- Stores large files across a commodity cluster,
- typically in the range of gigabytes to terabytes.
- A scalable and portable file system,
- written in Java for the Hadoop framework,
- replicating data across multiple hosts.
Hadoop Distributed File System HDFS
41
- With the default replication value, 3, data is stored on three nodes:
- two on the same rack, and one on a different rack.
- Data nodes can talk to each other:
- to rebalance data.
- to move copies around.
- to keep the replication of data high.
HDFS:
42
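To make the replication discussion concrete, here is a minimal sketch of writing a file to HDFS through the standard Java FileSystem API and requesting a specific replication factor. It is not from the original slides; the NameNode address and path are hypothetical, and in practice the default replication comes from the cluster configuration (dfs.replication).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    // Create the file with a replication factor of 3 (the HDFS default).
    Path path = new Path("/user/demo/sample.txt");
    try (FSDataOutputStream out = fs.create(path, (short) 3)) {
      out.writeUTF("hello HDFS");
    }

    // The replication factor can also be changed after the fact.
    fs.setReplication(path, (short) 2);
  }
}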
Question
- Why do we need to replicate files in HDFS?
To achieve reliability:
if any failure occurs on any node, we can continue
processing.
43
HDFS works by breaking large files into smaller pieces called blocks.
- The blocks are stored on data nodes.
- It is the NameNode's responsibility to know which blocks on which data
nodes make up the complete file.
"It keeps track of where data is physically stored."
- The NameNode acts as a "traffic cop," managing all access to the files.
HDFS
44
How a Hadoop cluster is mapped to hardware.
45
The responsibility of NameNode
- The NameNode acts as a "traffic cop," managing all access to the files,
including:
1- reads of data blocks on the data nodes.
2- writes of data blocks on the data nodes.
3- creation of data blocks on the data nodes.
4- deletion of data blocks on the data nodes.
5- replication of data blocks on the data nodes.
46
NameNode and data nodes:
- They operate in a "loosely coupled" fashion
that allows the cluster elements to behave
dynamically,
- adding (or subtracting) servers as the
demand increases (or decreases).
How a Hadoop cluster is mapped to hardware.
The Relationship
Between
NameNode & DataNodes
47
Are DataNodes also smart?
NameNode is very smart
Data nodes are not very smart
48
Are DataNodes also smart?
- Data nodes are not very smart, but the NameNode is.
- The DataNodes constantly ask the NameNode whether there is
anything for them to do.
- This also tells the NameNode which data nodes are out there and how busy they are.
- Because the NameNode is so critical for correct operation of the cluster, it can and
should be replicated to guard against a single point of failure.
49
Map: distributes a computational
problem across a cluster.
Reduce: the master node collects the
answers to all the sub-problems
and combines them.
[Diagram: a master node distributing copies of the problem to worker nodes]
Map Reduce
50
An example of an inverted index being created in MapReduce
52
// Imports assumed for this fragment (it lives inside a driver class);
// StringUtils here is assumed to be Apache Commons Lang, which splits
// on whitespace.
import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public static class Map
    extends Mapper<LongWritable, Text, Text, Text> {

  private Text documentId;
  private Text word = new Text();

  @Override
  protected void setup(Context context) {
    // Derive the document ID from the name of the input file.
    String filename =
        ((FileSplit) context.getInputSplit()).getPath().getName();
    documentId = new Text(filename);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit a (word, documentId) pair for every token in the line.
    for (String token : StringUtils.split(value.toString())) {
      word.set(token);
      context.write(word, documentId);
    }
  }
}
This shows the mapper code.
53
public static class Map
extends Mapper<LongWritable, Text, Text, Text> {
When you extend the MapReduce Mapper class, you specify the key/value
types for your inputs and outputs. You use the MapReduce default
InputFormat for your job, which supplies keys as byte offsets into the
input file, and values as each line in the file. Your map emits Text
key/value pairs.
The preceding listing shows the mapper code.
54
To cut down on object creation, you create a single Text object
(private Text word = new Text();), which you'll reuse.
A second Text object (private Text documentId;) stores the
document ID (filename) for your input.
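The slides stop at the mapper, so the following is a hedged sketch of what a matching reducer for this inverted index could look like: for each word it concatenates the document IDs collected from all the mappers. The class name and comma delimiter are illustrative assumptions, not taken from the original deck.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class Reduce
    extends Reducer<Text, Text, Text, Text> {

  private Text docIds = new Text();

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Concatenate every document ID observed for this word.
    StringBuilder sb = new StringBuilder();
    for (Text docId : values) {
      if (sb.length() > 0) {
        sb.append(","); // illustrative delimiter
      }
      sb.append(docId.toString());
    }
    docIds.set(sb.toString());
    context.write(key, docIds);
  }
}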
- The InputFormat decides how the file is going to be
broken into smaller pieces for processing,
using a function called InputSplit.
- It then assigns a RecordReader to
transform the raw data for processing by
the map.
- The map then requires two inputs: a key
and a value.
Workflow and data movement
in a small Hadoop cluster
Hadoop MapReduce
55
- Your data is now in a form acceptable to map.
- For each input pair, a distinct instance of map
is called to process the data.
- map and reduce need to work together to
process your data.
- The OutputCollector collects output from the
independent mappers.
The mapping begins
56
- A Reporter function provides information
gathered from map tasks,
- so you know when or whether the map tasks are
complete.
- All this work is being performed on
multiple nodes in the Hadoop cluster
simultaneously.
The mapping begins, cont.
57
- Some of the output may be on a node different from the node where the
reducers for that specific output will run.
- A partitioner and a sort gather and shuffle the intermediate
results.
- Map tasks deliver their results to a specific partition
- as inputs to the reduce tasks.
Workflow and data movement
- After all the map tasks are complete,
- the intermediate results are gathered in the partitions, and
- a shuffle and sort prepare the output for optimal processing by reduce.
58
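As a concrete illustration of the partitioning step just described: Hadoop's default behavior is hash partitioning, which is equivalent to the small Partitioner below. A custom subclass like this (the class name is hypothetical) is only needed when you want to control which reducer receives which keys.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent to Hadoop's default HashPartitioner: keys with the same hash
// always land in the same partition, i.e., at the same reduce task.
public class WordPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Mask off the sign bit so the result is a valid partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}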
Reduce & Combine
- For each output pair, reduce is called to perform its task.
- Reduce gathers its output while all the tasks are processing.
- Reduce can't begin until all the mapping is done, and
- it isn't finished until all instances are complete.
- The output of reduce is a key and a value.
- The OutputFormat takes the key-value pair
- and organizes the output for writing to HDFS.
- The RecordWriter takes the OutputFormat data and writes it to HDFS.
59
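To tie the mapper, reducer, and the I/O classes above together, here is a hedged sketch of a driver. It is not from the deck; the class and path arguments are hypothetical, and it assumes the Map and Reduce classes shown earlier are nested inside this driver class. The Job API calls themselves are the standard Hadoop ones.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inverted index");
    job.setJarByClass(InvertedIndexDriver.class);
    job.setMapperClass(Map.class);      // the mapper shown earlier
    job.setReducerClass(Reduce.class);  // the sketched reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}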
The Benefits of MapReduce
- Hadoop MapReduce is the heart of the Hadoop system.
- MR provides the capabilities you need to break big data into
manageable chunks.
- MR processes data in parallel on your cluster.
- MR makes the data available for user consumption.
- MapReduce does all this work in a highly resilient, fault-tolerant
manner.
60
61
The major utilities of Hadoop
- Apache Hive: SQL-like language and metadata repository.
- Apache Pig: high-level language for expressing data analysis programs.
- Apache HBase: the Hadoop database; random, real-time read/write access.
- Hue: browser-based desktop interface for interacting with Hadoop.
- Oozie: server-based workflow engine for Hadoop activities.
- Sqoop: integrating Hadoop with RDBMS.
- Apache Whirr: library for running Hadoop in the cloud.
- Flume: distributed service for collecting and aggregating log and event data.
- Apache Zookeeper: highly reliable distributed coordination service.
62
Apache HBase
- A distributed, nonrelational, columnar database that utilizes HDFS
as its persistence store.
- Modeled after Google BigTable.
- Layered on Hadoop clusters.
- Capable of hosting very large tables (billions of columns/rows).
- Provides random, real-time read/write access.
- Highly configurable, providing the flexibility to address huge amounts of data
efficiently.
63
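Here is a hedged sketch of HBase's random, real-time read/write access from Java using the standard client API. The table name, column family, and values are hypothetical (they borrow the "autos" example used in the Hive slides later), and a real program would pick up its cluster settings from hbase-site.xml.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("autos"))) { // hypothetical table

      // Random, real-time write: one cell in the "info" column family.
      Put put = new Put(Bytes.toBytes("row-12345"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("maker"),
                    Bytes.toBytes("Ford"));
      table.put(put);

      // Random, real-time read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("row-12345")));
      String maker = Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("maker")));
      System.out.println("maker = " + maker);
    }
  }
}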
64
Mining Big Data with Hive
Hive is a batch-oriented, data-warehousing layer
- built on the core elements of Hadoop (HDFS and MapReduce).
- It provides users who know SQL with HiveQL.
- Hive queries can take several minutes or hours, depending on complexity.
- Hive is best used for data mining and deeper analytics.
- It relies on the Hadoop foundation and is
- very extensible, scalable, and resilient.
65
Hive uses three mechanisms for data organization:
✓ Tables.
✓ Partitions.
✓ Buckets.
Hive supports multitable queries and inserts by
sharing input data within a single HiveQL
statement.
66
Hive tables are the same as RDBMS tables, consisting of rows and columns.
- Tables are mapped to directories in the file system.
- Hive also supports tables stored in other native file systems.
Tables
67
- A Hive table can support one or more partitions.
- Partitions are mapped to subdirectories in the underlying file system and represent
the distribution of data throughout the table.
For example:
If a table is called autos, with a key value of 12345 and a maker value Ford,
the path to the partition would be /hivewh/autos/kv=12345/Ford.
Partitions
68
- Data may further be divided into buckets.
- Buckets are stored as files in the partition directory in the underlying file system.
- Bucketing is based on the hash of a column in the table.
In the preceding example, you might have a bucket called Focus, containing all
the attributes of a Ford Focus auto.
Buckets
69
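As a hedged illustration of the autos example above, the following sketch declares a partitioned, bucketed table and runs a batch query through Hive's JDBC interface (HiveServer2). The connection URL, credentials, column names, and bucket count are assumptions for illustration, not taken from the slides.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port, and database are hypothetical.
    // Requires the hive-jdbc driver on the classpath.
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = con.createStatement()) {

      // Partitioned by key value and maker, bucketed by a hash of model.
      stmt.execute(
          "CREATE TABLE IF NOT EXISTS autos (vin STRING, model STRING) " +
          "PARTITIONED BY (kv STRING, maker STRING) " +
          "CLUSTERED BY (model) INTO 8 BUCKETS");

      // A batch-oriented query; may take minutes depending on data size.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT model, COUNT(*) FROM autos " +
               "WHERE maker = 'Ford' GROUP BY model")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}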
Pig and Pig Latin
Pig makes Hadoop more approachable and usable by non-
developers.
- It is an interactive, or script-based,
- execution environment supporting Pig Latin,
- a language used to express data flows.
- The Pig Latin language supports loading and processing input data to
transform it and produce the desired output.
70
The Pig execution environment has two modes:
✓ Local mode: All scripts are run on a single machine.
Hadoop MapReduce and HDFS are not required.
✓ Hadoop mode: Also called MapReduce mode; all scripts are
run on a given Hadoop cluster.
Pig and Pig Latin
71
- The Pig Latin language is an abstract way to get answers from big data,
- focusing on the data and not on the structure of a custom software program.
- Pig makes prototyping very simple.
For example,
you can run a Pig script on a small representation of your big data
environment
to ensure you are getting the desired results before committing to processing all the data.
Pig and Pig Latin. cont.
72
- Pig programs run in three different ways, all compatible with local
and Hadoop mode:
✓ Script .
✓ Grunt .
✓ Embedded .
Pig and Pig Latin. cont.
73
✓ Script: Simply a file containing Pig Latin commands,
- identified by the .pig suffix (for example, file.pig or myscript.pig).
- The commands are interpreted by Pig and executed in sequential order.
Script
74
✓ Grunt: Grunt is a command interpreter.
- You can type Pig Latin on the grunt command line,
- and Grunt executes the command on your behalf.
- It is very useful for prototyping and "what if" scenarios.
✓ Embedded: Pig programs are executed as part of a Java program
(a sketch follows below).
Grunt & Embedded
75
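Here is a hedged sketch of the embedded mode just mentioned: driving Pig Latin from a Java program through the PigServer API. The script is a simple word count; the input and output names are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
  public static void main(String[] args) throws Exception {
    // ExecType.LOCAL runs without a cluster; use MAPREDUCE for Hadoop mode.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Each registerQuery line is one Pig Latin statement in the data flow.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery(
        "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    // store() triggers execution of the whole data flow.
    pig.store("counts", "wordcount-output");
  }
}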
Pig Latin has a rich syntax. It supports operators for the following
operations:
✓ Loading and storing of data .
✓ Streaming data .
✓ Filtering data .
✓ Grouping and joining data .
✓ Sorting data .
✓ Combining and splitting data .
Pig Latin supports a wide variety of types, expressions, functions, diagnostic
operators, macros, and file system commands.
Pig and Pig Latin. cont.
76
77
Apache Sqoop
- Sqoop (SQL-to-Hadoop) is
- a tool that offers the capability to extract data from non-
Hadoop data stores,
- transform the data into a form usable by Hadoop,
- and then load the data into HDFS.
- This process is called ETL, for Extract, Transform, and
Load.
- Sqoop commands are executed one at a time.
78
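Sqoop is normally driven from the command line, one command at a time. The following hedged sketch shows an equivalent invocation from Java via Sqoop 1's runTool entry point, assuming that client library is on the classpath; the JDBC URL, user, table, and target directory are hypothetical.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
    String[] sqoopArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost/sales", // hypothetical source database
        "--username", "etl",
        "--table", "orders",
        "--target-dir", "/user/demo/orders"       // HDFS destination
    };
    int exitCode = Sqoop.runTool(sqoopArgs);
    System.exit(exitCode);
  }
}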
Key features of Sqoop
✓ Bulk import:
- Sqoop can import individual tables or entire databases into HDFS.
- The data is stored in native directories and files in the HDFS file system.
✓ Direct input:
- Sqoop can import and map SQL databases directly into Hive and HBase.
79
✓ Data interaction:
- Sqoop can generate Java classes
- so you can interact with the data programmatically.
✓ Data export:
Sqoop can export data directly from HDFS into a relational database.
Key features of Sqoop, cont.
80
How Apache Sqoop works
- Sqoop works by looking at the database you want to import
- and selecting an appropriate import function for the source data.
- It then recognizes the input and reads the metadata for the table or database,
- and creates a class definition of your input requirements.
82
- Zookeeper is Hadoop's way of coordinating all the elements of distributed
applications.
- It is simple, but its features are powerful.
- It manages groups of nodes in service to a single distributed application.
- It is best implemented across racks.
83
Some of the capabilities of Zookeeper are as follows:
✓ Process synchronization .
✓ Configuration management .
✓ Self-election .
✓ Reliable messaging .
The capabilities of Zookeeper
Zookeeper
84
The capabilities of Zookeeper. Cont.
✓ Process synchronization:
- Zookeeper coordinates the starting and stopping of multiple nodes in the cluster.
- This ensures that all processing occurs in the intended order.
- Only when an entire process group is complete can processing continue.
85
✓ Configuration management:
- Zookeeper can be used to send configuration attributes to any or all nodes in the
cluster.
- When processing is dependent on particular resources being available
on all nodes,
- it ensures the consistency of the configurations.
The capabilities of Zookeeper. Cont.
86
✓ Self-election:
- Zookeeper understands the makeup of the cluster
- and can assign a "leader" role to one of the nodes.
- The leader/master handles all client requests on behalf of the
cluster.
- If the leader node fails, another leader will be elected from the
remaining nodes.
The capabilities of Zookeeper. Cont.
87
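A hedged sketch of the self-election idea using the plain ZooKeeper Java client: each candidate races to create the same ephemeral znode, and the one that succeeds becomes leader; when it dies, its session expires, the znode vanishes, and the others can retry. The ensemble address, znode path, and node ID are hypothetical, and production code would use a proper watcher loop (or a recipe library such as Apache Curator) rather than this bare skeleton.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical ensemble address; 3000 ms session timeout; no-op watcher.
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> {});
    try {
      // EPHEMERAL: the znode disappears automatically if this client dies,
      // which is what lets the cluster elect a new leader on failure.
      zk.create("/election-leader", "node-1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      System.out.println("I am the leader");
    } catch (KeeperException.NodeExistsException e) {
      // Someone else won the race; set a watch so we hear about its demise.
      zk.exists("/election-leader", true);
      System.out.println("Following the current leader");
    }
  }
}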
✓ Reliable messaging: Even though workloads in Zookeeper are
- loosely coupled,
- Zookeeper offers a publish/subscribe capability
- that allows the creation of a queue.
- The queue guarantees message delivery even in the case of node failure.
The capabilities of Zookeeper. Cont.
88
The Benefits of Hadoop
- Hadoop represents the most pragmatic way to allow companies to manage huge
volumes of data easily.
- It allows big problems to be broken down into smaller elements
- so analysis can be done quickly and cost-effectively.
- Big data is processed in parallel.
- The small pieces of information are then regrouped to present the results.
89
90
Any Questions?
91
92