Big data

Prepared By:
JASKOMAL KAUR
CSE(7TH SEM)
ROLL NO:21/14

• ‘Big Data’ is similar to ‘small data’, but bigger in size.
• Because the data is bigger, it requires different approaches:
◦ Techniques, tools, and architecture
• The aim is to solve new problems, or old problems in a better way.
• Big Data refers to data that cannot be analyzed with traditional computing techniques.

• Big Data may well be the Next Big Thing in the IT world.
• Big data burst upon the scene in the first decade of the 21st century.
• The first organizations to embrace it were online and startup firms such as Google, eBay, LinkedIn, and Facebook.
• Big data can bring about dramatic cost and time reductions.

• Volume: data quantity
• Velocity: data speed
• Variety: data types

• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A single flight across the US can generate 240 terabytes of flight data.
• The Internet of Things (IoT): connected devices and the data they create and consume.
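To put the first two figures side by side, the short Python sketch below works out how many year-2000 PCs the quoted daily Facebook volume would fill. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes); the function name is only illustrative.

```python
# Decimal storage units (an assumption; binary units would shift the result slightly).
GB = 10**9
TB = 10**12

def pcs_filled_per_day(daily_bytes: int, pc_bytes: int) -> int:
    """Return how many PCs of size pc_bytes the daily volume would fill."""
    return daily_bytes // pc_bytes

pc_storage_2000 = 10 * GB          # typical PC in 2000, as quoted above
facebook_daily_ingest = 500 * TB   # daily ingest, as quoted above

print(pcs_filled_per_day(facebook_daily_ingest, pc_storage_2000))
```

Under these assumptions, one day of ingest corresponds to tens of thousands of such PCs, which is the scale gap the Volume slide is pointing at.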

Mobile Devices
Readers/Scanners
Science facilities
Microphones
Social Media
Cameras
Programs/Software

• Clickstreams and ad impressions capture user behavior at millions of events per second.
• High-frequency stock trading algorithms reflect market changes within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real time.

• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured data, fewer updates, or a predictable, consistent data structure.
• Big Data analysis includes different types of data.

HADOOP TECHNOLOGY
What is Hadoop Technology?
• The most well-known technology used for Big Data is Hadoop.
• It is a large-scale batch data processing system.
• Distributed cluster system
• Platform for massively scalable applications
• Enables parallel data processing

Developers of Hadoop Technology:
Michael J. Cafarella
Doug Cutting

Hadoop Features
• Hadoop provides access to the file systems.
• The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop.

Introduction to Hadoop
• HDFS
◦ Hadoop Distributed File System
◦ Provides high-throughput access to application data.
◦ Runs on large clusters of commodity machines.
◦ Is used to store large datasets.
• MapReduce
◦ Distributed data processing model and execution environment that runs on large clusters of commodity machines.
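The MapReduce model can be illustrated with a small single-machine sketch in Python. This is not Hadoop's API; it only mimics the three stages (map, shuffle, reduce) on a word-count task, and all function names here are illustrative assumptions. On a real cluster, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, both stages can be split across many machines, which is what makes the model suitable for large clusters.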

Benefits of Hadoop
• Cost savings with efficient and reliable data processing
• Provides an economically scalable solution
• Stores and processes large amounts of data
• It is deployed on industry-standard servers rather than expensive specialized data storage systems.
• Parallel processing of huge amounts of data across inexpensive, industry-standard servers.

PIG
• High-level data flow language for exploring very large datasets.
• Provides an engine for executing data flows in parallel on Hadoop.
• Compiler that produces sequences of MapReduce programs.
• Operates on files in HDFS.
• Key Properties of Pig:
◦ Ease of programming: It is trivial to achieve parallel execution of simple, embarrassingly parallel data analysis tasks.
◦ Optimization opportunities: The system can optimize execution automatically, which allows the user to focus on semantics rather than efficiency.
◦ Extensibility: Users can create their own functions to do special-purpose processing.

Who is using Pig?
Source: http://wiki.apache.org/pig/PoweredBy

Pig Execution Modes
• Local Mode
◦ Needs access to a single machine.
◦ All files are installed and run using your local host and file system.
◦ Is invoked by using the -x local flag:
   pig -x local
• MapReduce Mode
◦ MapReduce mode is the default mode.
◦ Needs access to a Hadoop cluster and an HDFS installation.
◦ Can be invoked with just pig, or explicitly with the -x mapreduce flag:
   pig
   pig -x mapreduce

• Scalable: It can reliably store and process petabytes.
• Economical: It distributes the data and processing across
clusters of commonly available computers (in
thousands).
• Efficient: By distributing the data, it can process it in
parallel on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of
data and automatically redeploys computing tasks when
failures occur.
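The reliability point can be sketched with a toy model in Python. This is an illustration only, not how HDFS actually places blocks (real placement is rack-aware); the function names, node names, and round-robin rule are assumptions. The replication factor of 3 matches the HDFS default.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy rule)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]

if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]           # hypothetical node names
    placement = place_replicas(["b0", "b1", "b2"], nodes)
    # With three copies per block, losing two nodes leaves every block readable.
    print(readable_blocks(placement, ["n1", "n2"]))
```

In this toy setup a block becomes unreadable only when every node holding a copy has failed, which is why keeping multiple copies lets work be rescheduled onto surviving nodes.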

• Hive resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Hive was initially developed by Facebook; later the Apache Software Foundation took it up.

All the data types in Hive are classified into four types:
• Column Types
• Literals
• Null Values
• Complex Types

Olympic data set: https://drive.google.com/file/d/0B1QaXx7tpw3SaEE3bEFTQTMzNzg/view?usp=sharing
Olympic Data Analysis Using Hive
• Using the dataset, list the total number of medals won by each country in swimming.
• Display the number of medals India won, year-wise.
• Find the total number of medals each country won, and display the country name along with the total.
• Find the number of gold medals each country won.

Creation of Table in Hive and Loading of Data
create table olympic (athlete STRING, age INT, country STRING, year STRING, closing STRING, sport STRING, gold INT, silver INT, bronze INT, total INT)
row format delimited
fields terminated by '\t'
stored as textfile;
Load from the local file system:
load data local inpath '/home/acadgild/Downloads/olympic_data.csv' into table olympic;

1. Using the dataset, list the total number of medals won by each country in swimming.
select country, SUM(total) from olympic where sport = 'Swimming' GROUP BY country;
2. Display the number of medals India won, year-wise.
select year, SUM(total) from olympic where country = 'India' GROUP BY year;

3. Find the total number of medals each country won, and display the country name along with the total.
select country, SUM(total) from olympic GROUP BY country;
4. Find the number of gold medals each country won.
select country, SUM(gold) from olympic GROUP BY country;
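The four queries can be tried on a small scale without a Hive installation. The sketch below, in Python with the standard sqlite3 module, runs equivalent SQL against an in-memory table. The rows and the names CountryA and CountryB are invented placeholders, not values from the Olympic dataset, and SQLite column types stand in for Hive's STRING and INT.

```python
import sqlite3

# Invented placeholder rows (NOT taken from the real Olympic dataset).
ROWS = [
    ("Athlete 1", 23, "CountryA", "2008", "08-24-08", "Swimming", 2, 0, 1, 3),
    ("Athlete 2", 25, "CountryA", "2012", "08-12-12", "Swimming", 1, 1, 0, 2),
    ("Athlete 3", 30, "CountryB", "2008", "08-24-08", "Wrestling", 0, 1, 0, 1),
    ("Athlete 4", 21, "CountryB", "2012", "08-12-12", "Swimming", 0, 0, 1, 1),
]

def build_table():
    """Create an in-memory table shaped like the Hive table and load the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE olympic (athlete TEXT, age INTEGER, country TEXT, "
        "year TEXT, closing TEXT, sport TEXT, gold INTEGER, silver INTEGER, "
        "bronze INTEGER, total INTEGER)"
    )
    conn.executemany("INSERT INTO olympic VALUES (?,?,?,?,?,?,?,?,?,?)", ROWS)
    return conn

def run(conn, sql, params=()):
    """Run a two-column aggregate query and return it as a {key: sum} dict."""
    return dict(conn.execute(sql, params).fetchall())

conn = build_table()
swimming = run(conn, "SELECT country, SUM(total) FROM olympic "
                     "WHERE sport = ? GROUP BY country", ("Swimming",))
by_year = run(conn, "SELECT year, SUM(total) FROM olympic "
                    "WHERE country = ? GROUP BY year", ("CountryA",))
totals = run(conn, "SELECT country, SUM(total) FROM olympic GROUP BY country")
golds = run(conn, "SELECT country, SUM(gold) FROM olympic GROUP BY country")

print(swimming, by_year, totals, golds, sep="\n")
```

Each query groups rows by one column and sums another, so the same pattern covers all four tasks; only the filter and the summed column change.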

• Major companies such as Facebook, Amazon, and Yahoo are adopting Hadoop, and many more names may join the list in the future.
• This technology has a bright future because the need for data grows day by day, and security issues are also a major concern.
• Hence Hadoop is a well-suited approach for handling large data in a smart way, and its future is bright.
