Prepared By:
JASKOMAL KAUR
CSE(7TH SEM)
ROLL NO:21/14
• Big data is like "small data", only bigger in size.
• Because the data is bigger, it requires different approaches:
◦ Techniques, tools and architecture
• The aim is to solve new problems, or old problems in a better way.
• Big data is data that cannot be analyzed with traditional computing techniques.
• Big data may well be the next big thing in the IT world.
• Big data burst upon the scene in the first decade of the 21st century.
• The first organizations to embrace it were online and startup firms such as Google, eBay, LinkedIn, and Facebook.
• Big data can bring about dramatic cost and time reductions.
Volume
• Data quantity
Velocity
• Data speed
Variety
• Data types
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A single flight across the US can generate 240 terabytes of flight data.
• The Internet of Things (IoT): devices add further volume through the data they create and consume.
Mobile Devices
Readers/Scanners
Science facilities
Microphones
Social Media
Cameras
Programs/ Software
• Clickstreams and ad impressions capture user behavior at millions of events per second.
• High-frequency stock trading algorithms reflect market changes within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real time.
• Big data isn't just numbers, dates, and strings. It is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured data, fewer updates, or a predictable, consistent data structure.
• Big data analysis therefore has to handle many different types of data.
HADOOP TECHNOLOGY
What is Hadoop Technology?
• The best-known technology used for big data is Hadoop.
• It is a large-scale batch data processing system.
• Distributed cluster system
• Platform for massively scalable applications
• Enables parallel data processing
Developers of Hadoop Technology:
• Michael J. Cafarella
• Doug Cutting
Famous Hadoop users
Hadoop Features
• Hadoop provides access to the underlying file systems.
• The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop.
Core components of Hadoop:
Introduction to Hadoop
• HDFS
◦ Hadoop Distributed File System
◦ Provides high-throughput access to application data
◦ Runs on large clusters of commodity machines
◦ Is used to store large datasets
• MapReduce
◦ Distributed data processing model and execution environment that runs on large clusters of commodity machines
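To make the MapReduce model concrete, here is a minimal single-process sketch in Python of the classic word-count job. It only imitates the map, shuffle, and reduce steps in memory; it is not Hadoop's API, and the function names (map_phase, shuffle, reduce_phase, word_count) are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster the map and reduce calls run in parallel on many
    # machines; here they simply run one after another.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data needs big tools", "hadoop stores big data"])
print(counts)
```

Because each map call looks at one line only and each reduce call looks at one key only, the work can be split across machines without the pieces needing to talk to each other, which is what lets the model scale.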
Benefits of Hadoop
• Cost-saving, efficient, and reliable data processing
• Provides an economically scalable solution
• Stores and processes large amounts of data
• It is deployed on industry-standard servers rather than expensive, specialized data storage systems.
• Parallel processing of huge amounts of data across inexpensive, industry-standard servers
PIG
• High-level data flow language for exploring very large datasets
• Provides an engine for executing data flows in parallel on Hadoop
• Compiler that produces sequences of MapReduce programs
• Operates on files in HDFS
• Key properties of Pig:
◦ Ease of programming: It is trivial to achieve parallel execution of simple, embarrassingly parallel data analysis tasks.
◦ Optimization opportunities: The system can optimize execution automatically, allowing the user to focus on semantics rather than efficiency.
◦ Extensibility: Users can create their own functions to do special-purpose processing.
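The idea of a data flow can be sketched in plain Python. The example below is not Pig Latin and does not run on Hadoop; it only mirrors the typical shape of a Pig script (load, filter, group, then compute a value per group). The page-view records and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical page-view records: (user, url, seconds_on_page).
records = [
    ("u1", "/home", 12),
    ("u2", "/home", 3),
    ("u1", "/docs", 40),
    ("u3", "/docs", 25),
    ("u2", "/docs", 1),
]

# Step 1, filter: keep only views longer than 2 seconds.
filtered = [r for r in records if r[2] > 2]

# Step 2, group: collect the viewing times for each url.
grouped = defaultdict(list)
for user, url, seconds in filtered:
    grouped[url].append(seconds)

# Step 3, per-group result: (number of views, average seconds) per url.
summary = {url: (len(s), sum(s) / len(s)) for url, s in grouped.items()}
print(summary)
```

Each step takes a collection and produces a new one, so the script states what should happen to the data and leaves the question of how to run it in parallel to the engine.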
Who is using Pig?
Source: http://wiki.apache.org/pig/PoweredBy
Pig Execution Modes
• Local mode
◦ Needs access to a single machine only
◦ All files are installed and run using your local host and file system
◦ Is invoked by using the -x local flag:
  pig -x local
• MapReduce mode
◦ MapReduce mode is the default mode
◦ Needs access to a Hadoop cluster and an HDFS installation
◦ Can be invoked with the -x mapreduce flag or with just pig:
  pig
  pig -x mapreduce
Why Pig?
Equivalent Java MapReduce Code
• Scalable: Hadoop can reliably store and process petabytes.
• Economical: It distributes the data and processing across clusters of commonly available computers (numbering in the thousands).
• Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks when failures occur.
• Hive resides on top of Hadoop to summarize big data, and makes querying and analysis easy.
• Hive was initially developed by Facebook; later the Apache Software Foundation took it up.
All the data types in Hive are classified into four types:
• Column types
• Literals
• Null values
• Complex types
Olympic data set:
https://drive.google.com/file/d/0B1QaXx7tpw3SaEE3bEFTQTMzNzg/view?usp=sharing
Olympic Data Analysis Using Hive
• Using the data set, list the total number of medals won by each country in swimming.
• Display the total number of medals India won, year-wise.
• Find the total number of medals each country won, and display the country name along with the total.
• Find the total number of gold medals each country won.
Creation of Table in Hive and Loading of Data
create table olympic (athelete STRING, age INT, country STRING, year STRING, closing STRING, sport STRING, gold INT, silver INT, bronze INT, total INT)
row format delimited
fields terminated by '\t'
stored as textfile;
Load from the local file system
load data local inpath '/home/acadgild/Downloads/olympic_data.csv' into table olympic;
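To show what a tab-delimited row format with typed columns means in practice, here is a small Python sketch that splits tab-separated lines into the same ten columns and converts the INT columns to integers. It is an illustration only, not how Hive loads data internally. The OlympicRow type, the parse_rows helper, and the two sample lines are made up for this example and are not taken from the real data set.

```python
import csv
import io
from typing import NamedTuple


class OlympicRow(NamedTuple):
    athelete: str  # spelled as in the table definition
    age: int
    country: str
    year: str
    closing: str
    sport: str
    gold: int
    silver: int
    bronze: int
    total: int


def parse_rows(text):
    """Split tab-delimited text into typed rows, one per input line."""
    rows = []
    for f in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append(OlympicRow(
            f[0], int(f[1]), f[2], f[3], f[4], f[5],
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        ))
    return rows


# Made-up sample lines, not taken from the real data set.
sample = (
    "Athlete A\t23\tIndia\t2008\t08-24-08\tShooting\t1\t0\t0\t1\n"
    "Athlete B\t22\tUnited States\t2008\t08-24-08\tSwimming\t2\t1\t0\t3\n"
)
rows = parse_rows(sample)
print(rows[0].country, rows[0].total)
```

The order of the fields in each line has to match the order of the columns in the table definition, since a delimited text format carries no column names of its own.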
1. Using the data set, list the total number of medals won by each country in swimming.
select country, SUM(total) from olympic where sport = 'Swimming' GROUP BY country;
2. Display the total number of medals India won, year-wise.
select year, SUM(total) from olympic where country = 'India' GROUP BY year;
3. Find the total number of medals each country won, and display the country name along with the total.
select country, SUM(total) from olympic GROUP BY country;
4. Find the total number of gold medals each country won.
select country, SUM(gold) from olympic GROUP BY country;
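The four queries can be tried without a Hadoop cluster. The sketch below runs them with Python's built-in sqlite3 module, used here purely as a stand-in for Hive: the GROUP BY and SUM logic is the same, but SQLite is not Hive and the column types differ slightly. The five rows are made-up sample data, not values from the real Olympic data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table olympic (athelete TEXT, age INTEGER, country TEXT, "
    "year TEXT, closing TEXT, sport TEXT, gold INTEGER, silver INTEGER, "
    "bronze INTEGER, total INTEGER)"
)

# Made-up sample rows for illustration only.
sample_rows = [
    ("Athlete A", 23, "India", "2008", "08-24-08", "Shooting", 1, 0, 0, 1),
    ("Athlete B", 25, "India", "2012", "08-12-12", "Wrestling", 0, 1, 0, 1),
    ("Athlete C", 22, "United States", "2008", "08-24-08", "Swimming", 2, 1, 0, 3),
    ("Athlete D", 27, "Australia", "2008", "08-24-08", "Swimming", 0, 1, 1, 2),
    ("Athlete E", 24, "United States", "2012", "08-12-12", "Swimming", 1, 0, 1, 2),
]
conn.executemany("insert into olympic values (?,?,?,?,?,?,?,?,?,?)", sample_rows)

# 1. Total medals per country in swimming.
swimming = dict(conn.execute(
    "select country, SUM(total) from olympic "
    "where sport = 'Swimming' group by country"))

# 2. Total medals India won, year-wise.
india_by_year = dict(conn.execute(
    "select year, SUM(total) from olympic "
    "where country = 'India' group by year"))

# 3. Total medals per country.
totals = dict(conn.execute(
    "select country, SUM(total) from olympic group by country"))

# 4. Total gold medals per country.
golds = dict(conn.execute(
    "select country, SUM(gold) from olympic group by country"))

print(swimming, india_by_year, totals, golds)
```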
• Major companies such as Facebook, Amazon, and Yahoo are adopting Hadoop, and many more names may join the list in future.
• The technology has a bright future because the need for data grows day by day, and security is also a major concern.
• Hence Hadoop is a highly appropriate approach for handling large data in a smart way, and its future is bright.