Name of the Subject : BIG DATA & ANALYTICS
Notes Prepared by : Dr. S. Pradeep Kumar Kenny
Unit V
Pig: Introduction to PIG, Execution Modes of Pig, Comparison of Pig with Databases, Grunt,
Pig Latin, User Defined Functions, Data Processing operators.
Hive: Hive Shell, Hive Services, Hive Metastore, Comparison with Traditional Databases,
HiveQL, Tables, Querying Data and User Defined Functions.
Hbase: HBasics, Concepts, Clients, Example, Hbase Versus RDBMS.
Big SQL Data Analytics with R Machine Learning: Introduction, Supervised Learning,
Unsupervised Learning, Collaborative Filtering.
PIG
Apache Pig is a high-level data flow platform for executing MapReduce
programs on Hadoop. It was developed by Yahoo. The language used for Pig is Pig Latin.
Pig can handle any type of data, i.e., structured, semi-structured, or
unstructured, and stores the corresponding results in the Hadoop Distributed File System. Every
task that can be achieved using Pig can also be achieved by writing MapReduce programs directly in Java.
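As a brief sketch of what a Pig Latin script looks like (the file name student.txt and its fields are hypothetical):

```pig
-- Load a tab-delimited file into a relation with a declared schema
students = LOAD 'student.txt' USING PigStorage('\t')
           AS (name:chararray, marks:int);

-- Keep only the rows with marks of 60 or more
passed = FILTER students BY marks >= 60;

-- Print the resulting relation to the console
DUMP passed;
```

Each statement names a relation; Pig compiles the whole data flow into MapReduce jobs only when an output operator such as DUMP or STORE is reached.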
Features of Apache Pig
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig
makes this process easy: Pig queries are converted to MapReduce jobs internally.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
3) Extensibility
Users can write user-defined functions containing their own logic to execute over the
data set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators, such as sort, filter, and join.
Differences between Apache MapReduce and PIG
Apache MapReduce | Apache PIG
It is a low-level data processing tool. | It is a high-level data flow tool.
It requires developing complex programs in Java or Python. | It does not require developing complex programs.
It is difficult to perform data operations in MapReduce. | It provides built-in operators for data operations such as union, sorting, and ordering.
It doesn't allow nested data types. | It provides nested data types such as tuple, bag, and map.
Advantages of Apache Pig
• Less code - Pig requires fewer lines of code to perform an operation.
• Reusability - Pig code is flexible enough to be reused.
• Nested data types - The Pig provides a useful concept of nested data types like tuple,
bag, and map.
Apache Pig Run Modes
Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
• It executes in a single JVM and is used for development, experimentation, and
prototyping.
• Here, files are loaded and run on the localhost.
• The local mode works on the local file system. The input and output data are stored in the
local file system.
MapReduce Mode
• The MapReduce mode is also known as Hadoop Mode.
• It is the default mode.
• In this mode, Pig translates Pig Latin into MapReduce jobs and executes them on the cluster.
• It can be executed against a pseudo-distributed or fully distributed Hadoop installation.
• Here, the input and output data are present on HDFS.
Ways to execute Pig Program
A Pig program can be executed in the following ways in both local and MapReduce mode:
• Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt
shell, run the pig command. Once the Grunt shell is running, we can enter Pig Latin
statements and commands interactively at the command line.
• Batch Mode - In this mode, we can run a script file having a .pig extension. These files
contain Pig Latin commands.
• Embedded Mode - In this mode, we can define our own functions, called UDFs (User
Defined Functions), from a host programming language such as Java or Python.
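The first two modes above are invoked from the command line roughly as follows (the script name is hypothetical):

```shell
# Interactive mode: start the Grunt shell
pig -x local        # run against the local file system
pig -x mapreduce    # run against the Hadoop cluster (the default; "pig" alone is equivalent)

# Batch mode: run a Pig Latin script file
pig -x local myscript.pig
```

The -x flag selects the execution mode; everything else about the script stays the same in both modes.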
Pig Latin
Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is
a textual language that abstracts the programming from the Java MapReduce idiom into a
higher-level notation.
Pig Latin Statements
Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation
as input and produces another relation as output.
• It can span multiple lines.
• Each statement must end with a semicolon.
• It may include expressions and schemas.
• By default, these statements are processed using multi-query execution.
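The following sketch (input file and fields hypothetical) shows these properties: the first statement spans multiple lines, and every statement ends with a semicolon:

```pig
-- A single statement may span multiple lines
orders = LOAD 'orders.csv' USING PigStorage(',')
         AS (id:int, customer:chararray, amount:double);

-- Group the rows and compute a per-customer total
grouped = GROUP orders BY customer;
totals  = FOREACH grouped GENERATE group AS customer, SUM(orders.amount) AS total;

-- Multi-query execution lets Pig plan all statements up to this output together
STORE totals INTO 'totals_out';
```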
Pig Latin Conventions
Convention Description
( ) Parentheses can enclose one or more items. They are also used to
indicate the tuple data type.
Example - (10, xyz, (3,6,9))
[ ] Straight brackets can enclose one or more items. They are also used to
indicate the map data type.
Example - [INNER | OUTER]
{ } Curly brackets enclose two or more items. They are also used to
indicate the bag data type.
Example - { block | nested_block }
... The horizontal ellipsis indicates that you can repeat a portion of the
code.
Example - cat path [path ...]
Pig Latin Data Types
Simple Data Types
Type Description
int It defines the signed 32-bit integer.
Example - 2
long It defines the signed 64-bit integer.
Example - 2L or 2l
float It defines a 32-bit floating-point number.
Example - 2.5F or 2.5f or 2.5e2F
double It defines a 64-bit floating-point number.
Example - 2.5 or 2.5e2 or 2.5E2
chararray It defines character array in Unicode UTF-8 format.
Example - javatpoint
bytearray It defines the byte array.
boolean It defines the boolean type values.
Example - true/false
datetime It defines values in datetime format.
Example - 1970-01-01T00:00:00.000+00:00
biginteger It defines Java BigInteger values.
Example - 5000000000000
bigdecimal It defines Java BigDecimal values.
Example - 52.232344535345
Complex Types
Type Description
tuple It defines an ordered set of fields.
Example - (15,12)
bag It defines a collection of tuples.
Example - {(15,12), (12,15)}
map It defines a set of key-value pairs.
Example - [open#apache]
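A minimal sketch of how these complex types appear in a LOAD schema (the file name and field names are hypothetical):

```pig
-- t is a tuple, bg is a bag of tuples, m is an untyped map
data = LOAD 'complex.txt'
       AS (t:tuple(p:int, q:int),
           bg:bag{row:tuple(x:int, y:int)},
           m:map[]);
```

Tuple fields are accessed by position or name, bag contents are processed with FOREACH, and map values are looked up with the # operator (e.g. m#'open').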
Pig Data Types
Apache Pig supports many data types. A list of Apache Pig Data Types with description and
examples are given below.
Type | Description | Example
int | Signed 32-bit integer | 2
long | Signed 64-bit integer | 15L or 15l
float | 32-bit floating point | 2.5f or 2.5F
double | 64-bit floating point | 1.5 or 1.5e2 or 1.5E2
chararray | Character array (string) | hello javatpoint
bytearray | BLOB (byte array) |
tuple | Ordered set of fields | (12,43)
bag | Collection of tuples | {(12,43),(54,28)}
map | Collection of key-value pairs | [open#apache]
Hive
Apache Hive is a data warehouse system for Hadoop. It provides
the functionality of reading, writing, and managing large
datasets residing in distributed storage. It runs SQL-like
queries called HQL (Hive Query Language), which get internally
converted to MapReduce jobs. Hive was developed by Facebook.
Using Hive, we can skip the requirement of the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).
Features of Hive
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce
or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs) through which users can provide their own functionality.
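A short HiveQL sketch of the workflow these features support (table name, file path, and columns are hypothetical):

```sql
-- Create a table backed by comma-delimited text files
CREATE TABLE employee (id INT, name STRING, salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load a local file into the table
LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;

-- An HQL query; Hive compiles it into MapReduce (or Spark) jobs
SELECT name, salary FROM employee WHERE salary > 50000;
```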
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.
Differences between Hive and Pig
Hive Pig
Hive is commonly used by Data Analysts. Pig is commonly used by programmers.
It uses SQL-like queries (HQL). It uses a data-flow language (Pig Latin).
It can handle structured data. It can handle semi-structured data.
It works on server-side of HDFS cluster. It works on client-side of HDFS cluster.
Hive is slower than Pig. Pig is comparatively faster than Hive.
Hive Architecture
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients such as:
o Thrift Server - It is a cross-language service provider platform that serves requests
from all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications.
The JDBC driver is implemented by the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to
Hive.
Hive Services
The following are the services provided by Hive:
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structure information of
the various tables and partitions in the warehouse. It also includes metadata of columns
and their types, the serializers and deserializers used to read and write data, and the
corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts HiveQL
statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of
MapReduce tasks and HDFS tasks. In the end, the execution engine executes the
incoming tasks in the order of their dependencies.
HIVE Data Types
Hive data types are categorized into numeric types, string types, misc types, and complex types.
A list of Hive data types is given below.
Integer Types
Type | Size | Range
TINYINT | 1-byte signed integer | -128 to 127
SMALLINT | 2-byte signed integer | -32,768 to 32,767
INT | 4-byte signed integer | -2,147,483,648 to 2,147,483,647
BIGINT | 8-byte signed integer | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Decimal Type
Type | Size | Description
FLOAT | 4-byte | Single-precision floating-point number
DOUBLE | 8-byte | Double-precision floating-point number
Date/Time Types
TIMESTAMP
o It supports the traditional UNIX timestamp with optional nanosecond precision.
o As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
o As a floating-point numeric type, it is interpreted as a UNIX timestamp in seconds with
decimal precision.
o As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9
decimal places of precision).
DATES
The DATE value is used to specify a particular year, month, and day, in the form YYYY-MM-DD.
However, it does not provide the time of day. The range of the DATE type lies between
0000-01-01 and 9999-12-31.
String Types
STRING
A string is a sequence of characters. Its values can be enclosed within single quotes (') or
double quotes (").
Varchar
Varchar is a variable-length type whose length lies between 1 and 65535, which specifies
the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255.
Complex Type
Type | Description | Example
Struct | Similar to a C struct or an object; fields are accessed using "dot" notation. | struct('James','Roy')
Map | A collection of key-value tuples; fields are accessed using array notation. | map('first','James','last','Roy')
Array | A collection of values of the same type, indexable using zero-based integers. | array('James','Roy')
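A sketch of how these complex types are declared and accessed in HiveQL (table and column names are hypothetical):

```sql
CREATE TABLE person (
  name   STRUCT<first:STRING, last:STRING>,
  phones MAP<STRING, STRING>,
  pets   ARRAY<STRING>
);

-- Dot notation for struct fields; bracket notation for map keys and array indexes
SELECT name.first, phones['home'], pets[0] FROM person;
```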
HBase
HBase is an open-source, sorted-map data store built on Hadoop. It is column-oriented and
horizontally scalable.
It is based on Google's Bigtable. It has a set of tables which keep data in key-value format.
HBase is well suited for sparse data sets, which are very common in big data use cases. HBase
provides APIs enabling development in practically any programming language. It is a part of
the Hadoop ecosystem that provides random, real-time read/write access to data in the
Hadoop Distributed File System.
Why HBase
o An RDBMS gets exponentially slower as the data becomes large.
o It expects data to be highly structured, i.e., to fit a well-defined schema.
o Any change in schema might require downtime.
o For sparse datasets, there is too much overhead in maintaining NULL values.
Features of Hbase
o Horizontally scalable: You can add any number of columns at any time.
o Automatic failover: Automatic failover is a facility that allows a system
administrator to automatically switch data handling to a standby system in the event
of a system failure.
o Integration with the MapReduce framework: All the commands and Java code
internally use MapReduce to do the task, and it is built over the Hadoop
Distributed File System.
o It is a sparse, distributed, persistent, multidimensional sorted map, indexed by
row key, column key, and timestamp.
o It is often referred to as a key-value store, a column-family-oriented database, or a store of
versioned maps of maps.
o Fundamentally, it is a platform for storing and retrieving data with random access.
o It doesn't care about data types (you can store an integer in one row and a string in another
for the same column).
o It doesn't enforce relationships within your data.
o It is designed to run on a cluster of computers built using commodity hardware.
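The schema-light, key-value model described above can be seen directly in the HBase shell (table, row, and column names are hypothetical):

```shell
# Start the interactive shell, then run the commands below inside it
hbase shell

create 'users', 'info'                       # table with one column family
put 'users', 'row1', 'info:name', 'James'    # write one cell, no schema needed
get 'users', 'row1'                          # random-access read by row key
```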
HBase Data Model
HBase Read
A read against HBase must be reconciled between the HFiles, the MemStore, and the BlockCache.
The BlockCache is designed to keep frequently accessed data from the HFiles in memory so as to
avoid disk reads. Each column family has its own BlockCache. The BlockCache contains data in the
form of 'blocks', the unit of data that HBase reads from disk in a single pass. The HFile is
physically laid out as a sequence of blocks plus an index over those blocks. This means
reading a block from HBase requires only looking up that block's location in the index and
retrieving it from disk.
Block: It is the smallest indexed unit of data and the smallest unit of data that can be read
from disk. The default size is 64 KB.
Scenario where a smaller block size is preferred: performing random lookups. Having
smaller blocks creates a larger index and thereby consumes more memory.
Scenario where a larger block size is preferred: performing frequent sequential scans.
This allows you to save on memory, because larger blocks mean fewer index entries and thus
a smaller index.
Reading a row from HBase requires first checking the MemStore, then the BlockCache;
finally, the HFiles on disk are accessed.
HBase Write
When a write is made, by default it goes into two places:
o the write-ahead log (WAL), also called the HLog, and
o the in-memory write buffer, the MemStore.
Clients don't interact directly with the underlying HFiles during writes; rather, writes go to the
WAL and the MemStore in parallel. Every write to HBase requires confirmation from both the WAL
and the MemStore.
HBase MemStore
o The MemStore is a write buffer where HBase accumulates data in memory before a
permanent write.
o Its contents are flushed to disk to form an HFile when the MemStore fills up.
o It doesn't write to an existing HFile but instead forms a new file on every flush.
o The HFile is the underlying storage format for HBase.
o HFiles belong to a column family (there is one MemStore per column family). A column family
can have multiple HFiles, but the reverse isn't true.
o The size at which the MemStore is flushed is defined in hbase-site.xml by the property
hbase.hregion.memstore.flush.size.
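For reference, the flush threshold is set in hbase-site.xml as shown below (the 128 MB value is illustrative):

```xml
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <!-- Flush the MemStore to a new HFile once it reaches 128 MB (bytes) -->
  <value>134217728</value>
</property>
```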

More Related Content

Similar to Unit V.pdf

Pig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramPig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramViswanath Gangavaram
 
Introduction to pig.
Introduction to pig.Introduction to pig.
Introduction to pig.Triloki Gupta
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...IJCSES Journal
 
A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...ijcses
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
hadoop eco system regarding big data analytics.pptx
hadoop eco system regarding big data analytics.pptxhadoop eco system regarding big data analytics.pptx
hadoop eco system regarding big data analytics.pptxmrudulasb
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache PigSachin Vakkund
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopJeyamariappan Guru
 
A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsKrishnaVeni451953
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemRob Vesse
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 

Similar to Unit V.pdf (20)

Pig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramPig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaram
 
Introduction to pig.
Introduction to pig.Introduction to pig.
Introduction to pig.
 
43_Sameer_Kumar_Das2
43_Sameer_Kumar_Das243_Sameer_Kumar_Das2
43_Sameer_Kumar_Das2
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
 
A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...
 
Apache pig
Apache pigApache pig
Apache pig
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
hadoop eco system regarding big data analytics.pptx
hadoop eco system regarding big data analytics.pptxhadoop eco system regarding big data analytics.pptx
hadoop eco system regarding big data analytics.pptx
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
 
A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analytics
 
Pig
PigPig
Pig
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 

Recently uploaded

Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 

Recently uploaded (20)

Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

Unit V.pdf

  • 1. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny Unit V Pig: Introduction to PIG, Execution Modes of Pig, Comparison of Pig with Databases, Grunt, Pig Latin, User Defined Functions, Data Processing operators. Hive: Hive Shell, Hive Services, Hive Metastore, Comparison with Traditional Databases, HiveQL, Tables, Querying Data and User Defined Functions. Hbase: HBasics, Concepts, Clients, Example, Hbase Versus RDBMS. Big SQL Data Analytics with R Machine Learning: Introduction, Supervised Learning, Unsupervised Learning, Collaborative Filtering. PIG Pig is a high-level data flow platform for executing Map Reduce programs of Hadoop. It was developed by Yahoo. The language for Pig is pig Latin. Apache Pig is a high-level data flow platform for executing MapReduce programs of Hadoop. The language used for Pig is Pig Latin. Pig can handle any type of data, i.e., structured, semi-structured or unstructured and stores the corresponding results into Hadoop Data File System. Every task which can be achieved using PIG can also be achieved using java used in MapReduce. Features of Apache Pig 1) Ease of programming Writing complex java programs for map reduce is quite tough for non-programmers. Pig makes this process easy. In the Pig, the queries are converted to MapReduce internally. 2) Optimization opportunities It is how tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency. 3) Extensibility A user-defined function is written in which the user can write their logic to execute over the data set. 4) Flexible It can easily handle structured as well as unstructured data. 5) In-built operators It contains various type of operators such as sort, filter and joins.
• MapReduce is a low-level data processing tool, whereas Pig is a high-level data flow tool.
• In MapReduce, it is required to develop complex programs using Java or Python; in Pig, it is not required to develop complex programs.
• It is difficult to perform data operations in MapReduce; Pig provides built-in operators to perform data operations like union, sorting and ordering.
• MapReduce doesn't allow nested data types; Pig provides nested data types like tuple, bag, and map.

Advantages of Apache Pig
• Less code - Pig requires fewer lines of code to perform any operation.
• Reusability - Pig code is flexible enough to be reused.
• Nested data types - Pig provides the useful concept of nested data types like tuple, bag, and map.

Apache Pig Run Modes
Apache Pig executes in two modes: Local Mode and MapReduce Mode.

Local Mode
• It executes in a single JVM and is used for development, experimenting and prototyping.
• Here, files are installed and run using localhost.
• Local mode works on the local file system; the input and output data are stored in the local file system.
MapReduce Mode
• The MapReduce mode is also known as Hadoop mode.
• It is the default mode.
• In this mode, Pig translates Pig Latin into MapReduce jobs and executes them on the cluster.
• It can be executed against a semi-distributed or fully distributed Hadoop installation.
• Here, the input and output data are present on HDFS.

Ways to execute a Pig Program
These are the ways of executing a Pig program in local and MapReduce mode:
• Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, Pig Latin statements and commands can be entered interactively at the command line.
• Batch Mode - In this mode, we can run a script file having a .pig extension. These files contain Pig Latin commands.
• Embedded Mode - In this mode, we can define our own functions, called UDFs (User Defined Functions), using programming languages like Java and Python.

Pig Latin
Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.

Pig Latin Statements
Pig Latin statements are used to process the data. A statement is an operator that accepts a relation as input and produces another relation as output.
• A statement can span multiple lines.
• Each statement must end with a semicolon.
• Statements may include expressions and schemas.
• By default, statements are processed using multi-query execution.
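The statement rules above can be sketched in a short Pig Latin script. The input path and field names here are hypothetical, for illustration only:

```pig
-- Load a tab-separated file from HDFS with an explicit schema
-- (the path and field names below are only illustrative).
students = LOAD '/data/students.txt'
    USING PigStorage('\t')
    AS (name:chararray, age:int, gpa:float);

-- A statement can span multiple lines and must end with a semicolon.
adults = FILTER students BY age >= 18;

-- DUMP triggers execution and prints the relation.
DUMP adults;
```

In the Grunt shell these statements can be typed one at a time (interactive mode), or saved as a .pig file and run as a batch.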
Pig Latin Conventions

( )  Parentheses enclose one or more items. They can also be used to indicate the tuple data type. Example - (10, xyz, (3,6,9))
[ ]  Straight brackets enclose one or more items. They can also be used to indicate the map data type. Example - [INNER | OUTER]
{ }  Curly brackets enclose two or more items. They can also be used to indicate the bag data type. Example - { block | nested_block }
...  Horizontal ellipsis points indicate that a portion of the code can be repeated. Example - cat path [path ...]

Pig Latin Data Types

Simple Data Types
int        Signed 32-bit integer. Example - 2
long       Signed 64-bit integer. Example - 2L or 2l
float      32-bit floating point number. Example - 2.5F or 2.5f or 2.5e2f or 2.5E2F
double     64-bit floating point number. Example - 2.5 or 2.5e2 or 2.5E2
chararray  Character array in Unicode UTF-8 format. Example - javatpoint
bytearray  Byte array (BLOB).
boolean    Boolean values. Example - true/false
datetime   Date/time values. Example - 1970-01-01T00:00:00.000+00:00
biginteger Java BigInteger values. Example - 5000000000000
bigdecimal Java BigDecimal values. Example - 52.232344535345

Complex Types
tuple  An ordered set of fields. Example - (15,12)
bag    A collection of tuples. Example - {(15,12), (12,15)}
map    A set of key-value pairs. Example - [open#apache]
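A hedged sketch of how these complex types can appear in a LOAD schema (the path and field names are hypothetical):

```pig
-- Hypothetical input describing employees: a tuple for the address,
-- a bag of phone-number tuples, and a map of untyped details.
employees = LOAD '/data/employees.txt'
    AS (name:chararray,
        address:tuple(city:chararray, zip:int),
        phones:bag{t:tuple(number:chararray)},
        details:map[]);

-- DESCRIBE prints the schema, including the nested types.
DESCRIBE employees;
```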
Hive
Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries called HQL (Hive Query Language), which are internally converted to MapReduce jobs. Hive was developed by Facebook. Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. Using Hive, we can skip the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).

Features of Hive
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs) through which users can provide their own functionality.

Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.

Differences between Hive and Pig
• Hive is commonly used by data analysts; Pig is commonly used by programmers.
• Hive follows SQL-like queries; Pig follows a data-flow language.
• Hive handles structured data; Pig can also handle semi-structured data.
• Hive works on the server side of an HDFS cluster; Pig works on the client side.
• Hive is slower than Pig.

Hive Architecture
Hive Clients
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:
o Thrift Server - A cross-language service provider platform that serves requests from all programming languages that support Thrift.
o JDBC Driver - Used to establish a connection between Hive and Java applications. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - Allows applications that support the ODBC protocol to connect to Hive.

Hive Services
The following are the services provided by Hive:
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - A central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes metadata about each column and its type, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - Also referred to as the Apache Thrift Server. It accepts requests from different clients and passes them to the Hive driver.
o Hive Driver - Receives queries from different sources such as the web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers them to the compiler.
o Hive Compiler - Parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then executes these tasks in the order of their dependencies.
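A minimal HiveQL sketch of this flow (the table and column names are hypothetical): the SELECT below is the kind of statement the driver hands to the compiler, which turns it into MapReduce jobs.

```sql
-- Hypothetical table of web-server page views.
CREATE TABLE page_views (
    user_id   STRING,
    page_url  STRING,
    view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- An aggregate query; Hive compiles this into MapReduce jobs
-- rather than requiring a hand-written MapReduce program.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url;
```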
Hive Data Types
Hive data types are categorized into numeric types, string types, date/time types, and complex types. A list of Hive data types is given below.

Integer Types
TINYINT   1-byte signed integer  -128 to 127
SMALLINT  2-byte signed integer  -32,768 to 32,767
INT       4-byte signed integer  -2,147,483,648 to 2,147,483,647
BIGINT    8-byte signed integer  -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Floating-Point Types
FLOAT   4-byte  Single-precision floating point number
DOUBLE  8-byte  Double-precision floating point number

Date/Time Types
TIMESTAMP
o Supports the traditional UNIX timestamp with optional nanosecond precision.
o As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
o As a floating-point numeric type, it is interpreted as a UNIX timestamp in seconds with decimal precision.
o As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal places of precision).
DATE
The DATE value specifies a particular year, month and day, in the form YYYY-MM-DD. It does not include the time of day. The range of the DATE type lies between 0000-01-01 and 9999-12-31.

String Types
STRING - A sequence of characters. String values can be enclosed within single quotes (') or double quotes (").
VARCHAR - A variable-length type whose length lies between 1 and 65535, which specifies the maximum number of characters allowed in the string.
CHAR - A fixed-length type whose maximum length is fixed at 255.

Complex Types
STRUCT - Similar to a C struct or an object; fields are accessed using "dot" notation. Example - struct('James','Roy')
MAP - Contains key-value tuples; fields are accessed using array notation. Example - map('first','James','last','Roy')
ARRAY - A collection of values of the same type, indexable using zero-based integers. Example - array('James','Roy')

HBase
HBase is an open-source, sorted-map data store built on Hadoop. It is column-oriented and horizontally scalable, and is based on Google's BigTable. It has a set of tables which keep data in key-value format. HBase is well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs enabling development in practically any programming language. It is part of the Hadoop ecosystem and provides random real-time read/write access to data in the Hadoop File System.
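A hedged HiveQL sketch of these complex types in a table definition (table and column names are hypothetical):

```sql
-- Hypothetical employee table using Hive's complex types.
CREATE TABLE employees (
    name         STRING,
    salary       FLOAT,
    subordinates ARRAY<STRING>,
    deductions   MAP<STRING, FLOAT>,
    address      STRUCT<street:STRING, city:STRING, zip:INT>
);

-- Accessing complex-type fields:
SELECT subordinates[0],    -- array: zero-based index
       deductions['tax'],  -- map: key lookup
       address.city        -- struct: dot notation
FROM employees;
```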
Why HBase
o An RDBMS gets exponentially slower as the data becomes large.
o An RDBMS expects data to be highly structured, i.e. able to fit into a well-defined schema.
o Any change in schema might require downtime.
o For sparse datasets, there is too much overhead in maintaining NULL values.

Features of HBase
o Horizontally scalable: You can add any number of columns at any time.
o Automatic failover: Automatic failover allows a system administrator to automatically switch data handling to a standby system in the event of system compromise.
o Integration with the MapReduce framework: All the commands and Java APIs internally use MapReduce to do their work, and HBase is built over the Hadoop Distributed File System.
o It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column key, and timestamp.
o It is often referred to as a key-value store, a column-family-oriented database, or as storing versioned maps of maps.
o Fundamentally, it is a platform for storing and retrieving data with random access.
o It does not care about data types (storing an integer in one row and a string in another for the same column).
o It does not enforce relationships within your data.
o It is designed to run on a cluster of computers built from commodity hardware.
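The "multidimensional sorted map, indexed by row key, column key, and timestamp" idea can be modeled with a few lines of Python. This is a toy sketch, not the HBase client API:

```python
# A toy model of HBase's logical data model: a sorted map of
# row key -> column key -> timestamp -> value, i.e. "versioned
# maps of maps". This is NOT HBase's implementation or API.
from collections import defaultdict

class ToyTable:
    def __init__(self):
        # row -> column -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        self.rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the newest version, as HBase does by default."""
        versions = self.rows[row][column]
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Yield rows in sorted row-key order, as an HBase scan does."""
        for row in sorted(self.rows):
            yield row

t = ToyTable()
t.put("row1", "cf:name", "old", timestamp=1)
t.put("row1", "cf:name", "new", timestamp=2)
print(t.get("row1", "cf:name"))  # -> new
```

Note that nothing enforces a schema here: any row can hold any set of columns, which is why sparse data costs nothing to store.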
HBase Data Model

HBase Read
A read against HBase must be reconciled between the HFiles, the MemStore and the BlockCache. The BlockCache is designed to keep frequently accessed data from the HFiles in memory so as to avoid disk reads. Each column family has its own BlockCache. The BlockCache holds data in the form of "blocks", the unit of data that HBase reads from disk in a single pass. An HFile is physically laid out as a sequence of blocks plus an index over those blocks, so reading a block from HBase requires only looking up that block's location in the index and retrieving it from disk.

Block: The smallest indexed unit of data and the smallest unit of data that can be read from disk. The default size is 64 KB.

When a smaller block size is preferred: for frequent random lookups. Having smaller blocks creates a larger index and thereby consumes more memory.

When a larger block size is preferred: for frequent sequential scans. Larger blocks mean fewer index entries and thus a smaller index, saving memory.

Reading a row from HBase requires first checking the MemStore, then the BlockCache, and finally the HFiles on disk.
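The lookup order above can be sketched as follows. This is a simplification: a real HBase read merges results across all three sources rather than stopping at the first hit.

```python
# Toy sketch of the HBase read path: check the in-memory MemStore
# first, then the BlockCache, and only then fall back to the HFiles
# on disk. (Real reads merge versions from all three sources.)
def read(row, memstore, blockcache, hfiles):
    if row in memstore:
        return memstore[row]
    if row in blockcache:
        return blockcache[row]
    for hfile in hfiles:              # newest-first in a real system
        if row in hfile:
            value = hfile[row]
            blockcache[row] = value   # cache for the next read
            return value
    return None

memstore = {"r1": "in-memory"}
blockcache = {}
hfiles = [{"r2": "on-disk"}]
print(read("r2", memstore, blockcache, hfiles))  # -> on-disk
```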
HBase Write
When a write is made, by default it goes to two places:
o the write-ahead log (WAL), also called the HLog, and
o the in-memory write buffer, the MemStore.
Clients do not interact directly with the underlying HFiles during writes; instead, writes go to the WAL and the MemStore in parallel. Every write to HBase requires confirmation from both the WAL and the MemStore.

HBase MemStore
o The MemStore is a write buffer where HBase accumulates data in memory before a permanent write.
o Its contents are flushed to disk to form an HFile when the MemStore fills up.
o A flush does not write to an existing HFile but instead forms a new file on every flush.
o The HFile is the underlying storage format for HBase.
o HFiles belong to a column family (there is one MemStore per column family). A column family can have multiple HFiles, but the reverse is not true.
o The size of the MemStore is defined by the hbase-site.xml property hbase.hregion.memstore.flush.size.
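The write path above can be sketched as a toy model: every write is appended to the WAL and buffered in the MemStore, and a full MemStore is flushed as a brand-new HFile. This models the idea only, not HBase's actual implementation:

```python
# Toy sketch of the HBase write path: WAL + MemStore on every put,
# and a flush that creates a NEW immutable HFile (never rewriting
# an existing one).
class ToyRegion:
    def __init__(self, flush_size=2):
        self.wal = []        # append-only log for crash recovery
        self.memstore = {}   # in-memory write buffer
        self.hfiles = []     # each flush creates a new immutable HFile
        self.flush_size = flush_size

    def put(self, row, value):
        self.wal.append((row, value))   # write-ahead log first...
        self.memstore[row] = value      # ...and the MemStore
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def flush(self):
        # A flush never rewrites an existing HFile; it creates a new one.
        self.hfiles.append(dict(self.memstore))
        self.memstore.clear()

r = ToyRegion(flush_size=2)
r.put("r1", "a")
r.put("r2", "b")   # triggers a flush
print(len(r.hfiles), len(r.memstore))  # -> 1 0
```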