The Apache Hadoop HIVE
Omoyayi Ibrahim Omodamilola
Student No.: 20174831
PhD Biomedical Engineering
Outline
• Big Data
• History of Database (NoSQL vs SQL)
• NewSQL database
• SQL
• NoSQL
• Factors Affecting the Selection of a database
• Hadoop Hive
– Functions of Hive on Hadoop
– Hive vs Java vs Pig
• Hadoop Distributed File System
• Hive Architecture
• Work Flow of Hive
• List of References
INTRODUCTION
• Facebook began developing Apache Hive on Hadoop in 2007 in
response to its data growth.
• The ETL system then in use began to fail within a few years as
more people joined Facebook.
• In August 2008, Facebook decided to move to a more scalable
open-source Hadoop environment: Hive.
• Facebook, Netflix, and Amazon support the Apache Hive SQL
dialect, now known as HiveQL.
SQL (left) vs NoSQL (right)
Source: Google Images
NEW STRUCTURED QUERY LANGUAGE
NewSQL
• Relational + NoSQL
• Designed for web-scale applications
• Provides many of the traditional SQL
operations
A class of modern relational database management systems that seek to
provide the same scalable performance as NoSQL systems for online
transaction processing (OLTP) read-write workloads while still
maintaining the ACID guarantees of a traditional database system.
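To make the ACID guarantee concrete, here is a minimal sketch of atomicity using Python's built-in sqlite3 module. It is an illustration only, not a NewSQL system; the accounts table, the names, and the transfer helper are hypothetical.

```python
import sqlite3

# Hypothetical accounts table; the CHECK constraint forbids negative balances.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(connection, source, target, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with connection:  # commits on success, rolls back if an exception is raised
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(connection):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(connection.execute("SELECT name, balance FROM accounts"))
```

If the debit would violate the constraint, the credit that was already applied is rolled back as well, so the table never shows a half-finished transfer.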
RELATIONAL DATABASES SQL
• Structured Query Language (SQL)
• Consists of two or more tables with columns and
rows
• The relationship between tables and field types is called
a schema
• SQL is a programming language used by database
architects to design relational databases (MySQL, Sybase,
Oracle, IBM DB2, or MS SQL Server).
• These databases are well understood and widely
supported
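As a minimal sketch of tables, a schema, and a join, the example below uses Python's built-in sqlite3 module. The patients and readings tables and their columns are hypothetical and serve only to illustrate the idea.

```python
import sqlite3

# Hypothetical two-table schema; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE readings (
        reading_id INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
        heart_rate INTEGER NOT NULL
    );
    """
)
conn.executemany("INSERT INTO patients VALUES (?, ?)", [(1, "Ada"), (2, "Ben")])
conn.executemany(
    "INSERT INTO readings (patient_id, heart_rate) VALUES (?, ?)",
    [(1, 72), (1, 80), (2, 65)],
)


def average_heart_rates(connection):
    """Join the two tables on patient_id and average the readings per patient."""
    return connection.execute(
        """
        SELECT p.name, AVG(r.heart_rate)
        FROM patients AS p
        JOIN readings AS r ON r.patient_id = p.patient_id
        GROUP BY p.name
        ORDER BY p.name
        """
    ).fetchall()
```

The column types and the link between readings.patient_id and patients.patient_id are what the slide calls the schema.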
Popular SQL databases and RDBMSs
• MySQL: the most popular open-source database
• Oracle: an object-relational DBMS written in C and C++.
• IBM DB2: a family of database server products from IBM that are
built to handle advanced “big data” analytics.
• Sybase: a relational model database server product for
businesses, primarily used on the Unix OS and Linux
• MS SQL Server: a Microsoft-developed RDBMS for enterprise-level
databases that supports both SQL and NoSQL architectures.
• Microsoft Azure: a cloud computing platform that supports any
operating system, and lets you store, compute, and scale data
• MariaDB: an enhanced, drop-in version of MySQL.
• PostgreSQL: an enterprise-level, object-relational DBMS that
supports procedural languages such as Perl and Python.
NOSQL DATABASES
• Easy to access
• Greater flexibility
• Document-oriented data
• Massive amounts of data
• Unclear data requirements
• Data includes: sensor data, social sharing, personal
settings, photos, location-based information, online
activity, usage metrics, etc.
Source: Upwork
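The flexibility of document-oriented data can be sketched with plain Python dictionaries serialized as JSON. This is not the API of any particular NoSQL product; the documents, field names, and the find helper are hypothetical.

```python
import json

# Hypothetical documents; each one carries its own set of fields.
documents = [
    {"_id": 1, "type": "sensor", "device": "hr-monitor", "heart_rate": 72},
    {"_id": 2, "type": "photo", "tags": ["beach", "family"],
     "location": {"lat": 35.2, "lon": 33.4}},
    {"_id": 3, "type": "settings", "theme": "dark"},
]


def find(docs, **criteria):
    """Return the documents whose top-level fields match every criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Documents are typically stored and exchanged as JSON text.
serialized = [json.dumps(d) for d in documents]
```

No table definition had to be changed to store a photo next to a sensor reading, which is the flexibility the list above refers to.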
POPULAR NOSQL DATABASES
• MongoDB: the most popular NoSQL system
• Apache’s CouchDB: a true DB for the web; it uses the
JSON data exchange format to store its documents
• HBase: another Apache project, developed as part of
Hadoop; an open-source, non-relational “column
store”
• Oracle NoSQL: Oracle’s entry into the NoSQL category.
• Apache’s Cassandra DB: born at Facebook, it handles
massive amounts of structured data. Users include
Instagram, Comcast, Apple, and Spotify.
• Riak: has fault-tolerant replication and automatic
data distribution built in for excellent performance.
SQL
Pros
• Relational databases work with structured data.
• They support ACID (Atomicity, Consistency,
Isolation, Durability) transactional consistency.
• They come with built-in data integrity and a
large ecosystem.
• Relationships in this system have constraints.
• There is limitless indexing.
• Strong SQL.
Cons
• Relational databases do not scale out
horizontally very well (concurrency and data
size), only vertically.
• Data is normalized, meaning lots of joins, which
affects speed.
• They have problems working with semi-structured
data.
NoSQL
Pros
• They scale out horizontally and work with
unstructured and semi-structured data.
• Some support ACID transactional
consistency.
• Schema-free or schema-on-read options.
• High availability, though with language training, setup,
and development costs.
• Many databases are open source and so “free”.
• Numerous commercial products available.
Cons
• Data is denormalized, requiring mass updates
(e.g. a product name change).
• Weaker or eventual consistency instead of
ACID.
• Does not have built-in data integrity (must
do in code).
• Limited support.
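The mass-update cost of denormalized data can be sketched as follows. The order documents and the rename_product helper are hypothetical; the point is only that a value copied into many documents has to be changed in every one of them.

```python
# Hypothetical denormalized order documents; each one embeds the product name.
orders = [
    {"order_id": 1, "product_id": "p1", "product_name": "Widget"},
    {"order_id": 2, "product_id": "p2", "product_name": "Gadget"},
    {"order_id": 3, "product_id": "p1", "product_name": "Widget"},
]


def rename_product(docs, product_id, new_name):
    """Update every document that embeds the product; return how many changed."""
    changed = 0
    for doc in docs:
        if doc["product_id"] == product_id:
            doc["product_name"] = new_name
            changed += 1
    return changed
```

In a normalized relational design the name would live in a single products row, so the same rename would be one update, at the price of a join at query time.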
Hadoop
• Facebook, Google, Yahoo, Amazon, and Microsoft
• Exponential growth of data
• Doug Cutting developed an open-source version of the
MapReduce system, called Hadoop
• Hadoop is a software ecosystem that allows for
massively parallel computing
• A large data job that might take 20 hours of
processing time on a relational database may take
only 3 minutes with Hadoop
• Hive's query language, HQL, looks like traditional SQL
Hadoop clusters on Client computers
Hive is not
• A relational database
• A design for online transaction processing
(OLTP)
• A language for real-time queries and row-level
updates
FUNCTIONS OF HIVE ON HADOOP
• Data warehouse system built on top of Hadoop
• Takes advantage of Hadoop's processing power
• Facilitates data summarization, ad-hoc queries, and
analysis of large datasets stored in Hadoop
• Provides a SQL interface (known as Hive QL, or HQL)
which is familiar to most programmers
• Saves time compared with writing Hadoop MapReduce programs
• Provides a mechanism to project structure onto
Hadoop datasets
• Loads data fast and allows flexibility, at the cost of query
time
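Projecting structure onto stored data, often called schema-on-read, can be sketched in plain Python. This is not Hive code; the raw text, the schema, and the read_with_schema helper are hypothetical. The data is stored untouched and is only parsed and typed when it is read.

```python
import csv
import io

# Hypothetical raw file contents, stored as-is with no schema enforced on load.
RAW = "1\tAda\t72\n2\tBen\t65\n3\tCy\tn/a\n"

# Hypothetical schema, applied only when the data is read.
SCHEMA = [("patient_id", int), ("name", str), ("heart_rate", int)]


def read_with_schema(raw_text, schema):
    """Parse tab-separated text and cast each field; bad values become None."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text), delimiter="\t"):
        row = {}
        for (column, cast), value in zip(schema, fields):
            try:
                row[column] = cast(value)
            except ValueError:
                row[column] = None  # value does not fit the declared type
        rows.append(row)
    return rows
```

Loading is fast because nothing is validated up front; the parsing and casting work is paid on every read, which is the query-time cost mentioned in the last bullet.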
Apache frameworks
• Sqoop: It is used to import and export data
between HDFS and an RDBMS.
• Pig: It is a procedural language platform used
to develop scripts for MapReduce operations.
• Hive: It is a platform used to develop SQL-type
scripts to do MapReduce operations
Hive vs Java and Pig
Java
• Word count MapReduce example: list the words in a
document and the number of occurrences of each
• Java takes 63 lines of code to write this; Hive
takes only 7 easy lines of code
Pig
• High-level programming language
• Good for ETL
• Powerful transformation capabilities
• Often used in combination with Hive
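The word count example follows the map, shuffle, and reduce pattern. The sketch below runs in a single Python process purely to show the logic; it is neither the 63-line Java program nor the 7-line Hive script mentioned above, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(document):
    """Run the three phases in sequence on one document."""
    return reduce_phase(shuffle_phase(map_phase(document)))
```

On a Hadoop cluster the map and reduce steps run in parallel across many machines, and the shuffle moves the pairs between them over the network.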
Hive Architecture
HIVE DIRECTORY STRUCTURE
• Lib directory
– $HIVE_HOME/lib
– Location of the Hive JAR files
– Contains the actual Java code that implements the Hive
functionality
• Bin directory
– $HIVE_HOME/bin
– Location of Hive scripts/services
• Conf directory
– $HIVE_HOME/conf
– Location of configuration files
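A small sketch of how a script might resolve these three directories from the HIVE_HOME environment variable. The hive_directories helper and the /opt/hive fallback path are assumptions for illustration, not part of Hive.

```python
import os
from pathlib import Path


def hive_directories(hive_home=None):
    """Return the lib, bin, and conf paths under the Hive installation directory.

    Uses the given path, else the HIVE_HOME environment variable, else the
    hypothetical default /opt/hive.
    """
    home = Path(hive_home or os.environ.get("HIVE_HOME", "/opt/hive"))
    return {name: home / name for name in ("lib", "bin", "conf")}
```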
Summary & Conclusion
• Hive is a data warehouse infrastructure tool to process
structured data in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and
makes querying and analyzing easy.
• Hive was initially developed by Facebook; later the
Apache Software Foundation took it up and
developed it further as open source under the
name Apache Hive.
• It is used by different companies. For example,
Amazon uses it in Amazon Elastic MapReduce.
REFERENCES
• http://www.dataversity.net/review-pros-cons-different-databases-relational-versus-non-relational/
• https://segment.com/blog/choosing-a-database-for-analytics/
• https://www.upwork.com/hiring/data/sql-vs-nosql-databases-whats-the-difference/
DON’T THANK ME THANK HIVE
