This document discusses several key differences between traditional databases and Hive. Hive uses a schema-on-read model in which the schema is not enforced during data loading, making the initial load much faster; the trade-off is query-time performance, since the data cannot be indexed or compressed as it is loaded. Pig Latin is a data flow language in which each step is a single transformation of a relation, unlike SQL, which is declarative. While Hive originally lacked features such as updates, transactions, and indexing, the Hive team is working to improve support for them, and HBase integration is expected to help.
2. • Although Hive resembles a traditional database in many ways (such as supporting an SQL interface), its HDFS and MapReduce underpinnings mean that there are a number of architectural differences that directly influence the features Hive supports, which in turn affects the uses to which Hive can be put.
3. • In a traditional database, a table's schema is enforced at data load time. If the data being loaded doesn't conform to the schema, it is rejected. This design is sometimes called schema on write, since the data is checked against the schema when it is written into the database.
• Hive, on the other hand, doesn't verify the data when it is loaded, but rather when a query is issued. This is called schema on read. There are trade-offs between the two approaches. Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database's internal format.
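A minimal HiveQL sketch of schema on read (the table, columns, and file path here are hypothetical): the LOAD DATA step only moves the file into Hive's warehouse directory, and nothing is parsed or validated until a query runs:

-- Declare a schema; it is applied when the data is read, not when it is loaded.
CREATE TABLE records (year INT, temperature INT, quality INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Essentially a file move in HDFS: fast, with no validation against the schema.
LOAD DATA INPATH '/user/hadoop/input/sample.txt'
OVERWRITE INTO TABLE records;
-- Rows that don't match the schema are not rejected; they surface as NULLs at query time.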
4. • Having seen Pig in action, it might seem that Pig Latin is similar to SQL. The presence of such operators as GROUP BY and DESCRIBE reinforces this impression. However, there are several differences between the two languages, and between Pig and RDBMSs in general.
• The most significant difference is that Pig Latin is a data flow programming language, whereas SQL is a declarative programming language. In other words, a Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation.
• In many ways, writing Pig Latin is like working at the level of an RDBMS query planner, which figures out how to turn a declarative statement into a system of steps.
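As a rough contrast (file and field names hypothetical), a max-temperature computation that a single declarative SELECT ... GROUP BY would express in SQL becomes an explicit chain of relations in Pig Latin:

-- Each statement names a new relation derived from the previous one.
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);
filtered_records = FILTER records BY temperature IS NOT NULL;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;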
5. • In Hive, the load operation is just a file copy or move. Schema on read is more flexible, too: consider having two schemas for the same underlying data, depending on the analysis being performed. (This is possible in Hive using external tables; see "Managed Tables and External Tables" and the sketch at the end of this point.)
• Schema on write makes query-time performance faster, since the database can index columns and perform compression on the data. The trade-off, however, is that it takes longer to load data into the database. Furthermore, there are many scenarios where the schema is not known at load time, so there are no indexes to apply, since the queries have not been formulated yet. These scenarios are where Hive shines.
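Returning to the two-schemas idea above, a minimal HiveQL sketch (the table names, columns, and location are hypothetical) of two external tables imposing different schemas on the same files:

-- Both tables point at the same HDFS directory; dropping either one
-- removes only Hive metadata and leaves the underlying data in place.
CREATE EXTERNAL TABLE logs_raw (line STRING)
LOCATION '/data/logs';

CREATE EXTERNAL TABLE logs_fields (ts STRING, level STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';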
6. • Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered part of Hive's feature set.
• This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table.
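A sketch of that update-by-transformation pattern in HiveQL (the table and column names are hypothetical, and users_clean is assumed to already exist):

-- No UPDATE ... SET here: the data is rewritten wholesale into another table.
INSERT OVERWRITE TABLE users_clean
SELECT id, TRIM(name), email
FROM users_raw
WHERE email IS NOT NULL;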
7. • On the transactions front, Hive doesn't define clear semantics for concurrent access to tables, which means applications need to build their own application-level concurrency or locking mechanism.
• The Hive team is actively working on improvements in all these areas. Change is also coming from another direction: HBase integration. HBase (see the HBase chapter) has different storage characteristics from HDFS, such as the ability to do row updates and column indexing, so we can expect to see these features used by Hive in future releases. HBase integration with Hive is still in the early stages of development.
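As an early taste of that integration, Hive's HBase storage handler lets a Hive table be backed by an HBase table; a minimal sketch (the table name and column mapping are illustrative):

CREATE TABLE hbase_backed (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "hbase_backed");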
8. Analytical data warehouses and data marts:
• After a company sorts through the massive amounts of data available, it is often pragmatic to take the subset of data that reveals patterns and put it into a form that's available to the business.
• These warehouses and marts provide compression, multilevel partitioning, and a massively parallel processing architecture.
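In Hive, for instance, multilevel partitioning plus a compressed columnar format might be declared as below (the table, columns, and choice of RCFile are illustrative, not prescribed by the text):

-- Two partition columns give multilevel (year/month) partitioning on disk.
CREATE TABLE sales (item STRING, amount DOUBLE)
PARTITIONED BY (year INT, month INT)
STORED AS RCFILE;  -- column-oriented format, typically used with compression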
Big data analytics:
• The capability to manage and analyze petabytes of data enables companies to deal with clusters of information that could have an impact on the business.
• This requires analytical engines that can manage this highly distributed data and provide results that can be optimized to solve a business problem. Analytics can get quite complex with big data.
9. Reporting and visualization:
• Organizations have always relied on the capability to create reports to give them an understanding of what the data tells them about everything from monthly sales figures to projections of growth.
• Big data changes the way that data is managed and used. If a company can collect, manage, and analyze enough data, it can use a new generation of tools to help management truly understand the impact not just of a collection of data elements but also of how these data elements offer context based on the business problem being addressed.
• With big data, reporting and data visualization become tools for looking at the context of how data is related and the impact of those relationships on the future.
10. Big data applications:
• Traditionally, the business expected that data would be used to answer questions about what to do and when to do it. Data was often integrated as fields into general-purpose business applications.
• With the advent of big data, this is changing. Now we are seeing the development of applications that are designed specifically to take advantage of the unique characteristics of big data.
• Some of the emerging applications are in areas such as healthcare, manufacturing management, and traffic management.
• They rely on huge volumes, velocities, and varieties of data to transform the behavior of a market. In healthcare, for example, a big data application might monitor premature infants to determine when the data indicates that intervention is needed.
11. Pig Latin:
• This section gives an informal description of the syntax and semantics of the Pig Latin programming language.
• It is not meant to offer a complete reference to the language, but there should be enough here for you to get a good understanding of Pig Latin's constructs.
• Pig's support for complex, nested data structures differentiates it from SQL, which operates on flatter data structures.
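To make the nesting concrete (field names hypothetical), grouping in Pig yields a relation whose second field is a bag of tuples, a structure a flat SQL table cannot hold directly:

records = LOAD 'input/sample.txt' AS (year:int, temperature:int);
grouped_records = GROUP records BY year;
DESCRIBE grouped_records;
-- Expected shape: grouped_records: {group: int, records: {(year: int, temperature: int)}}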
12. Structure:
A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example, a GROUP operation is a type of statement:
grouped_records = GROUP records BY year;
• Statements are usually terminated with a semicolon, as in the example of the GROUP statement. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it. The ls command, on the other hand, does not have to be terminated with a semicolon. As a general guideline, statements or commands for interactive use in Grunt do not need the terminating semicolon.
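A short Grunt session illustrating the guideline (the path is hypothetical): the Pig Latin statements carry semicolons, while the interactive ls command runs without one:

grunt> records = LOAD 'input/sample.txt' AS (year:int, temperature:int);
grunt> grouped_records = GROUP records BY year;
grunt> ls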