BRINGING SQL TO HADOOP
In our recent post, 8 SQL-on-Hadoop challenges, we quickly listed several
tools that help to bridge the gap between the two technologies without
going into details. This time we’ll dive in and learn about 12 tools that
bring SQL to Hadoop in various ways.
Initially developed by Facebook, Apache Hive is a data warehouse
infrastructure that is built on top of Hadoop. It allows querying data
stored on HDFS for analysis via HQL, an SQL-like language that is
translated to MapReduce jobs. Although it seems to provide SQL
functionality, Hive performs batch processing on Hadoop and does not
provide interactive querying. It stores metadata in a relational database
and requires maintaining a schema for the data. Only four file formats are
supported by Hive: text, SequenceFile, ORC and RCFile. Hive supports
processing compressed data on Hadoop and also user defined functions.
▪ Bottom line - batch processing on Hadoop with an SQL like language
Apache Sqoop allows importing and exporting data from relational
databases to Hadoop via JDBC, the standard API for connecting to
databases with Java. It can also work without JDBC as long as the relevant
tools allow bulk import/export of data. Sqoop works by running a query
on the relational database and exporting the resulting rows into files in
either one of these formats: text, binary, Avro, or Sequence Files. These
files can then be saved on Hadoop’s HDFS. They can also be exported from
Hadoop back into a relational database. Finally, Sqoop integrates with
HCatalog, a table and storage management service for Hadoop that allows
querying Sqoop’s imported files via Hive or Pig. See our Sqoop blog
post for more info.
▪ Bottom line - import/export data from SQL databases to/from Apache
BigSQL is a pre-made package of PostgreSQL and Hadoop that you can
easily download and install to try out on your local machine. Aside from
Apache Hadoop and PostgreSQL, it also includes Cassandra, Tez, Hive,
Zookeeper, and HadoopFDW. Extra components such as Pig, Sqoop, and
HBase can be downloaded additionally.
▪ Bottom line - pre-made package for trying out Hadoop with PostgreSQL
on your machine
While other tools provide SQL-like syntax, Cascading’s Lingual claims to
provide a full ANSI SQL interface for Hadoop, thus allowing for easier
integration with existing BI tools and helping SQL skilled personnel to use
Hadoop immediately. Lingual supports JDBC and also includes an SQL
shell. Despite the SQL interface, it still executes queries on Hadoop in
▪ Bottom line - ANSI SQL interface for Hadoop
Apache Phoenix is an SQL skin for interactive queries over HBase. It
compiles SQL queries into a series of HBase scans and produces JDBC
result sets. Note that it requires maintaining a schema which could be
built from scratch or mapped from an existing HBase table. Furthermore,
there are several features Phoenix doesn’t support: full transaction
support, derived tables, relational operators, and misc built-in functions
(although they can be added manually). The project is mainly maintained
by Salesforce, Intel, and Hortonworks.
▪ Bottom line - interactive SQL over HBase
Cloudera’s Impala is a query engine that runs on top of Hadoop and
executes interactive SQL queries on HDFS and HBase. While Hive runs in
batch processing, Impala runs the queries in real-time, thus integrating
SQL based business intelligence tools with Hadoop. Although Cloudera is
the main developer behind this tool, it is fully open source and supports
the following file formats: text, LZO, SequenceFile, Avro and RCFile.
Impala can also run on the cloud via Amazon’s Elastic MapReduce.
▪ Bottom line - Cloudera’s solution for interactive SQL queries over HDFS
Presto is also an interactive SQL query engine. It runs on top of Hive,
HBase, and even relational databases and proprietary data stores, thus
combining data from multiple sources across the organization. Facebook is
the main developer behind Presto and the company uses it to query
internal data stores, including a 300PB data warehouse. Airbnb and
Dropbox also use Presto, so it seems tried and tested for the enterprise.
▪ Bottom line - Facebook’s solution for interactive SQL queries over Hive
CitusDB (not to be confused with CitrusDB) is another interactive querying
engine with SQL-like functionality that works over Hadoop. It’s based on
Dremel, Google’s version of a real-time analytics database to process Big
Data, and unlike Impala and Presto it uses PostgreSQL as the SQL engine
that works behind the scenes. CitusDB can run on-premise or in the cloud
and supports features such as full-text search and geo search as well as
ODBC/JDBC compatibility. However, being an analytical database it only
supports loading the data in batches.
▪ Bottom line - SQL on Hadoop interactive querying with PostgreSQL
InfiniDB is a columnar database that integrates with HDFS to perform real-
time analytics on Hadoop with MySQL compatibility. The data is stored in
their own columnar format on disk with support for MySQL’s major data
types. Other formats and non-relational data structures aren’t supported,
although Parquet is on the long term road map. They recently
ran benchmarks against other open source SQL-on-Hadoop engines and
claim to have much better performance than Hive and Presto. InfiniDB
also supports windowing functions for analytics.
Hadapt is a commercial product that brings a native SQL implementation
to Hadoop. Because it combines Hadoop with a storage layer of a
relational database, it allows querying Hadoop via SQL interactively rather
than as a batch process. They can handle structured and unstructured
data without a predefined schema.
▪ Bottom line - interactive SQL querying on Hadoop
Jethro claims the title of "fastest SQL on Hadoop" by providing an SQL
engine for Hadoop that automatically indexes the data as soon as it is
written to Hadoop. According to them, it executes queries 100 times
faster than Hive and 10 times faster than Impala. Jethro can be added to
an existing Hadoop cluster and is supposed to be non-intrusive and it isn’t
installed on any of the Hadoop storage nodes.
▪ Bottom line - fast non-intrusive SQL-on-Hadoop via auto-indexing
HAWQ (HAdoop With Query) is a commercial SQL-on-Hadoop platform by
Pivotal, a subsidiary of EMC. It provides a parallel SQL query engine using
Pivotal’s Greenplum Analytic Database and Hadoop’s HDFS for data
storage. This engine is supposed to be useful for analytics with full
transaction support and supports creating external tables on HDFS that
read text, Hive, HBase, and soon Parquet. Pivotal received
some criticism about a year ago that this is not a true Hadoop product
because they claim to have over 300 engineers working on Hadoop, yet
none of them contribute to any of the Hadoop related projects. As these
lines are written, that’s still true.
▪ Bottom line - Pivotal’s SQL-on-Hadoop