Unleash Your Potential - Namagunga Girls Coding Club
Hive
1. Apache Hive :
• Open sourced software, developed by Apache (Released initial
version in 2012).
• Data warehouse software built on top of HDFS system, used for
data summarization, ad-hoc querying and managing large datasets.
• Database structure - analogous to RDBMS.
• Data distribution - tables are divided into ‘partitions’, which are
further divided into ‘buckets’.
• HiveQL (SQL like language ) used for data analysis.
• Does not not support OLTP / real-time queries / row-level updates.
• Implemented in Java – supports C++, PHP, Python.
• Metadata storage makes the data look up easy.
Editor's Notes
Hive (an Apache project) is a data warehouse software built over hadoop (HDFS) and designed for scalable data analysis, querying and managing large amounts of data. Hive’s initial version was released in 2012.
Data structure is similar to traditional relational database with tables, which supports queries like Unions, Joins, select queries, sub- queries. HiveQL, a query language similar to SQL is used to query the HIVE database. HiveQL supports DDL operations ( Creation / Insertion of records) and does not allow DML operations ( deletions / update of records). Access methods such as CLI,JDBC,ODBC & Thrift are used to access the data in Apache Hive Framework and interface with the external system for business continuity.
In Hive, the basic structural unit is a table which has an associated HDFS directory. Each table has more than one partitions and the data is distributed within HDFS through sub-directories. Each partition is further divided into buckets that represents hash of a column data and stored as a file inside a partition.
Hive is popularly used by organization for data mining, ad-hoc querying, reporting and research on latest trends. Mostly used for batch jobs over large sets of append only data like Web-logs,etc. Prominently known for scalability, extensibility, fault tolerance and loose coupling with its input format.