Hive Quick Start

  © 2010 Cloudera, Inc.
Background
•  Started at Facebook	
•  Data was collected by nightly cron jobs into Oracle DB
•  “ETL” via hand-coded Python
•  Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that


Hadoop as Enterprise Data Warehouse
•  Scribe and MySQL data loaded into Hadoop HDFS
•  Hadoop MapReduce jobs to process data
•  Missing components:	
 –  Command-line interface for “end users”
 –  Ad-hoc query support
   • … without writing full MapReduce jobs
 –  Schema information
Hive Applications
•  Log processing	
•  Text mining	
•  Document indexing	
•  Customer-facing business intelligence (e.g., Google Analytics)
•  Predictive modeling, hypothesis testing
Hive Architecture




Data Model
•  Tables	
  –  Typed columns (int, float, string, date, boolean)
  –  Also, array/map/struct for JSON-like data
•  Partitions	
  –  e.g., to range-partition tables by date
•  Buckets	
  –  Hash partitions within ranges (useful for sampling, join optimization)
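
As a rough illustration of the bucketing idea above, the sketch below assigns values to buckets by hashing them and taking the result modulo the bucket count. The Java-style string hash, the function names, and the bucket count of 4 are assumptions made for this example; the hash Hive actually applies depends on the column type and the Hive version.

```python
def java_string_hash(s: str) -> int:
    """Java-style String.hashCode() for BMP characters (a stand-in hash only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the low 32 bits
    # Reinterpret as a signed 32-bit integer, as Java would.
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value: str, num_buckets: int) -> int:
    """Map a value to a bucket index in the range [0, num_buckets)."""
    # Mask off the sign bit so the modulo is taken on a non-negative number.
    return (java_string_hash(value) & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    for user in ["alice", "bob", "carol"]:
        print(user, "->", bucket_for(user, 4))
```

Because the same value always lands in the same bucket, reading a single bucket gives a repeatable sample, and two tables bucketed the same way on the join key can be joined bucket by bucket.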

Column Data Types
                        
CREATE TABLE t (
  s STRING,
  f FLOAT,
  a ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>
);

SELECT s, f, a[0]['foobar'].p2 FROM t;
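
To make the nested column type easier to picture, here is a small Python model of one row of `t` and of the path the SELECT walks through it. The `Point` class and the sample values are invented for illustration; only the column names and the key `'foobar'` come from the slide.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """Stand-in for STRUCT<p1:INT, p2:INT>."""
    p1: int
    p2: int


# One row of table t: s STRING, f FLOAT, a ARRAY<MAP<STRING, STRUCT<...>>>
row = {
    "s": "hello",
    "f": 1.5,
    "a": [{"foobar": Point(p1=10, p2=20)}],
}

# Roughly what `SELECT s, f, a[0]['foobar'].p2 FROM t` reads from this row:
# array index 0, then map key 'foobar', then struct field p2.
selected = (row["s"], row["f"], row["a"][0]["foobar"].p2)

if __name__ == "__main__":
    print(selected)
```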




Metastore
•  Database: namespace containing a set of tables
•  Holds Table/Partition definitions (column types, mappings to HDFS directories)
•  Statistics
•  Implemented with DataNucleus ORM. Runs on Derby, MySQL, and many other relational databases
Physical Layout
•  Warehouse directory in HDFS	
    –  e.g., /user/hive/warehouse
•  Table row data stored in subdirectories of warehouse
•  Partitions form subdirectories of table directories
•  Actual data stored in flat files
    –  Control char-delimited text, or SequenceFiles
    –  With custom SerDe, can use arbitrary format
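
The directory layout described above can be sketched as a pair of path helpers. The function names are hypothetical, and the `column=value` naming of partition directories follows the partitioning example later in the deck; escaping of special characters in partition values is not handled here.

```python
import posixpath

# Example warehouse location from the slide; a real install may differ.
WAREHOUSE_DIR = "/user/hive/warehouse"


def table_dir(table: str, warehouse: str = WAREHOUSE_DIR) -> str:
    """Directory holding a table's data files."""
    return posixpath.join(warehouse, table)


def partition_dir(table, partition_spec, warehouse=WAREHOUSE_DIR):
    """Directory for one partition.

    partition_spec is an ordered list of (column, value) pairs, with each
    pair adding one directory level below the table directory.
    """
    levels = [f"{col}={val}" for col, val in partition_spec]
    return posixpath.join(table_dir(table, warehouse), *levels)


if __name__ == "__main__":
    print(table_dir("foo"))
    print(partition_dir("foo", [("dt", "2009-03-20")]))
```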
Installing Hive
                         
From a Release Tarball:	

 $ wget http://archive.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz
 $ tar xvzf hive-0.5.0-bin.tar.gz
 $ cd hive-0.5.0-bin
 $ export HIVE_HOME=$PWD
 $ export PATH=$HIVE_HOME/bin:$PATH


Installing Hive
                          
Building from Source:	
  $ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
  $ cd hive
  $ ant package
  $ cd build/dist
  $ export HIVE_HOME=$PWD
  $ export PATH=$HIVE_HOME/bin:$PATH



Installing Hive
                          
Other Options:	
•  Use a Git Mirror:	
  –  git://github.com/apache/hive.git


•  Cloudera Hive Packages	
  –  Red Hat and Debian
  –  Packages include backported patches
  –  See archive.cloudera.com

Hive Dependencies
                      
•  Java 1.6	
•  Hadoop 0.17-0.20	
•  Hive *MUST* be able to find Hadoop:
 –  export HADOOP_HOME=<hadoop-install-dir>
 –  Add $HADOOP_HOME/bin to $PATH




Hive Dependencies
                             
•  Hive needs r/w access to /tmp and /user/hive/warehouse on HDFS:

$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse



Hive Configuration
•  Default configuration in $HIVE_HOME/conf/hive-default.xml
    –  DO NOT TOUCH THIS FILE!

•  (Re)define properties in $HIVE_HOME/conf/hive-site.xml

•  Use $HIVE_CONF_DIR to specify alternate conf dir location
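
The relationship between the two files can be pictured as a simple overlay: start from the defaults, then let any property in the site file replace the default of the same name. The sketch below assumes the usual Hadoop-style `<configuration>/<property>/<name>/<value>` XML layout; the helper names and the sample property values are illustrative, and this is not how Hive itself loads its configuration.

```python
import xml.etree.ElementTree as ET


def parse_properties(xml_text: str) -> dict:
    """Read <property><name>..</name><value>..</value></property> entries."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


def effective_config(default_xml: str, site_xml: str) -> dict:
    """Defaults first, then site values override entries with the same name."""
    merged = parse_properties(default_xml)
    merged.update(parse_properties(site_xml))
    return merged


# Sample file contents, made up for this example.
DEFAULT_XML = """<configuration>
  <property><name>hive.metastore.warehouse.dir</name><value>/user/hive/warehouse</value></property>
  <property><name>mapred.reduce.tasks</name><value>-1</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>mapred.reduce.tasks</name><value>1</value></property>
</configuration>"""

if __name__ == "__main__":
    for key, value in sorted(effective_config(DEFAULT_XML, SITE_XML).items()):
        print(f"{key}={value}")
```

Keeping overrides in the site file means the defaults file can be replaced wholesale on upgrade without losing local settings.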

Hive Configuration
•  You can override Hadoop configuration properties in Hive’s configuration, e.g.:
 – mapred.reduce.tasks=1




Logging
•  Hive uses log4j	
•  Log4j configuration located in $HIVE_HOME/conf/hive-log4j.properties
•  Logs are stored in /tmp/${user.name}/hive.log



Starting the Hive CLI
•  Start a terminal and run	
 $ hive


•  Should see a prompt like:	
 hive>




Hive CLI Commands
                      
•  Set a Hive or Hadoop conf prop:	
 – hive> set propkey=value;
•  List all properties and values:	
 – hive> set -v;
•  Add a resource to the distributed cache:
 – hive> add [ARCHIVE|FILE|JAR] filename;


Hive CLI Commands
                      
•  List tables:	
  – hive> show tables;
•  Describe a table:	
  – hive> describe <tablename>;
•  More information:	
  – hive> describe extended <tablename>;


Hive CLI Commands
                      
•  List Functions:	
  – hive> show functions;
•  More information:	
  – hive> describe function <functionname>;




Selecting data
hive> SELECT * FROM <tablename> LIMIT 10;

hive> SELECT * FROM <tablename>
  WHERE freq > 100 SORT BY freq ASC
  LIMIT 10;
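
For readers new to HiveQL, the second query above behaves roughly like the following filter, sort, and slice over in-memory rows. The column name `freq` comes from the slide; the function name and sample rows are made up. The sketch also assumes a single reducer: SORT BY orders rows within each reducer, so with several reducers the output is not guaranteed to be totally ordered.

```python
def select_where_sort_limit(rows, min_freq=100, limit=10):
    """Emulate: SELECT * ... WHERE freq > min_freq SORT BY freq ASC LIMIT limit."""
    matching = [r for r in rows if r["freq"] > min_freq]  # WHERE freq > 100
    matching.sort(key=lambda r: r["freq"])                # SORT BY freq ASC
    return matching[:limit]                               # LIMIT 10


if __name__ == "__main__":
    sample = [
        {"word": "a", "freq": 50},
        {"word": "b", "freq": 300},
        {"word": "c", "freq": 150},
        {"word": "d", "freq": 100},
    ]
    for r in select_where_sort_limit(sample):
        print(r)
```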




Manipulating Tables
                         
•  DDL operations	
  –  SHOW TABLES
  –  CREATE TABLE
  –  ALTER TABLE
  –  DROP TABLE




Creating Tables in Hive
                             
•  Most straightforward:	

CREATE TABLE foo(id INT, msg STRING);

•  Assumes default table layout	
  –  Text files; fields terminated with ^A, lines terminated with \n
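
A data file in this default layout can be read with a few lines of Python, which may help when inspecting files by hand. The sketch assumes the two-column table `foo(id INT, msg STRING)` from the slide, treats ^A as the byte 0x01, and ignores NULL handling and escaping; `parse_rows` is a hypothetical helper name.

```python
FIELD_DELIM = "\x01"  # Ctrl-A (^A), the default field terminator
LINE_DELIM = "\n"     # the default line terminator


def parse_rows(data: str):
    """Split raw text into (id, msg) tuples for a table foo(id INT, msg STRING)."""
    rows = []
    for line in data.split(LINE_DELIM):
        if not line:
            continue  # skip the empty piece after the final newline
        id_text, msg = line.split(FIELD_DELIM, 1)
        rows.append((int(id_text), msg))
    return rows


if __name__ == "__main__":
    raw = "1\x01hello\n2\x01world\n"
    print(parse_rows(raw))
```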




Changing Row Format
                       
•  Arbitrary field and record separators are possible, e.g., CSV format:

 CREATE TABLE foo(id INT, msg STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n';



Partitioning Data
                          
•  One or more partition columns may be specified:

 CREATE TABLE foo (id INT, msg STRING)
 PARTITIONED BY (dt STRING);

•  Creates a subdirectory for each value of the partition column, e.g.:
   /user/hive/warehouse/foo/dt=2009-03-20/

•  Queries with partition columns in the WHERE clause scan only a subset of the data
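
The effect of partition pruning can be sketched by filtering a list of partition directories on the value encoded in the directory name, so that files under non-matching directories are never opened. The directory names follow the `dt=...` pattern from the slide; the function name and the predicate are assumptions for this example, and Hive's planner does this from metastore metadata rather than by listing paths.

```python
def prune_partitions(partition_dirs, column, predicate):
    """Keep only the directories whose `column=value` segment passes predicate."""
    selected = []
    for path in partition_dirs:
        for segment in path.strip("/").split("/"):
            name, sep, value = segment.partition("=")
            if sep and name == column and predicate(value):
                selected.append(path)
                break
    return selected


if __name__ == "__main__":
    dirs = [
        "/user/hive/warehouse/foo/dt=2009-03-19/",
        "/user/hive/warehouse/foo/dt=2009-03-20/",
        "/user/hive/warehouse/foo/dt=2009-03-21/",
    ]
    # Comparable to: ... WHERE dt >= '2009-03-20'
    for d in prune_partitions(dirs, "dt", lambda v: v >= "2009-03-20"):
        print(d)
```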

Sqoop = SQL-to-Hadoop




Sqoop: Features
                             
•  JDBC-based interface (MySQL, Oracle, PostgreSQL, etc.)
•  Automatic datatype generation	
    –  Reads column info from table and generates Java classes
    –  Can be used in further MapReduce processing passes
•  Uses MapReduce to read tables from database	
    –  Can select individual table (or subset of columns)
    –  Can read all tables in database
•  Supports most JDBC standard types and null values	




Example input
                               
mysql> use corp;
Database changed
mysql> describe employees;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| id         | int(11)     | NO   | PRI | NULL    | auto_increment |
| firstname  | varchar(32) | YES  |     | NULL    |                |
| lastname   | varchar(32) | YES  |     | NULL    |                |
| jobtitle   | varchar(64) | YES  |     | NULL    |                |
| start_date | date        | YES  |     | NULL    |                |
| dept_id    | int(11)     | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+




Loading into HDFS


$ sqoop --connect jdbc:mysql://db.foo.com/corp \
     --table employees



•  Imports “employees” table into HDFS directory	




Hive Integration
$ sqoop --connect jdbc:mysql://db.foo.com/corp \
     --hive-import --table employees

•  Auto-generates CREATE TABLE / LOAD DATA INPATH statements for Hive
•  After data is imported to HDFS, auto-executes Hive script
•  Follow-up step: Loading into partitions
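
To give a feel for the kind of script generated in this step, the sketch below builds CREATE TABLE and LOAD DATA INPATH statements from the `employees` column list shown earlier. The type mapping, the function names, and the HDFS path are hypothetical; this is not Sqoop's code, and Sqoop's real type mapping may differ.

```python
# Hypothetical MySQL-to-Hive type mapping, for illustration only.
TYPE_MAP = {"int": "INT", "varchar": "STRING", "date": "STRING"}


def hive_type(mysql_type: str) -> str:
    """Map e.g. 'varchar(32)' to a Hive type, falling back to STRING."""
    base = mysql_type.split("(", 1)[0].lower()
    return TYPE_MAP.get(base, "STRING")


def create_table_statement(table, columns):
    """columns is a list of (name, mysql_type) pairs."""
    cols = ", ".join(f"{name} {hive_type(t)}" for name, t in columns)
    return f"CREATE TABLE {table} ({cols});"


def load_data_statement(table, hdfs_path):
    """Move already-imported HDFS files into the Hive table."""
    return f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table};"


EMPLOYEES = [
    ("id", "int(11)"),
    ("firstname", "varchar(32)"),
    ("lastname", "varchar(32)"),
    ("jobtitle", "varchar(64)"),
    ("start_date", "date"),
    ("dept_id", "int(11)"),
]

if __name__ == "__main__":
    print(create_table_statement("employees", EMPLOYEES))
    # The path below is a placeholder, not a location Sqoop is known to use.
    print(load_data_statement("employees", "/user/someuser/employees"))
```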




Hive Project Status
•  Open source, Apache 2.0 license	
•  Official subproject of Apache Hadoop
•  Current version is 0.5.0	
•  Supports Hadoop 0.17-0.20	




Conclusions

•  Supports rapid iteration of ad-hoc queries
•  High-level interface (HiveQL) to low-level infrastructure (Hadoop)
•  Scales to handle much more data than many similar systems

Hive Resources
                      
Documentation	
•  wiki.apache.org/hadoop/Hive	

Mailing Lists	
•  hive-user@hadoop.apache.org	

IRC	
•  ##hive on Freenode	
Carl Steinbach
carl@cloudera.com
Hive Quick Start Tutorial

 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

Hive Quick Start Tutorial

  • 1. Hive Quick Start © 2010 Cloudera, Inc.
  • 2. Background
      •  Started at Facebook
      •  Data was collected by nightly cron jobs into Oracle DB
      •  “ETL” via hand-coded Python
      •  Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that
  • 3. Hadoop as Enterprise Data Warehouse
      •  Scribe and MySQL data loaded into Hadoop HDFS
      •  Hadoop MapReduce jobs to process data
      •  Missing components:
          –  Command-line interface for “end users”
          –  Ad-hoc query support
              • … without writing full MapReduce jobs
          –  Schema information
  • 4. Hive Applications
      •  Log processing
      •  Text mining
      •  Document indexing
      •  Customer-facing business intelligence (e.g., Google Analytics)
      •  Predictive modeling, hypothesis testing
  • 5. Hive Architecture
  • 6. Data Model
      •  Tables
          –  Typed columns (int, float, string, date, boolean)
          –  Also, array/map/struct for JSON-like data
      •  Partitions
          –  e.g., to range-partition tables by date
      •  Buckets
          –  Hash partitions within ranges (useful for sampling, join optimization)
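The bucketing idea on this slide (hash partitions that make sampling cheap) can be sketched in a few lines of Python. This is an illustration only: `bucket_for` and `sample_bucket` are hypothetical helper names, and CRC32 stands in for whatever hash function Hive actually applies to a bucketing column.

```python
import zlib


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket index using a stable hash (illustrative only)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def sample_bucket(rows, key_index, num_buckets, wanted):
    """Keep only the rows whose key lands in one bucket (roughly a 1/N sample)."""
    return [r for r in rows if bucket_for(r[key_index], num_buckets) == wanted]
```

Because every key maps to exactly one bucket, reading a single bucket gives a repeatable sample without scanning the other buckets, and two tables bucketed the same way on a join key can be joined bucket by bucket.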
  • 7. Column Data Types
      CREATE TABLE t (
        s STRING,
        f FLOAT,
        a ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>
      );

      SELECT s, f, a[0]['foobar'].p2 FROM t;
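To make the nested column type easier to picture, here is a rough Python analogue of the table and query on this slide. The names `Pair`, `Row`, and `select_p2` are invented for the sketch, and handling of missing keys or empty arrays is left out.

```python
from typing import Dict, List, NamedTuple


class Pair(NamedTuple):
    """Stands in for STRUCT<p1:INT, p2:INT>."""
    p1: int
    p2: int


class Row(NamedTuple):
    """Stands in for one row of table t."""
    s: str
    f: float
    a: List[Dict[str, Pair]]  # ARRAY<MAP<STRING, STRUCT<...>>>


def select_p2(rows: List[Row], key: str = "foobar") -> list:
    """Mimic SELECT s, f, a[0]['foobar'].p2 FROM t over in-memory rows."""
    return [(r.s, r.f, r.a[0][key].p2) for r in rows]
```

The access path reads the same way in both notations: index into the array, look up the map key, then pick a struct field.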
  • 8. Metastore
      •  Database: namespace containing a set of tables
      •  Holds Table/Partition definitions (column types, mappings to HDFS directories)
      •  Statistics
      •  Implemented with DataNucleus ORM. Runs on Derby, MySQL, and many other relational databases
  • 9. Physical Layout
      •  Warehouse directory in HDFS
          –  e.g., /user/hive/warehouse
      •  Table row data stored in subdirectories of warehouse
      •  Partitions form subdirectories of table directories
      •  Actual data stored in flat files
          –  Control char-delimited text, or SequenceFiles
          –  With custom SerDe, can use arbitrary format
  • 10. Installing Hive
      From a Release Tarball:
      $ wget http://archive.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz
      $ tar xvzf hive-0.5.0-bin.tar.gz
      $ cd hive-0.5.0-bin
      $ export HIVE_HOME=$PWD
      $ export PATH=$HIVE_HOME/bin:$PATH
  • 11. Installing Hive
      Building from Source:
      $ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
      $ cd hive
      $ ant package
      $ cd build/dist
      $ export HIVE_HOME=$PWD
      $ export PATH=$HIVE_HOME/bin:$PATH
  • 12. Installing Hive
      Other Options:
      •  Use a Git Mirror:
          –  git://github.com/apache/hive.git
      •  Cloudera Hive Packages
          –  Redhat and Debian
          –  Packages include backported patches
          –  See archive.cloudera.com
  • 13. Hive Dependencies
      •  Java 1.6
      •  Hadoop 0.17-0.20
      •  Hive *MUST* be able to find Hadoop:
          –  $HADOOP_HOME=<hadoop-install-dir>
          –  Add $HADOOP_HOME/bin to $PATH
  • 14. Hive Dependencies
      •  Hive needs r/w access to /tmp and /user/hive/warehouse on HDFS:
      $ hadoop fs -mkdir /tmp
      $ hadoop fs -mkdir /user/hive/warehouse
      $ hadoop fs -chmod g+w /tmp
      $ hadoop fs -chmod g+w /user/hive/warehouse
  • 15. Hive Configuration
      •  Default configuration in $HIVE_HOME/conf/hive-default.xml
          –  DO NOT TOUCH THIS FILE!
      •  (Re)define properties in $HIVE_HOME/conf/hive-site.xml
      •  Use $HIVE_CONF_DIR to specify an alternate conf dir location
  • 16. Hive Configuration
      •  You can override Hadoop configuration properties in Hive’s configuration, e.g.:
          –  mapred.reduce.tasks=1
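As a sketch of what an override in hive-site.xml looks like, the Python snippet below generates a minimal file in the Hadoop-style `<configuration>`/`<property>` layout. The helper name `build_site_xml` is made up for illustration; the property shown is the one from the slide.

```python
import xml.etree.ElementTree as ET


def build_site_xml(overrides: dict) -> str:
    """Render property overrides as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Override a Hadoop property from Hive's own configuration file.
    print(build_site_xml({"mapred.reduce.tasks": 1}))
```

Only the properties you want to change need to appear in hive-site.xml; everything else keeps its value from hive-default.xml.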
  • 17. Logging
      •  Hive uses log4j
      •  Log4j configuration located in $HIVE_HOME/conf/hive-log4j.properties
      •  Logs are stored in /tmp/${user.name}/hive.log
  • 18. Starting the Hive CLI
      •  Start a terminal and run
      $ hive
      •  Should see a prompt like:
      hive>
  • 19. Hive CLI Commands
      •  Set a Hive or Hadoop conf prop:
          –  hive> set propkey=value;
      •  List all properties and values:
          –  hive> set -v;
      •  Add a resource to the distributed cache (DCache):
          –  hive> add [ARCHIVE|FILE|JAR] filename;
  • 20. Hive CLI Commands
      •  List tables:
          –  hive> show tables;
      •  Describe a table:
          –  hive> describe <tablename>;
      •  More information:
          –  hive> describe extended <tablename>;
  • 21. Hive CLI Commands
      •  List Functions:
          –  hive> show functions;
      •  More information:
          –  hive> describe function <functionname>;
  • 22. Selecting data
      hive> SELECT * FROM <tablename> LIMIT 10;
      hive> SELECT * FROM <tablename> WHERE freq > 100 SORT BY freq ASC LIMIT 10;
  • 23. Manipulating Tables
      •  DDL operations
          –  SHOW TABLES
          –  CREATE TABLE
          –  ALTER TABLE
          –  DROP TABLE
  • 24. Creating Tables in Hive
      •  Most straightforward:
      CREATE TABLE foo(id INT, msg STRING);
      •  Assumes default table layout
          –  Text files; fields terminated with ^A, lines terminated with \n
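To show what the default layout means on disk, here is a small Python sketch that reads text in that shape for the `foo(id INT, msg STRING)` table: fields separated by ^A (byte 0x01) and rows separated by newlines. The function name `parse_rows` is hypothetical, and NULL handling and escaping are ignored.

```python
FIELD_SEP = "\x01"  # ^A (Ctrl-A), the default field terminator named on the slide
LINE_SEP = "\n"     # default line terminator


def parse_rows(text: str) -> list:
    """Parse ^A-delimited text into (id, msg) tuples for table foo."""
    rows = []
    for line in text.split(LINE_SEP):
        if not line:
            continue  # skip the empty piece after a trailing newline
        id_str, msg = line.split(FIELD_SEP, 1)
        rows.append((int(id_str), msg))
    return rows
```

A control character is a convenient default separator because it almost never appears in ordinary text, so values containing commas or tabs need no quoting.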
  • 25. Changing Row Format
      •  Arbitrary field and record separators are possible, e.g., CSV format:
      CREATE TABLE foo(id INT, msg STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n';
  • 26. Partitioning Data
      •  One or more partition columns may be specified:
      CREATE TABLE foo (id INT, msg STRING)
      PARTITIONED BY (dt STRING);
      •  Creates a subdirectory for each value of the partition column, e.g.:
      /user/hive/warehouse/foo/dt=2009-03-20/
      •  Queries with partition columns in WHERE clause will scan through only a subset of the data
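The directory-per-partition layout, and why a WHERE clause on a partition column can skip data, can be sketched in Python as follows. The helpers `partition_dir` and `prune` are invented for this illustration and only manipulate path strings; they do not touch HDFS.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # warehouse directory from the slides


def partition_dir(table: str, **parts: str) -> str:
    """Build the directory that holds one partition of a table."""
    segments = ["%s=%s" % (col, val) for col, val in parts.items()]
    return posixpath.join(WAREHOUSE, table, *segments)


def prune(dirs: list, column: str, value: str) -> list:
    """Keep only the partition directories that match column=value."""
    wanted = "%s=%s" % (column, value)
    return [d for d in dirs if wanted in d.split("/")]
```

For example, `partition_dir("foo", dt="2009-03-20")` gives `/user/hive/warehouse/foo/dt=2009-03-20`, and pruning a list of such directories on `dt` leaves only the one directory a query for that date would need to read.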
  • 27. Sqoop = SQL-to-Hadoop
  • 28. Sqoop: Features
      •  JDBC-based interface (MySQL, Oracle, PostgreSQL, etc.)
      •  Automatic datatype generation
          –  Reads column info from table and generates Java classes
          –  Can be used in further MapReduce processing passes
      •  Uses MapReduce to read tables from database
          –  Can select individual table (or subset of columns)
          –  Can read all tables in database
      •  Supports most JDBC standard types and null values
  • 29. Example input
      mysql> use corp;
      Database changed
      mysql> describe employees;
      +------------+-------------+------+-----+---------+----------------+
      | Field      | Type        | Null | Key | Default | Extra          |
      +------------+-------------+------+-----+---------+----------------+
      | id         | int(11)     | NO   | PRI | NULL    | auto_increment |
      | firstname  | varchar(32) | YES  |     | NULL    |                |
      | lastname   | varchar(32) | YES  |     | NULL    |                |
      | jobtitle   | varchar(64) | YES  |     | NULL    |                |
      | start_date | date        | YES  |     | NULL    |                |
      | dept_id    | int(11)     | YES  |     | NULL    |                |
      +------------+-------------+------+-----+---------+----------------+
  • 30. Loading into HDFS
      $ sqoop --connect jdbc:mysql://db.foo.com/corp --table employees
      •  Imports “employees” table into HDFS directory
  • 31. Hive Integration
      $ sqoop --connect jdbc:mysql://db.foo.com/corp --hive-import --table employees
      •  Auto-generates CREATE TABLE / LOAD DATA INPATH statements for Hive
      •  After data is imported to HDFS, auto-executes Hive script
      •  Follow-up step: Loading into partitions
  • 32. Hive Project Status
      •  Open source, Apache 2.0 license
      •  Official subproject of Apache Hadoop
      •  Current version is 0.5.0
      •  Supports Hadoop 0.17-0.20
  • 33. Conclusions
      •  Supports rapid iteration of ad-hoc queries
      •  High-level interface (HiveQL) to low-level infrastructure (Hadoop)
      •  Scales to handle much more data than many similar systems
  • 34. Hive Resources
      •  Documentation: wiki.apache.org/hadoop/Hive
      •  Mailing Lists: hive-user@hadoop.apache.org
      •  IRC: ##hive on Freenode