Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training course material: an introduction to the Hadoop ecosystem and to Pig, Hive, and Sqoop.


Presentation Transcript

  • Etu Big Data hands-on advanced enterprise application workshop
  • Agenda: Hadoop and big data processing overview; Sqoop introduction and hands-on; Pig programming and hands-on; Hive programming and hands-on
  • Hadoop and big data processing overview
  • Big data is happening now: structured vs. unstructured, generated quickly, and technically hard to process
  • The Big Data era. Structured: relational databases, files in record format. Semi-structured: XML, logs, click-streams, equipment/device data, RFID tags. Unstructured: web pages, e-mail, multimedia, instant messages, other binary files. Driven by Mobile/Internet and the Internet of Things.
  • The key is... personalization
  • Mining gold from data: data mining, clustering, classification, product recommendation, precision advertising, algorithmic behavior prediction
  • The three V's (Volume, Variety, Velocity): a lot of un/semi-structured data that must be processed within a bounded time, at a reasonable cost
  • In one sentence: data too big for traditional methods to handle
  • Types of data: social media, machine/sensor data, documents/media, web clickstreams, apps, call logs/xDR, system logs
  • Scale Up vs. Scale Out. Scale Up (up to TB): file systems, ETL tools or scripts, relational databases. Scale Out (TB to PB): distributed file systems, parallel computing, and NoSQL at the raw-data, processing, query, and application layers.
  • Big Data & Hadoop
  • Hadoop and big data processing: relational databases and data warehouses handle the structured 15% of data; a heterogeneous data processing platform handles the unstructured 85%
  • What is Hadoop? A framework for running distributed applications on large clusters built of commodity hardware. Originally created by Doug Cutting; an OSS implementation of Google's MapReduce and GFS. Hardware failures are assumed in the design, and fault tolerance comes via replication. The name "Hadoop" has now evolved to cover a family of software, but the core is essentially MapReduce and a distributed file system.
  • Why Hadoop? The need to process lots of data (up to petabyte scale); the need to parallelize processing across a multitude of CPUs; achieving both while keeping the software simple; scalability on low-cost commodity hardware; linear scalability.
  • What is Hadoop used for? Searching; log processing; recommendation systems; business intelligence / data warehousing; video and image analysis; archiving.
  • Hadoop is more than Hadoop: around the core sit Hive, Pig, ZooKeeper, and friends, providing SQL and raw unstructured data import, a distributed file system, a SQL-like (non-real-time) warehouse, a distributed (real-time) database, a parallel computing framework, a data processing language, and a data mining library for big data applications.
  • Hadoop is a whole ecosystem: ZooKeeper, the coordination service; HBase, a distributed real-time database; Hive, Hadoop's data warehouse system; Pig, Hadoop's dataflow language; Mahout, Hadoop's data mining library; Sqoop, a transfer tool between Hadoop and relational databases.
  • HDFS, the distributed file system
  • HDFS overview: the Hadoop Distributed File System. Based on Google's GFS (Google File System); master/slave architecture; write once, read multiple times; fault tolerant via replication; optimized for larger files; focused on streaming data (high throughput over low latency); rack-aware (reduces inter-cluster network I/O).
  • HDFS client APIs: "shell-like" commands (hadoop dfs [cmd]); a native Java API; a Thrift API for other languages (C++, Java, Python, PHP, Ruby, C#). Shell commands include: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz.
  • HDFS architecture, read path: the client asks the NameNode for the name, replicas, block IDs, and locations, then transfers data directly from the DataNodes' local disks; DataNodes exchange block operations (heartbeat, replication, re-balancing) with the NameNode.
  • HDFS architecture, write path: the client asks the NameNode for the block size, replication factor, and the nodes to write to, then writes the data directly to the DataNodes' local disks.
  • HDFS fault-tolerance mechanism
  • Fault tolerance: when a DataNode fails, the NameNode has the surviving DataNodes automatically re-replicate its blocks.
  • MapReduce, the parallel processing framework
  • MapReduce overview: a distributed programming paradigm, and the framework that is the OSS implementation of Google's MapReduce. Modeled on the ideas behind functional programming's map() and reduce(); distributed over as many nodes as you would like; a two-phase process: map() (subdivide and conquer), then reduce() (combine and reduce).
  • MapReduce ABC's, essentially: (A) take a large problem and divide it into sub-problems; (B) perform the same function on all sub-problems; (C) combine the output from all sub-problems. M/R is excellent for problems where the sub-problems are NOT interdependent: the output of one mapper should not depend on the output of, or communication with, another mapper. The reduce phase doesn't begin execution until all mappers have finished; failed map and reduce tasks get automatically restarted; rack/HDFS aware (data locality).
  • MapReduce flow: input splits in HDFS, then map, then sort/copy/merge (shuffle), then reduce, then output part files in HDFS.
  • Word count example: each mapper processes one file block of "I am a tiger, you are also a tiger" and emits (word, 1) pairs; shuffle & sort groups the pairs by word; the reduce phase sums the counts, producing a,2 also,1 am,1 are,1 I,1 tiger,2 you,1.
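The word-count flow above can be sketched outside Hadoop. A minimal pure-Python simulation of the map, shuffle/sort, and reduce phases (a model of the paradigm, not the Hadoop API):

```python
from collections import defaultdict

def mapper(line):
    # map phase: emit a (word, 1) pair for every word in the input split
    for word in line.replace(",", "").split():
        yield (word, 1)

def shuffle(pairs):
    # shuffle & sort: group all emitted values by key, as the framework does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    # reduce phase: combine all values for one key
    return (word, sum(counts))

lines = ["I am a tiger, you are also a tiger"]
pairs = [kv for line in lines for kv in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())
print(result["a"], result["tiger"])  # → 2 2
```

Note how the reducer only runs after the shuffle has grouped every pair, mirroring the rule that the reduce phase waits for all mappers to finish.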
  • Data locality: M/R TaskTrackers run on the same machines as DataNodes, so tasks are scheduled where their data lives, on the same rack when possible, while unrelated jobs run on different racks or stay idle.
  • Do you have to know Java?
  • Pig: a framework and language (Pig Latin) for creating and submitting Hadoop MapReduce jobs. Common data operations like join, group by, filter, sort, and select are provided. You don't need to know Java; Pig removes the boilerplate aspect of M/R (200 lines in Java become about 15 lines in Pig) and feels like SQL.
  • Pig, fact from the wiki: 40% of Yahoo's M/R jobs are written in Pig. An interactive shell (grunt) exists. User Defined Functions (UDFs) let you supply Java code where the logic is too complex for Pig Latin; UDFs can take part in almost every operation in Pig, and are great for loading and storing custom formats as well as transforming data.
  • Example Pig script (taken from the Pig wiki)
  • HBase, the Big Table
  • What HBase is: NoSQL (meaning "non-SQL", not "SQL sucks"); good at fast/streaming writes; fault tolerant; good at linear horizontal scalability; very efficient at managing billions of rows and millions of columns; good at keeping row history; good at auto-balancing; a complement to a SQL database/DW; great with non-normalized data.
  • What HBase is NOT: made for table joins; made for splitting into normalized tables; a replacement for an RDBMS; great for storing small amounts of data; great for large binary data (prefer < 1 MB per cell); ACID compliant.
  • Data model: the simple view is a map. Table: similar to a relational DB table. Row key: each row is identified and sorted by its key, which maps to the row's value.
  • Data model (columns): a row's value holds multiple columns, e.g. rows row1..row3 each carrying Column1..Column4.
  • Data model (column family): columns are grouped into families. A column family must be predefined with a column family prefix, e.g. "privateData", when creating the schema. A column is denoted using family+qualifier, e.g. "privateData:mobilePhone".
  • Data model (sparse): columns can be added to an existing column family on the fly, and rows can have widely varying numbers of columns.
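Conceptually, the model above is a sorted, sparse, multi-level map. A rough Python sketch, with plain dicts standing in for HBase's sorted structures (the row data is made up for illustration):

```python
# table -> row key -> "family:qualifier" -> value; rows may have different columns
table = {}

def put(row, family, qualifier, value):
    # qualifiers can be added on the fly; only the family is fixed by the schema
    table.setdefault(row, {})[f"{family}:{qualifier}"] = value

put("row1", "privateData", "mobilePhone", "0912-345-678")
put("row1", "publicData", "name", "Alice")
put("row2", "publicData", "name", "Bob")  # row2 has fewer columns: sparse

print(sorted(table))  # rows are kept sorted by row key
print(table["row1"]["privateData:mobilePhone"])
```

The sparseness falls out naturally: absent columns simply have no entry, rather than storing NULLs as a relational table would.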
  • HBase architecture: the master keeps track of the metadata for the Region Servers and the regions they serve, and stores it in ZooKeeper; the HBase client communicates with ZooKeeper only to get region info; all HBase data (HLog & HFile) is stored on HDFS.
  • Hive, Hadoop's data warehouse
  • Hive introduction: developed by Facebook; middleware built on top of Hadoop, designed to manage structured data; uses MapReduce as its execution environment; data is stored on HDFS and metadata in an RDBMS. Hive's design principles: SQL-like syntax; extensibility (types, functions, formats, scripts); both performance and horizontal scalability.
  • Hive, how it works: the Web UI, CLI, and JDBC/ODBC clients talk to the Driver (compiler, optimizer, executor), which consults the metastore and creates M/R jobs that run across the DataNodes of the Hadoop cluster.
  • Hadoop in the enterprise: structured data from RDBMS, ERP, CRM, and LOB apps, and unstructured data from sensors, devices, web logs, and crawlers, flow through connectors into the Etu Appliance; familiar end-user tools (SSRS, SSAS, Hyperion, PowerView, Excel with PowerPivot, embedded BI, predictive analytics) reach it through Hive QL and SQL.
  • Referencing tables in an RDBMS: Customers and Products stay in the RDBMS while WebLogs live in HDFS.
  • Offline data analysis: Customers and Products in the RDBMS; the sales history in HDFS.
  • Mixing historical and online data: current sales in the RDBMS, historical sales (2008, 2009, 2010) in HDFS, queried together over ODBC/JDBC.
  • Using Hadoop for data aggregation: raw WebLogs in HDFS are condensed into a WebLog summary stored in the RDBMS.
  • Hadoop moves into the mainstream
  • Crossing the chasm
  • Challenges for enterprises adopting a Hadoop stack. The technology/talent gap: (1) enterprises are generally unfamiliar with the Hadoop architecture; (2) planning, deploying, managing, and tuning a Hadoop cluster has a high technical barrier. The professional-services gap: (1) a lack of local, professional, experienced Hadoop consulting services; (2) a lack of vendors able to design, deliver, and maintain complete Big Data solutions. The market is still in its early phase; we help you cross the Big Data chasm.
  • Introducing Etu Appliance 2.0: a purpose-built, high-performance appliance for big data processing. It automates cluster deployment, is optimized for the highest big data processing performance, and delivers availability and security.
  • Key benefits to Hadoopers: fully automated deployment and configuration, simplifying configuration and deployment; high availability made easy, so you can reliably deploy and run mission-critical big data applications; enterprise-grade security, to process and control sensitive data with confidence; a fully optimized operating system that boosts big data processing performance; and a scalable, extensible foundation that adapts to your workload and grows with your business.
  • What's new in Etu Appliance 2.0: a new deployment feature, automatic deployment and configuration for master node high availability; new security features, LDAP integration and Kerberos authentication; a new data source, the Etu™ Dataflow data collection service with built-in Syslog and FTP servers for better integration with existing IT infrastructure; a new user experience, the new Etu™ Management Console with an HDFS file browser and HBase table management.
  • Etu Appliance 2.0 hardware specification:
    Master Node, Etu 1000M: CPU 2 x 6-core; RAM 48 GB ECC; HDD 2 x 300GB/SAS 3.5"/15K RPM (RAID 1); NIC 1 x dual-port 1 Gb Ethernet; S/W Etu™ OS / Etu™ Big Data Software Stack; redundant power, 100V~240V.
    Worker Node, Etu 1000W: CPU 2 x 6-core; RAM 48 GB ECC; HDD 4 x 2TB/SATA 3.5"/7.2K RPM; NIC 1 x dual-port 1 Gb Ethernet; S/W Etu™ OS / Etu™ Big Data Software Stack; single power, 100V~240V.
    Worker Node, Etu 2000W: CPU 2 x 6-core; RAM 48 GB; HDD 8 x 2TB/SATA 3.5"/7.2K RPM; NIC 1 x dual-port 1 Gb Ethernet; S/W Etu™ OS / Etu™ Big Data Software Stack; single power, 100V~240V.
  • Sqoop, SQL to Hadoop: What is Sqoop?; Sqoop import to HDFS; Sqoop import to Hive; Sqoop import to HBase; Sqoop incremental imports; Sqoop export.
  • What is Sqoop: a tool designed to transfer data between Hadoop and relational databases; uses MapReduce to import and export the data; provides parallel operation; fault tolerant, of course!
  • How it works: Sqoop turns an SQL statement into map tasks, each of which pulls its share of the rows over JDBC and writes them to HDFS, Hive, or HBase.
  • Using sqoop:
    $ sqoop tool-name [tool-arguments]
    Please try:
    $ sqoop help
  • Sqoop common arguments: --connect <jdbc-uri>, --driver, --help, -P, --password <password>, --username <username>, --verbose.
  • Using options files:
    $ sqoop import --connect jdbc:mysql://etu-master/db ...
    or
    $ sqoop --options-file ./import.txt --table TEST
    where the options file contains:
    import
    --connect
    jdbc:mysql://etu-master/db
    --username
    root
    --password
    etuadmin
  • Sqoop import. Command: sqoop import (generic-args) (import-args). Arguments: --connect <jdbc-uri>, specify the JDBC connect string; --driver <class-name>, manually specify the JDBC driver class to use; --help, print usage instructions; -P, read the password from the console; --password <password>, set the authentication password; --username <username>, set the authentication username; --verbose, print more information while working.
  • Import arguments: --append, append data to an existing dataset in HDFS; -m,--num-mappers <n>, use n map tasks to import in parallel; -e,--query <statement>, import the results of statement; --table <table-name>, the table to read; --target-dir <dir>, the HDFS destination dir; --where <where clause>, the WHERE clause to use during import; -z,--compress, enable compression.
  • Let's try it! Please refer to the L2 training note: import nyse_daily to HDFS; import nyse_dividends to HDFS.
  • Incremental import: Sqoop supports incremental imports. Arguments: --check-column, the column to be examined when deciding which rows to import; --incremental, either append or lastmodified; --last-value, the maximum value recorded from the previous import.
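The append-mode logic can be illustrated with a small sketch: only rows whose check column exceeds the last imported value are fetched, and a new high-water mark is recorded for the next run (plain Python standing in for Sqoop's behaviour; the data and function name are made up):

```python
def incremental_append(rows, check_column, last_value):
    # Sqoop-style --incremental append: keep rows with check_column > last_value
    new_rows = [r for r in rows if r[check_column] > last_value]
    # report the new high-water mark to pass as --last-value on the next run
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last

source = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
imported, last = incremental_append(source, "id", 2)
print([r["id"] for r in imported], last)  # → [3, 4] 4
```

lastmodified mode works the same way but compares a timestamp column instead of a monotonically growing key.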
  • Import all tables: sqoop-import-all-tables imports a whole set of tables from an RDBMS into HDFS.
  • Sqoop to Hive: Sqoop supports Hive. Add the following arguments when importing: --hive-import, make Hive the import target; --hive-table, the target table name in Hive; --hive-overwrite, overwrite the table if it already exists.
  • Sqoop to HBase: --column-family <family>, sets the target column family for the import; --hbase-create-table, if specified, create missing HBase tables; --hbase-row-key <col>, specifies which input column to use as the row key; --hbase-table <table-name>, specifies an HBase table to use as the target instead of HDFS.
  • Sqoop export: the target table must already exist; the default operation is INSERT, but you can specify UPDATE. Syntax: sqoop export (generic-args) (export-args).
  • Export arguments: --export-dir <dir>, the HDFS source path for the export; --table <name>, the table to populate; --update-key, the anchor column to use for updates; --update-mode, updateonly or allowinsert.
  • Sqoop Job
  • More Sqoop information, the Sqoop User Guide: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
  • Coffee break!
  • Pig programming: introduction to Pig; reading and writing data with Pig; Pig Latin basics; debugging Pig scripts; Pig best practices; Pig and HBase.
  • Pig introduction: Pig was originally created at Yahoo! to answer a similar need to Hive. Many developers did not have the Java and/or MapReduce knowledge required to write standard MapReduce programs, but still needed to query data. Pig is a dataflow language: the language is called Pig Latin and has a relatively simple syntax; under the covers, Pig Latin scripts are turned into MapReduce jobs and executed on the cluster.
  • Pig features: Pig supports many features which allow developers to perform sophisticated data analysis without having to write Java MapReduce code: joining datasets; grouping data; referring to elements by position rather than name (useful for datasets with many elements); loading non-delimited data using a custom SerDe; creation of user-defined functions, written in Java; and more.
  • Pig word count:
    Book = LOAD 'shakespeare/*' USING PigStorage() AS (lines:chararray);
    Wordlist = FOREACH Book GENERATE FLATTEN(TOKENIZE(lines)) AS word;
    GroupWords = GROUP Wordlist BY word;
    CountGroupWords = FOREACH GroupWords GENERATE group AS word, COUNT(Wordlist) AS num_occurence;
    WordCountSorted = ORDER CountGroupWords BY $1 DESC;
    STORE WordCountSorted INTO 'wordcount' USING PigStorage(',');
  • Pig data types. Scalar types: int, long, float, double, chararray, bytearray. Complex types: tuple, e.g. (19,2,3); bag, e.g. {(19,2), (18,1)}; map, e.g. [open#apache]. Plus NULL.
  • Pig data type concepts: in Pig, a single element of data is an atom; a collection of atoms, such as a row or a partial row, is a tuple; tuples are collected together into bags. Typically, a Pig Latin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has.
  • Pig schema: Pig eats everything. If a schema is available, Pig will make use of it; if not, Pig makes the best guesses it can based on how the script treats the data.
    A = LOAD 'text.csv' AS (field1:chararray, field2:int);
    In the example above, Pig will expect this data to have 2 fields with the specified data types: if there are more fields they will be truncated, and if there are fewer fields NULLs will be filled in.
  • Pig Latin, data input. The function is LOAD:
    sample = LOAD 'text.csv' AS (field1:chararray, field2:int);
    In the example above, sample is the name of the relation and the file text.csv is loaded. Pig will expect this data to have 2 fields with the specified data types: extra fields are truncated, and missing fields are filled with NULLs.
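The truncate/pad rule described above can be sketched in a few lines of Python (an illustration of the rule, not Pig's implementation; the helper name is made up):

```python
def load_record(fields, schema_len):
    # extra fields beyond the schema are truncated;
    # missing fields are filled with NULL (None here)
    fields = fields[:schema_len]
    return fields + [None] * (schema_len - len(fields))

print(load_record(["abc", "1", "extra"], 2))  # → ['abc', '1']
print(load_record(["abc"], 2))                # → ['abc', None]
```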
  • Pig Latin, data output. STORE outputs a relation into a specified HDFS folder:
    STORE sample_out INTO '/tmp/output';
    DUMP outputs a relation to the screen:
    DUMP sample_out;
  • Pig Latin relational operations: FOREACH; FILTER; GROUP; ORDER BY; DISTINCT; JOIN; LIMIT.
  • Pig Latin, FOREACH: FOREACH takes a set of expressions, applies them to every record in the data pipeline, and generates new records to send down the pipeline to the next operator. Example:
    a = LOAD 'text.csv' AS (id, name, phone, zip, address);
    b = FOREACH a GENERATE id, name;
  • Pig Latin, FILTER: FILTER lets you select which records are retained in your data pipeline. Example (matches takes a regular expression, so '100.*' selects ids starting with 100):
    a = LOAD 'text.csv' AS (id, name, phone, zip, address);
    b = FILTER a BY id matches '100.*';
  • Pig Latin, GROUP: the GROUP statement collects together records with the same key. It differs from the GROUP BY clause in SQL: in Pig Latin there is no direct connection between GROUP and aggregate functions. GROUP collects all records with the given key into a bag, which you can then pass to an aggregate function.
  • Pig Latin, GROUP (cont.). Example:
    A = LOAD 'text.csv' AS (id, name, phone, zip, address);
    B = GROUP A BY zip;
    C = FOREACH B GENERATE group, COUNT(A);
    STORE C INTO 'population_by_zipcode';
  • Pig Latin, ORDER BY: the ORDER statement sorts your data by the specified field(s). Example:
    a = LOAD 'text.csv' AS (id, name, phone, zip, address, fee);
    b = ORDER a BY fee;
    c = ORDER a BY fee DESC, name;
    DUMP c;
  • Pig Latin, DISTINCT: the DISTINCT statement removes duplicate records. Note it works only on entire records, not on individual fields. Example:
    a = LOAD 'url.csv' AS (userid, url, dl_bytes, ul_bytes);
    b = FOREACH a GENERATE userid, url;
    c = DISTINCT b;
  • Pig Latin, JOIN: JOIN selects records from one input to put together with records from another input. You indicate keys from each input, and when those keys are equal, the two rows are joined. Example:
    call = LOAD 'call.csv' AS (MSISDN, callee, duration);
    user = LOAD 'user.csv' AS (name, MSISDN, address);
    call_bill = JOIN call BY MSISDN, user BY MSISDN;
    bill = FOREACH call_bill GENERATE name, call::MSISDN, callee, duration, address;
    STORE bill INTO 'to_be_billed';
  • Pig Latin, LIMIT: LIMIT caps the number of results in the output. Example:
    a = LOAD 'text.csv' AS (id, name, phone, zip, address, fee);
    b = ORDER a BY fee DESC, name;
    top100 = LIMIT b 100;
    DUMP top100;
  • Pig Latin, UDFs: a UDF (User Defined Function) lets users combine Pig operators with their own or others' code. UDFs can be written in Java and Python, and have to be registered before use. Piggybank is a useful collection. Example:
    register 'path_to_UDF/piggybank.jar';
    a = LOAD 'text.csv' AS (id, name, phone, zip, address, fee);
    b = FOREACH a GENERATE id, org.apache.pig.piggybank.evaluation.string.Reverse(name);
  • Debugging Pig: DESCRIBE, show the schema of a relation in your script; EXPLAIN, show your script's execution plan in MapReduce terms; ILLUSTRATE, run the script against sampled data; Pig statistics, a summary set of statistics on your script.
  • More about Pig: visit Pig's home page, http://pig.apache.org, and the docs at http://pig.apache.org/docs/r0.9.2/
  • Coffee break!
  • Hive programming and hands-on: Hive introduction; getting data into Hive; manipulating data with Hive; partitioning and bucketing data; Hive best practices; Hive and HBase.
  • Hive introduction: Hive was originally developed at Facebook. It provides a very SQL-like language and can be used by people who know SQL; under the covers, it generates MapReduce jobs that run on the Hadoop cluster; enabling Hive requires almost no extra work by the system administrator.
  • Hive architecture: the Driver compiles HiveQL into MapReduce tasks, optimizes them, and submits them to the JobTracker for execution; the CLI/Web UI provides ad-hoc queries, schema browsing, and administration; the metastore holds the metadata; JDBC/ODBC offers a standard interface to other database tools and applications.
  • Hive, how it works: the Web UI, CLI, and JDBC/ODBC clients talk to the Driver (compiler, optimizer, executor), which consults the metastore and creates M/R jobs that run across the DataNodes of the Hadoop cluster.
  • The Hive data model: Hive "layers" table definitions on top of data in HDFS. Databases; tables, with typed columns (int, float, string, boolean, etc.) plus list and map (for JSON-like data); partitions; buckets.
  • Hive datatypes, primitive types: TINYINT (1-byte signed integer); SMALLINT (2-byte signed integer); INT (4-byte signed integer); BIGINT (8-byte signed integer); BOOLEAN (TRUE or FALSE); FLOAT (single-precision floating point); DOUBLE (double-precision floating point); STRING (array of chars); BINARY (array of bytes); TIMESTAMP (integer, float, or string).
  • Hive datatypes, collection types: ARRAY <primitive-type>; MAP <primitive-type, primitive-type>; STRUCT <col_name : primitive-type, …>.
  • Text file delimiters: by default, Hive stores data as text files, but you can choose other file formats. Hive's default record and field delimiters:
    CREATE TABLE …
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;
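The defaults are non-printing control characters: \001 (Ctrl-A) between fields, \002 between collection items, and \003 between a map key and its value. A small Python sketch of how one raw text-file row with a string, an array, and a map column would be decoded (the sample row is made up):

```python
# one raw row: name \001 array-of-hobbies \001 map-of-properties
line = "Alice\x01pig\x02hive\x01city\x03Taipei"

name, hobbies_raw, props_raw = line.split("\x01")  # fields split on \001
hobbies = hobbies_raw.split("\x02")                # collection items on \002
key, value = props_raw.split("\x03")               # map key/value on \003

print(name, hobbies, {key: value})  # → Alice ['pig', 'hive'] {'city': 'Taipei'}
```

Control characters are chosen precisely because they almost never occur in real field values, unlike commas or tabs.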
  • The Hive metastore: Hive's metastore is a database containing table definitions and other metadata. By default it is stored locally on the client machine in a Derby database; if multiple people will be using Hive, the system administrator should create a shared metastore, usually in MySQL or some other relational database server.
  • Hive is schema-on-read. A relational database is schema-on-write: a gatekeeper, where altering the schema is painful! Hive is schema-on-read and requires less ETL effort.
  • Hive data, physical layout: Hive tables are stored in Hive's "warehouse" directory in HDFS, by default /user/hive/warehouse. Tables are stored in subdirectories of the warehouse directory, and partitions form subdirectories of tables. It is possible to create external tables if the data is already in HDFS and should not be moved from its current location. The actual data is stored in flat files: control-character-delimited text or SequenceFiles, or an arbitrary format with the use of a custom serializer/deserializer ("SerDe").
  • Hive limitations: not all "standard" SQL is supported (no correlated subqueries, for example); no support for UPDATE or DELETE; no support for INSERTing single rows; a relatively limited number of built-in functions.
  • Starting the Hive shell: launch a terminal and run $ hive, which results in the Hive prompt, hive>. Autocomplete is bound to Tab. To show query column headers and the current database:
    hive> set hive.cli.print.header=true;
    hive> set hive.cli.print.current.db=true;
  • Hive's word count:
    CREATE TABLE docs (line STRING);
    LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
    CREATE TABLE word_counts AS
    SELECT word, count(1) AS count FROM
    (SELECT explode(split(line, '\\s')) AS word FROM docs) w
    GROUP BY word
    ORDER BY count DESC;
    SELECT * FROM word_counts LIMIT 30;
  • HiveQL: data definition
  • Modify tables:
    ALTER TABLE … CHANGE COLUMN old_name new_name type AFTER column;
    ALTER TABLE … (ADD|REPLACE) COLUMNS (column_name column_type);
  • Partitions help to organize data in a logical fashion, such as hierarchically:
    CREATE TABLE … PARTITIONED BY (column_name datatype, …)
    CREATE TABLE employees (name STRING, salary FLOAT)
    PARTITIONED BY (country STRING, state STRING)
    Physical layout in Hive:
    …/employees/country=CA/state=AB
    …/employees/country=CA/state=BC
    …
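The physical layout follows directly from the partition columns: each partition column becomes a name=value subdirectory under the table directory. A small sketch of that path convention (path-building only, not Hive code; the helper name is made up):

```python
def partition_path(warehouse, table, **partitions):
    # each partition column becomes a name=value subdirectory, in declaration order
    parts = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"{warehouse}/{table}/{parts}"

print(partition_path("/user/hive/warehouse", "employees", country="CA", state="AB"))
# → /user/hive/warehouse/employees/country=CA/state=AB
```

This is why a WHERE clause on partition columns is cheap: Hive can prune whole subdirectories without reading any data.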
  • HiveQL: data manipulation
  • Loading data into Hive: LOAD DATA INPATH … INTO TABLE … PARTITION …. Data is loaded into Hive with the LOAD DATA INPATH statement, which assumes that the data is already in HDFS:
    LOAD DATA INPATH 'shakespeare_freq' INTO TABLE shakespeare;
    If the data is on the local filesystem, use LOAD DATA LOCAL INPATH.
  • Inserting data into a table from queries:
    INSERT OVERWRITE TABLE employees
    PARTITION (country='US', state='OR')
    SELECT * FROM staged_employees se
    WHERE se.cnty='US' AND se.st='OR';
  • Dynamic partition properties:
    hive.exec.dynamic.partition=true
    hive.exec.dynamic.partition.mode=nonstrict
    hive.exec.max.dynamic.partitions.pernode=100
    hive.exec.max.dynamic.partitions=1000
    hive.exec.max.created.files=100000
  • Dynamic partition inserts. What if I have very many partitions?
    INSERT OVERWRITE TABLE employees
    PARTITION (country, state)
    SELECT …, se.cnty, se.st
    FROM staged_employees se;
    You can mix static and dynamic partitions, for example:
    INSERT OVERWRITE TABLE employees
    PARTITION (country='US', state)
    SELECT …, se.cnty, se.st
    FROM staged_employees se
    WHERE se.cnty = 'US';
  • Create a table and load data in one step:
    CREATE TABLE ca_employees
    AS SELECT name, salary
    FROM employees se
    WHERE se.state='CA';
  • Storing output results: a bare SELECT statement like the previous slide's writes the data to the console. To store the result in HDFS, create a new table and then write to it, for example:
    INSERT OVERWRITE TABLE newTable SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5;
    Results are stored in the table, and are just files within the newTable directory; the data can be used in subsequent queries, or in MapReduce jobs.
  • Exporting data: if the data files are already formatted as you want, just copy them. Otherwise you can use INSERT … DIRECTORY …, for example:
    INSERT OVERWRITE LOCAL DIRECTORY './ca_employees'
    SELECT name, salary, address
    FROM employees se
    WHERE se.state='CA';
  • HiveQL: queries
  • SELECT … FROM:
    SELECT col_name or functions FROM tab_name;
    hive> SELECT name FROM employees e;
    SELECT … FROM … [LIMIT N], with * or column aliases, column arithmetic operators, and aggregation functions.
  • Arithmetic operators (on numbers): A + B, add A and B; A - B, subtract B from A; A * B, multiply A and B; A / B, divide A by B; A % B, the remainder of dividing A by B; A & B, bitwise AND; A | B, bitwise OR; A ^ B, bitwise XOR; ~A, bitwise NOT of A.
  • Aggregate functions: count(*), count(expr), sum(col), sum(DISTINCT col), avg(col), min(col), max(col), variance(col)/var_pop(col), var_samp(col), stddev_pop(col), stddev_samp(col), covar_pop(col), covar_samp(col), corr(col1, col2), percentile(int_expr, p), histogram_numeric, collect_set(col). Map-side aggregation for a performance improvement:
    hive> SET hive.map.aggr=true;
  • Other functions: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-BuiltinFunctions
  • When Hive can avoid MapReduce:
    SELECT * FROM employees;
    SELECT * FROM employees
    WHERE country='us' AND state='CA'
    LIMIT 100;
  • WHERE: >, <, =, >=, <=, !=; IS NULL / IS NOT NULL; OR, AND, NOT; LIKE, with X% (prefix "X"), %X (suffix "X"), %X% (substring), and _ (single character); RLIKE (Java regular expressions).
  • GROUP BY: often used in conjunction with aggregate functions such as avg and count. HAVING constrains the groups produced by GROUP BY in a way that would otherwise require a subquery:
    SELECT year(ymd), avg(price_close) FROM stocks
    WHERE exchange='NASDAQ' AND symbol='AAPL'
    GROUP BY year(ymd)
    HAVING avg(price_close) > 50.0;
  • Inner JOIN:
    SELECT a.ymd, a.price_close, b.price_close
    FROM stocks a JOIN stocks b ON a.ymd=b.ymd
    WHERE a.symbol='AAPL' AND b.symbol='IBM';
    SELECT s.ymd, s.symbol, s.price_close, d.dividend
    FROM stocks s JOIN dividends d ON s.ymd=d.ymd AND s.symbol=d.symbol
    WHERE s.symbol='AAPL';
  • LEFT OUTER JOIN: all records from the left-hand table that match the WHERE clause are returned; NULL is returned for right-hand columns when no record matches the ON criteria.
    SELECT s.ymd, s.symbol, s.price_close, d.dividend
    FROM stocks s LEFT OUTER JOIN dividends d
    ON s.ymd=d.ymd AND s.symbol=d.symbol
    WHERE s.symbol='AAPL';
  • RIGHT and FULL OUTER JOIN: a RIGHT OUTER JOIN returns all records from the right-hand table that match the WHERE clause, with NULLs where the ON criteria find no match; a FULL OUTER JOIN returns all matching records from both tables, again with NULLs where the ON criteria find no match.
  • LEFT SEMI-JOIN: returns records from the left-hand table if records are found in the right-hand table that satisfy the ON predicates.
    SELECT s.ymd, s.symbol, s.price_close
    FROM stocks s LEFT SEMI JOIN dividends d
    ON s.ymd = d.ymd AND s.symbol = d.symbol;
    RIGHT SEMI-JOIN is not supported.
  • Map-side joins: if one of the tables is small, the largest table can be streamed through the mappers while the small tables are cached in memory:
    SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
    FROM stocks s JOIN dividends d ON s.ymd=d.ymd AND s.symbol=d.symbol
    WHERE s.symbol='AAPL';
  • ORDER BY and SORT BY: ORDER BY performs a total ordering of the query result set; all data passes through a single reducer, so beware with larger data sets. For example:
    SELECT s.ymd, s.symbol, s.price_close FROM stocks s ORDER BY s.ymd ASC, s.symbol DESC;
    SORT BY performs a local ordering, where each reducer's output is sorted.
  • DISTRIBUTE BY with SORT BY: by default, MapReduce partitions mapper output by a hash of the key-value, which can spread records with the same column value across reducers. We can use DISTRIBUTE BY to ensure that records with the same value of a column go to the same reducer, and SORT BY to order the data within each reducer:
    SELECT s.ymd, s.symbol, s.price_close FROM stocks s
    DISTRIBUTE BY s.symbol
    SORT BY s.symbol ASC, s.ymd ASC;
    DISTRIBUTE BY must appear before SORT BY.
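The combined effect can be simulated: hash-partition rows by one column so equal values land in the same reducer partition, then sort within each partition only. This is a model of the behaviour, not Hive internals; the sample rows are made up:

```python
def distribute_sort(rows, distribute_key, sort_keys, n_reducers=2):
    # DISTRIBUTE BY: the same key value always maps to the same reducer partition
    partitions = [[] for _ in range(n_reducers)]
    for row in rows:
        partitions[hash(row[distribute_key]) % n_reducers].append(row)
    # SORT BY: each reducer sorts locally; there is no total order across partitions
    return [sorted(p, key=lambda r: [r[k] for k in sort_keys]) for p in partitions]

rows = [{"symbol": "AAPL", "ymd": "2010-01-02"},
        {"symbol": "IBM",  "ymd": "2010-01-01"},
        {"symbol": "AAPL", "ymd": "2010-01-01"}]
parts = distribute_sort(rows, "symbol", ["symbol", "ymd"])
# both AAPL rows land in one partition, sorted by symbol then date
```

Concatenating the partitions gives locally sorted runs, not a globally sorted result; that is exactly the difference from ORDER BY.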
  • CLUSTER BY: shorthand for DISTRIBUTE BY … SORT BY on the same columns; the sort is always ascending, and no ordering direction can be specified.
    SELECT s.ymd, s.symbol, s.price_close
    FROM stocks s
    CLUSTER BY s.symbol;
  • Creating user-defined functions: Hive supports manipulation of data via user-created functions. Example:
    INSERT OVERWRITE TABLE u_data_new
    SELECT TRANSFORM (userid, movieid, rating, unixtime)
    USING 'python weekday_mapper.py'
    AS (userid, movieid, rating, weekday)
    FROM u_data;
  • Hive, where to learn more: http://hive.apache.org/ and the book Programming Hive.
  • Choosing between Pig and Hive: typically, organizations wanting an abstraction on top of standard MapReduce choose either Hive or Pig. Which one depends on the skill set of the target users: those with an SQL background naturally gravitate towards Hive, while those who do not know SQL often choose Pig. Each has strengths and weaknesses, and it is worth spending some time investigating each so you can make an informed decision. Some organizations now choose to use both: Pig deals better with less-structured data, so Pig is used to manipulate the data into a more structured form, and Hive is then used to query that structured data.
Contact
    www.etusolution.com
    info@etusolution.com
    Taipei, Taiwan: 318, Rueiguang Rd., Taipei 114, Taiwan; T: +886 2 7720 1888; F: +886 2 8798 6069
    Beijing, China: Room B-26, Landgent Center, No. 24, East Third Ring Middle Rd., Beijing, China 100022; T: +86 10 8441 7988; F: +86 10 8441 7227