HCatalog & Templeton
HCatalog & Templeton: introduction and examples

HCatalog & Templeton Presentation Transcript

  • HCatalog & Templeton
    Youngwoo Kim (brandon.kim@nexr.com, kt.com), Daegeun Kim (dani.kim@geekple.com)
    Data Analytics Platform, KTCloudware (NexR)
    Wednesday, July 18, 12
  • HCatalog
  • Hadoop Ecosystem (Many data processing tools)
    [Diagram. Tools: MapReduce, Hive, Pig. Labels: LoadFunc, StoreFunc, Metastore, SerDe, SerDe, RDBMS, InputFormat / OutputFormat / ..., Filesystem]
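  • Problems
    • No tool other than Hive has a metastore
    • When several tools share one cluster, integrating them is not easy
    • Communication cost is incurred every time
    • Where? How? What?
    • M/R and Pig users have a lot of information to remember
    • Changes to the schema, data path, or format heavily affect M/R and Pig jobs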
  • HCatalog
    • Apache Incubator project
    • Based on the Hive metastore
    • Gives M/R and Pig users programming interfaces for reading and writing data
    • Provides every DDL command that does not require a MapReduce job (CLI commands)
    • Excludes import/export, CREATE TABLE AS SELECT, and similar commands
    • Provides data exploration
    • SHOW TABLES, DESCRIBE
    • http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html
    • Developed by Hortonworks, Yahoo, Twitter, and others
  • Table abstraction
    • Metadata
    • Data location, schema, compression, partitions, format, etc.
    • Data is abstracted through HCatalog
    • Metadata is managed in one place, which makes that role all the more important
    • Supported column types: primitives, map, list, struct
  • HCatalog
    [Diagram. Tools: MapReduce, Hive, Pig. Labels: HCatInputFormat, HCatOutputFormat, HCatLoader, HCatStorer, Metastore, SerDe, SerDe, InputFormat, OutputFormat, RDBMS, Filesystem]
  • Data types: Pig
      HCatalog (= Hive)                                           Pig
      primitives (int, long, float, double, string)               int, long, float, double, chararray
      map (contains key and value pairs)                          map
      list (contains a list of elements of the same data type)    bag
      struct (contains elements of different data types)          tuple
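The type correspondence on this slide can be written down as a small lookup. The sketch below is only an illustration of the table, using plain JDK collections; the class name `HCatToPigTypes` and the method `toPig` are hypothetical and are not part of HCatalog or Pig.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: encodes the slide's HCatalog (= Hive) to Pig type table.
class HCatToPigTypes {
    static final Map<String, String> TYPE_MAP = new LinkedHashMap<>();
    static {
        // Primitives: only string changes its name on the Pig side.
        TYPE_MAP.put("int", "int");
        TYPE_MAP.put("long", "long");
        TYPE_MAP.put("float", "float");
        TYPE_MAP.put("double", "double");
        TYPE_MAP.put("string", "chararray");
        // Complex types.
        TYPE_MAP.put("map", "map");
        TYPE_MAP.put("list", "bag");
        TYPE_MAP.put("struct", "tuple");
    }

    // Returns the Pig type name listed on the slide for an HCatalog type name.
    static String toPig(String hcatType) {
        String pig = TYPE_MAP.get(hcatType.toLowerCase(Locale.ROOT));
        if (pig == null) {
            throw new IllegalArgumentException("No mapping on the slide for: " + hcatType);
        }
        return pig;
    }

    public static void main(String[] args) {
        for (String hcatType : TYPE_MAP.keySet()) {
            System.out.println(hcatType + " -> " + toPig(hcatType));
        }
    }
}
```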
  • Examples
  • DDL
      $HCAT_HOME/bin/hcat -e "
        drop table if exists rawevents;
        create external table rawevents (url string, user string)
        partitioned by (ds string)
      "
      $HIVE_HOME/bin/hive -e "
        LOAD DATA LOCAL INPATH '...' OVERWRITE INTO TABLE rawevents PARTITION (ds='20120530')
      "
  • Pig
      raw = LOAD '/data/rawevents/20120530' AS (url, user);
      botless = FILTER raw BY myudfs.NotABot(user);
      grpd = GROUP botless BY (url, user);
      cntd = FOREACH grpd GENERATE FLATTEN(group), COUNT(botless);
      STORE cntd INTO '/data/counted/20120530';
    Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012, page 8
  • Pig + HCatalog
    Load
      Pig:            raw = LOAD '/data/rawevents/20120530' AS (url, user);
      Pig + HCatalog: raw = LOAD 'rawevents' USING org.apache.hcatalog.pig.HCatLoader();
    Partition filter (instead of the dated path in LOAD '/data/rawevents/20120530')
      Pig + HCatalog: raw_0530 = FILTER raw BY ds == '20120530';
    Store
      Pig:            STORE cntd INTO '/data/counted/20120530';
      Pig + HCatalog: STORE cntd INTO 'counted' USING org.apache.hcatalog.pig.HCatStorer();
    Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012, page 8
  • MapReduce
    • Use the HCatInputFormat and HCatOutputFormat classes
    • The value class is HCatRecord by default
    • The key is not used
    • Set OutputValueClass to HCatRecord
    • As always, a Reducer is optional, not required
    • Partitions can be controlled
    • Easy to control through the schema
  • MapReduce - Job
      Job job = new Job(getConf());
      job.setJarByClass(HCatMRTest.class);
      job.setJobName("HCatMRTest");
      job.setOutputKeyClass(WritableComparable.class);
      job.setOutputValueClass(HCatRecord.class);
      job.setMapperClass(HCatMRTest.Map.class);
      job.setInputFormatClass(HCatInputFormat.class);
      job.setOutputFormatClass(HCatOutputFormat.class);
      job.setNumReduceTasks(0);
  • MapReduce - DB, TBL, Partition
      java.util.Map<String, String> partition = ...
      partition.put("ds", "20120530");
      in = InputJobInfo.create("DB", "rawevents", "ds=20120530");
      out = OutputJobInfo.create("DB", "counted", partition);
      HCatInputFormat.setInput(job, in);
      HCatOutputFormat.setOutput(job, out);
      HCatSchema s = HCatOutputFormat.getTableSchema(job);
      HCatOutputFormat.setSchema(job, s);
  • MapReduce - HCatRecord
    • The class that represents a single record
    • boolean, byte, short, integer, long, float, double, string, list, struct, map
    • tinyint : HCatRecord.getByte
    • smallint : HCatRecord.getShort
    • Fields can be accessed by index or by column name
    • Access by column name requires the HCatSchema
    • Reserve space in the record for the partition columns
  • MapReduce - HCatRecord
    How to obtain the table schema
      HCatSchema in = HCatInputFormat.getTableSchema(context);
      HCatSchema out = HCatOutputFormat.getTableSchema(context);
      // HCatRecord is abstract, so a DefaultHCatRecord is instantiated
      HCatRecord record = new DefaultHCatRecord(3);
      record.set("url", out, value.get("url", in));
      context.write(null, record);
    The schema information is recorded (encoded) in job.xml
      * mapreduce.lib.hcat.job.info
      * mapreduce.lib.hcatoutput.info
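Why access by column name needs a schema can be shown with a toy model. The sketch below deliberately does not use the HCatalog classes: `ToySchema`, `ToyRecord`, and `ToyRecordDemo` are hypothetical stand-ins, assuming only that a record stores values by position and that a schema maps column names to positions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a schema: an ordered list of column names.
class ToySchema {
    private final List<String> columns;

    ToySchema(String... columns) {
        this.columns = Arrays.asList(columns);
    }

    int size() {
        return columns.size();
    }

    // Resolves a column name to its position in the record.
    int positionOf(String name) {
        int index = columns.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return index;
    }
}

// Hypothetical stand-in for a record: values are stored by position only.
class ToyRecord {
    private final Object[] values;

    ToyRecord(int size) {
        this.values = new Object[size];
    }

    Object get(int index) {
        return values[index];
    }

    void set(int index, Object value) {
        values[index] = value;
    }

    // Name-based access is positional access plus a schema lookup.
    Object get(String name, ToySchema schema) {
        return get(schema.positionOf(name));
    }

    void set(String name, ToySchema schema, Object value) {
        set(schema.positionOf(name), value);
    }
}

class ToyRecordDemo {
    // Copies "url" from an input record into a new output record,
    // even though the two schemas place the column at different positions.
    static ToyRecord copyUrl(ToyRecord input, ToySchema in, ToySchema out) {
        ToyRecord output = new ToyRecord(out.size()); // room for every output column
        output.set("url", out, input.get("url", in));
        return output;
    }

    public static void main(String[] args) {
        ToySchema in = new ToySchema("user", "url");
        ToySchema out = new ToySchema("url", "user", "ds");
        ToyRecord input = new ToyRecord(in.size());
        input.set("url", in, "http://example.com/");
        System.out.println(copyUrl(input, in, out).get(0));
    }
}
```

The output record is sized from the output schema, which mirrors the slide's advice to leave room for partition columns such as `ds`.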
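  • Conclusions
    • Even when only Pig and MR are used, metadata management becomes easier
    • The benefit is greatest when several tools are used together
    • Contributions are arriving quickly, so more can be expected in the future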
  • Templeton
  • The Templeton project is named after a character in the award-winning children's novel Charlotte's Web, by E. B. White. The novel's protagonist is a pig named Wilbur. Templeton is a rat who helps Wilbur by running errands and making deliveries.
  • Templeton
    • HCatalog integration
    • Thrift
    • Java API (HCATALOG-419)
    • REST API
    • Web services interface for HCatalog access and Pig, Hive and MR job execution
    • http://github.com/hortonworks/templeton
    • HCATALOG-182
    • a.k.a. 'webhcat'
  • Getting started
    • Install
      ◦ Requirements
        ■ Hadoop 0.20.205 or Hadoop 1.x
        ■ ZooKeeper
        ■ HCatalog
        ■ Hadoop Distributed Cache
        ■ To use the Hive, Pig, or hadoop/streaming resources
    • Configuration
      ◦ templeton-site.xml
    • Security
      ◦ Default security (without additional authentication)
      ◦ Authentication via Kerberos
  • Templeton Resources
      :version    Returns a list of supported response types.
      status      Returns the Templeton server status.
      version     Returns a list of supported versions and the current version.
  • Templeton Resources (2)
      ddl                                                          Performs an HCatalog DDL command.
      ddl/database                                                 List HCatalog databases.
      ddl/database/:db (GET)                                       Describe an HCatalog database.
      ddl/database/:db (PUT)                                       Create an HCatalog database.
      ddl/database/:db (DELETE)                                    Delete (drop) an HCatalog database.
      ddl/database/:db/table                                       List the tables in an HCatalog database.
      ddl/database/:db/table/:table (GET)                          Describe an HCatalog table.
      ddl/database/:db/table/:table (POST)                         Rename an HCatalog table.
      ddl/database/:db/table/:table/partition                      List all partitions in an HCatalog table.
      ddl/database/:db/table/:table/partition/:partition (GET)     Describe a single partition in an HCatalog table.
      ......                                                       ......
      ddl/database/:db/table/:table/partition/:partition (PUT)
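The resource paths in this table expand into full URLs once a server and user are chosen. The following is a minimal sketch using only `java.net.URI`; the class `TempletonPaths` and its methods are hypothetical, the host `tb080:50111` and the `user.name` query parameter are taken from the curl examples in this deck, and the partition spec `ds=20120530` is an assumed format used only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper that expands Templeton DDL resource paths into URLs.
class TempletonPaths {
    static final String HOST = "tb080";          // host used in the deck's curl examples
    static final int PORT = 50111;               // port used in the deck's curl examples
    static final String ROOT = "/templeton/v1";

    // Builds http://HOST:PORT/templeton/v1/<path>?user.name=<user>.
    static URI resource(String path, String user) {
        try {
            // The multi-argument constructor quotes characters that are illegal in a path or query.
            return new URI("http", null, HOST, PORT, ROOT + "/" + path, "user.name=" + user, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // ddl/database/:db/table/:table
    static URI table(String db, String table, String user) {
        return resource("ddl/database/" + db + "/table/" + table, user);
    }

    // ddl/database/:db/table/:table/partition/:partition
    static URI partition(String db, String table, String partition, String user) {
        return resource("ddl/database/" + db + "/table/" + table + "/partition/" + partition, user);
    }

    public static void main(String[] args) {
        System.out.println(table("default", "emp", "nexr"));
        // The partition spec format below is an assumption, not taken from the slides.
        System.out.println(partition("default", "rawevents", "ds=20120530", "nexr"));
    }
}
```

The HTTP method (GET, PUT, POST, DELETE) selects the operation on the same URL, so a builder like this only needs to cover the path.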
  • Templeton Resources (3)
      mapreduce/streaming       Creates and queues Hadoop streaming MapReduce jobs.
      mapreduce/jar             Creates and queues standard Hadoop MapReduce jobs.
      pig                       Creates and queues Pig jobs.
      hive                      Runs Hive queries and commands.
      queue                     Returns a list of all jobids registered for the user.
      queue/:jobid (GET)        Returns the status of a job given its ID.
      queue/:jobid (DELETE)     Kill a job given its ID.
  • Examples
      $ curl -s http://tb080:50111/templeton/v1/status
      {"status":"ok","version":"v1"}

      $ curl -s -d user.name=nexr -d 'exec=show tables;' http://tb080:50111/templeton/v1/ddl
      {
        "stdout": "emp\nname\nname_a29\n",
        "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... //[jar:file:/home/nexr/nexr_platforms/hadoop/hadoop-1.0.3/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]\nSLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.\nOK\nTime taken: 0.491 seconds\n",
        "exitcode": 0
      }
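The `-d` options in the second curl call send a form-encoded POST body, and the statement `show tables;` contains a space and a semicolon that have to be encoded. A minimal sketch of building that body with the JDK alone follows; the class `FormBody` and its methods are hypothetical, and only the field names `user.name` and `exec` come from the example above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that builds an application/x-www-form-urlencoded body.
class FormBody {
    // Encodes each key and value and joins the pairs with '&'.
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Body for the ddl resource: the user name plus the statement to execute.
    static String ddlBody(String user, String statement) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("user.name", user);
        fields.put("exec", statement);
        return encode(fields);
    }

    public static void main(String[] args) {
        System.out.println(ddlBody("nexr", "show tables;"));
    }
}
```

A `LinkedHashMap` keeps the fields in insertion order, so the body reads the same way as the curl command line.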
  • Examples
      $ curl -s 'http://tb080:50111/templeton/v1/ddl/database/default/table/emp?user.name=nexr'
      {
        "statement": "use default; desc emp; ",
        "error": "...",
        "exec": {
          "stdout": "{\"columns\":[{\"name\":\"empno\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"deptno\",\"type\":\"int\"}]}\t \t \n",
          "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... explanation.\nOK\nTime taken: 0.324 seconds\nOK\nTime taken: 0.398 seconds\n",
          "exitcode": 0
        }
      }
  • Examples
      $ curl -s -X PUT -H 'Content-type: application/json' -d '{
          "comment": "Test table",
          "columns": [
            { "name": "id", "type": "bigint" },
            { "name": "price", "type": "float", "comment": "The unit price" } ],
          "partitionedBy": [
            { "name": "country", "type": "string" } ],
          "format": { "storedAs": "rcfile" } }' \
          'http://tb080:50111/templeton/v1/ddl/database/default/table/test_table?user.name=nexr'

      hive> show tables;
      OK
      emp
      test_table
      Time taken: 0.477 seconds
      hive> describe extended test_table;
      OK
      id         bigint
      price      float     The unit price
      country    string
      Detailed Table Information    Table(tableName:test_table, dbName:default, owner:nexr, createTime:1342578059, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:bigint, comment:null), FieldSchema(name:price, type:float, comment:The unit price), FieldSchema(name:country, type:string,
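The same table-creation call can be prepared from Java. The sketch below assumes Java 11 or later and uses only the JDK's `java.net.http` package; the class `CreateTableRequest` is hypothetical, while the URL and JSON body repeat the curl example above. It only builds the request and sends nothing.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that prepares the PUT request from the curl example.
class CreateTableRequest {
    // JSON table description, copied from the curl example.
    static String tableJson() {
        return "{ "
                + "\"comment\": \"Test table\", "
                + "\"columns\": [ "
                + "{ \"name\": \"id\", \"type\": \"bigint\" }, "
                + "{ \"name\": \"price\", \"type\": \"float\", \"comment\": \"The unit price\" } ], "
                + "\"partitionedBy\": [ { \"name\": \"country\", \"type\": \"string\" } ], "
                + "\"format\": { \"storedAs\": \"rcfile\" } "
                + "}";
    }

    // Builds (but does not send) the PUT request for default.test_table.
    static HttpRequest build(String user) {
        URI uri = URI.create(
                "http://tb080:50111/templeton/v1/ddl/database/default/table/test_table?user.name=" + user);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(tableJson()))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("nexr");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send it, the request could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, which needs a reachable Templeton server.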
  • Future of Templeton
    • webhcat
    • Java API based on REST API
    • Integrate or replace existing web interfaces, e.g., WebHDFS
  • References
    • Apache HCatalog (Incubating), http://incubator.apache.org/hcatalog/
    • HCatalog, http://www.slideshare.net/ydn/jan-2012-hug-hcatalog
    • Future of HCatalog, http://www.slideshare.net/hortonworks/future-of-hcatalog-hadoop-summit-2012
    • Introduction to HCatalog, http://geekdani.wordpress.com/2012/07/11/introduction-to-hcatalog/
    • Installing HCatalog and integrating Hive & Pig schemas with HCatalog, http://mixellaneous.tistory.com/1123