
HCatalog & Templeton

An introduction to HCatalog & Templeton, with examples

Published in: Technology, Design
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,490
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
57
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "HCatalog & Templeton"

  1. HCatalog & Templeton
     Youngwoo Kim (brandon.kim@nexr.com, kt.com)
     Daegeun Kim (dani.kim@geekple.com)
     Data Analysis Platform, KTCloudware (NexR)
     Wednesday, July 18, 2012
  2. HCatalog
  3. Hadoop Ecosystem (many data processing tools)
     [Diagram. Labels: MapReduce, Hive, Pig; LoadFunc, StoreFunc; Metastore, SerDe, SerDe, RDBMS; InputFormat / OutputFormat / ...; Filesystem]
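  4. Problems
     • No metastore exists outside of Hive
     • Integrating several tools on a single cluster is not easy
     • Communication costs arise every time
     • Where? How? What?
     • M/R and Pig users have a lot of information to remember
     • Changes to the schema, data path, or format have a large impact on M/R and Pig jobs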
  5. HCatalog
     • Apache Incubator project
     • Built on the Hive metastore
     • Gives M/R and Pig users a programming interface for reading and writing data
     • Provides every DDL command that does not require a MapReduce job (CLI commands)
     • Excludes import/export, CREATE TABLE AS SELECT, and similar
     • Provides data exploration features
     • SHOW TABLES, DESCRIBE
     • http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html
     • Developed by Hortonworks, Yahoo, Twitter, and others
  6. Table abstraction
     • Metadata
     • Data location, schema, compression, partitions, format, etc.
     • Data is abstracted through HCatalog
     • Metadata is managed in one place, which makes that role all the more important
     • Supported column types: primitives, map, list, struct
  7. HCatalog
     [Diagram. Labels: MapReduce, Hive, Pig; HCatInputFormat, HCatOutputFormat; HCatLoader, HCatStorer; Metastore, SerDe, SerDe, RDBMS; InputFormat, OutputFormat; Filesystem]
  8. Data types: Pig
     HCatalog (= Hive)                               Pig
     primitives (int, long, float, double, string)   int, long, float, double, chararray
     map                                             map (contains key and value pairs)
     list                                            bag (contains a list of elements of the same data type)
     struct                                          tuple (contains elements of different data types)
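The correspondence on the data types slide can be written down as a small lookup table. The sketch below is illustrative Python, not part of HCatalog or Pig: `HCAT_TO_PIG` and `pig_schema` are hypothetical names, and only the type names listed on the slide are covered.

```python
# Type correspondence as listed on the slide (HCatalog/Hive type -> Pig type).
HCAT_TO_PIG = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "chararray",
    "map": "map",
    "list": "bag",
    "struct": "tuple",
}


def pig_schema(columns):
    """Render (name, hcat_type) pairs as a Pig-style 'name:type' list.

    Hypothetical helper for illustration only. Complex types are rendered
    by their bare Pig type name, without any inner schema.
    """
    return ", ".join(
        f"{name}:{HCAT_TO_PIG[hcat_type]}" for name, hcat_type in columns
    )


if __name__ == "__main__":
    # The rawevents columns used in the examples later in the deck.
    print(pig_schema([("url", "string"), ("user", "string")]))
```

A dictionary keeps the mapping easy to audit against the slide; a type that is not on the slide raises `KeyError` instead of being guessed.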
  9. Examples
  10. DDL
      $HCAT_HOME/bin/hcat -e "
        drop table if exists rawevents;
        create external table rawevents (
          url string, user string
        ) partitioned by (ds string)
      "
      $HIVE_HOME/bin/hive -e "
        LOAD DATA LOCAL INPATH '...' OVERWRITE INTO TABLE rawevents
        PARTITION (ds='20120530')
      "
  11. Pig
      raw = LOAD '/data/rawevents/20120530' AS (url, user);
      botless = FILTER raw BY myudfs.NotABot(user);
      grpd = GROUP botless BY (url, user);
      cntd = FOREACH grpd GENERATE FLATTEN(group), COUNT(botless);
      STORE cntd INTO '/data/counted/20120530';
      Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012, page 8
  12. Pig + HCatalog
      Load
        Pig:            raw = LOAD '/data/rawevents/20120530' AS (url, user);
        Pig + HCatalog: raw = LOAD 'rawevents' USING org.apache.hcatalog.pig.HCatLoader();
      Partition filter (replaces the dated path in LOAD '/data/rawevents/20120530')
        Pig + HCatalog: raw_0530 = FILTER raw BY ds == '20120530';
      Store
        Pig:            STORE cntd INTO '/data/counted/20120530';
        Pig + HCatalog: STORE cntd INTO 'counted' USING org.apache.hcatalog.pig.HCatStorer();
      Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012, page 8
  13. MapReduce
      • Use the HCatInputFormat and HCatOutputFormat classes
      • The value class is HCatRecord by default
      • The key is not used
      • Set OutputValueClass to HCatRecord
      • As always, a Reducer is optional, not required
      • Partitions can be controlled
      • Easy to control through the schema
  14. MapReduce - Job
      Job job = new Job(getConf());
      job.setJarByClass(HCatMRTest.class);
      job.setJobName("HCatMRTest");
      job.setOutputKeyClass(WritableComparable.class);
      job.setOutputValueClass(HCatRecord.class);
      job.setMapperClass(HCatMRTest.Map.class);
      job.setInputFormatClass(HCatInputFormat.class);
      job.setOutputFormatClass(HCatOutputFormat.class);
      job.setNumReduceTasks(0);  // map-only job
  15. MapReduce - DB, TBL, Partition
      java.util.Map<String, String> partition = ...
      partition.put("ds", "20120530");
      // Input: database, table, partition filter
      InputJobInfo in = InputJobInfo.create("DB", "rawevents", "ds=20120530");
      // Output: database, table, partition values
      OutputJobInfo out = OutputJobInfo.create("DB", "counted", partition);
      HCatInputFormat.setInput(job, in);
      HCatOutputFormat.setOutput(job, out);
      // Reuse the output table's schema as the job's output schema
      HCatSchema s = HCatOutputFormat.getTableSchema(job);
      HCatOutputFormat.setSchema(job, s);
  16. MapReduce - HCatRecord
      • The class used for each record
      • boolean, byte, short, integer, long, float, double, string, list, struct, map
      • tinyint : HCatRecord.getByte
      • smallint : HCatRecord.getShort
      • Fields can be accessed by index or by column name
      • Access by column name requires the HCatSchema
      • Reserve enough space in the record for the partition columns
  17. MapReduce - HCatRecord
      How to obtain the table schema:
        HCatSchema in = HCatInputFormat.getTableSchema(context);
        HCatSchema out = HCatOutputFormat.getTableSchema(context);
        // HCatRecord is abstract; DefaultHCatRecord is the concrete class
        HCatRecord record = new DefaultHCatRecord(3);
        record.set("url", out, value.get("url", in));
        context.write(null, record);
      The schema information is recorded (encoded) in job.xml:
        * mapreduce.lib.hcat.job.info
        * mapreduce.lib.hcatoutput.info
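The two HCatRecord slides say that a field can be reached by index or by column name, and that name-based access needs the schema. The toy model below sketches why: the schema is what turns a name into a position. It is plain Python written for illustration, not the HCatalog API; `ToySchema`, `ToyRecord`, and `copy_url` are hypothetical names.

```python
class ToySchema:
    """Maps column names to positions (the role HCatSchema plays on the slides)."""

    def __init__(self, column_names):
        self.column_names = list(column_names)
        self._position = {name: i for i, name in enumerate(self.column_names)}

    def position(self, name):
        return self._position[name]

    def size(self):
        return len(self.column_names)


class ToyRecord:
    """Fixed-size positional record (the role HCatRecord plays on the slides)."""

    def __init__(self, size):
        # Slots are allocated up front, including those for partition columns.
        self._values = [None] * size

    def get(self, index):
        return self._values[index]

    def set(self, index, value):
        self._values[index] = value

    def get_by_name(self, name, schema):
        return self._values[schema.position(name)]

    def set_by_name(self, name, schema, value):
        self._values[schema.position(name)] = value

    def values(self):
        return list(self._values)


def copy_url(value, in_schema, out_schema):
    """Copy the 'url' field from an input record into a new output record."""
    record = ToyRecord(out_schema.size())
    record.set_by_name("url", out_schema, value.get_by_name("url", in_schema))
    return record


if __name__ == "__main__":
    in_schema = ToySchema(["url", "user", "ds"])
    out_schema = ToySchema(["user", "url", "ds"])  # different column order
    value = ToyRecord(in_schema.size())
    value.set(0, "http://example.com/")
    print(copy_url(value, in_schema, out_schema).values())
```

Because the input and output schemas are consulted separately, the copy still lands in the right slot when the two tables order their columns differently.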
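  18. Conclusions
      • Metadata management becomes easier even when only Pig and MR are used
      • The benefit is greatest when a variety of tools is used together
      • Contributions are arriving quickly, so more can be expected in the future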
  19. Templeton
  20. [No text on this slide]
  21. The Templeton project is named after a character in the award-winning children's novel Charlotte's Web, by E. B. White. The novel's protagonist is a pig named Wilbur. Templeton is a rat who helps Wilbur by running errands and making deliveries.
  22. Templeton
      • HCatalog integration
      • Thrift
      • Java API (HCATALOG-419)
      • REST API
      • Web services interface for HCatalog access and for Pig, Hive, and MR job execution
      • http://github.com/hortonworks/templeton
      • HCATALOG-182
      • a.k.a. 'webhcat'
  23. Getting started
      • Install
        ◦ Requirements
          ■ Hadoop 0.20.205 or Hadoop 1.x
          ■ ZooKeeper
          ■ HCatalog
          ■ Hadoop Distributed Cache
          ■ To use the Hive, Pig, or hadoop/streaming resources
      • Configuration
        ◦ templeton-site.xml
      • Security
        ◦ Default security (without additional authentication)
        ◦ Authentication via Kerberos
  24. Templeton Resources
      :version    Returns a list of supported response types.
      status      Returns the Templeton server status.
      version     Returns a list of supported versions and the current version.
  25. Templeton Resources (2)
      ddl                                                        Performs an HCatalog DDL command.
      ddl/database                                               List HCatalog databases.
      ddl/database/:db (GET)                                     Describe an HCatalog database.
      ddl/database/:db (PUT)                                     Create an HCatalog database.
      ddl/database/:db (DELETE)                                  Delete (drop) an HCatalog database.
      ddl/database/:db/table                                     List the tables in an HCatalog database.
      ddl/database/:db/table/:table (GET)                        Describe an HCatalog table.
      ddl/database/:db/table/:table (POST)                       Rename an HCatalog table.
      ddl/database/:db/table/:table/partition                    List all partitions in an HCatalog table.
      ddl/database/:db/table/:table/partition/:partition (GET)   Describe a single partition in an HCatalog table.
      ddl/database/:db/table/:table/partition/:partition (PUT)   ......
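The DDL resources above follow a regular nesting (database, then table, then partition), so their URLs can be assembled mechanically. The helper below is a minimal sketch under stated assumptions: `ddl_resource_url` is a hypothetical name, the `/templeton/v1` prefix and the `user.name` query parameter are taken from the curl examples later in the deck, and the path segment is spelled `partition`.

```python
from urllib.parse import quote, urlencode


def ddl_resource_url(base, db=None, table=None, partition=None, user=None):
    """Build a URL under /templeton/v1/ddl/database (hypothetical helper)."""
    if table is not None and db is None:
        raise ValueError("a table requires a database")
    if partition is not None and table is None:
        raise ValueError("a partition requires a table")

    parts = ["templeton", "v1", "ddl", "database"]
    if db is not None:
        parts.append(db)
    if table is not None:
        parts += ["table", table]
    if partition is not None:
        parts += ["partition", partition]

    # Percent-encode each path segment separately so "/" in a name is escaped.
    path = "/".join(quote(part, safe="") for part in parts)
    url = f"{base.rstrip('/')}/{path}"
    if user is not None:
        url += "?" + urlencode({"user.name": user})
    return url


if __name__ == "__main__":
    # Host and user are the ones that appear in the deck's curl examples.
    print(ddl_resource_url("http://tb080:50111", db="default", table="emp", user="nexr"))
```

The function only builds the string. Choosing the HTTP method (GET, PUT, POST, DELETE) is left to the caller, since the same URL means different things per method in the table above.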
  26. Templeton Resources (3)
      mapreduce/streaming      Creates and queues Hadoop streaming MapReduce jobs.
      mapreduce/jar            Creates and queues standard Hadoop MapReduce jobs.
      pig                      Creates and queues Pig jobs.
      hive                     Runs Hive queries and commands.
      queue                    Returns a list of all job IDs registered for the user.
      queue/:jobid (GET)       Returns the status of a job given its ID.
      queue/:jobid (DELETE)    Kills a job given its ID.
  27. Examples
      $ curl -s http://tb080:50111/templeton/v1/status
      {"status":"ok","version":"v1"}

      $ curl -s -d user.name=nexr -d exec='show tables;' http://tb080:50111/templeton/v1/ddl
      {
        "stdout": "emp\nname\nname_a29\n",
        "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... //[jar:file:/home/nexr/nexr_platforms/hadoop/hadoop-1.0.3/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]\nSLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.\nOK\nTime taken: 0.491 seconds\n",
        "exitcode": 0
      }
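The `ddl` response above wraps the command's output in JSON, with the table names newline-separated inside `stdout`. A client might unpack it roughly as follows. This is a sketch in Python: `tables_from_ddl_response` is a hypothetical name, and `SAMPLE` is the response from the slide with `stderr` shortened to an empty string.

```python
import json


def tables_from_ddl_response(body):
    """Extract table names from the JSON reply to exec='show tables;'."""
    reply = json.loads(body)
    if reply.get("exitcode") != 0:
        raise RuntimeError(f"DDL command failed: {reply.get('stderr', '')}")
    # stdout holds one table name per line.
    return [line.strip() for line in reply["stdout"].splitlines() if line.strip()]


# Response body from the slide, with stderr shortened (an assumption for brevity).
SAMPLE = '{"stdout": "emp\\nname\\nname_a29\\n", "stderr": "", "exitcode": 0}'

if __name__ == "__main__":
    print(tables_from_ddl_response(SAMPLE))
```

Checking `exitcode` before reading `stdout` matters here because, as the slide shows, `stderr` carries warnings even on success and so cannot be used as the failure signal.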
  28. Examples
      $ curl -s 'http://tb080:50111/templeton/v1/ddl/database/default/table/emp?user.name=nexr'
      {
        "statement": "use default; desc emp; ",
        "error": "...",
        "exec": {
          "stdout": "{\"columns\":[{\"name\":\"empno\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"deptno\",\"type\":\"int\"}]}\t \t \n",
          "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... explanation.\nOK\nTime taken: 0.324 seconds\nOK\nTime taken: 0.398 seconds\n",
          "exitcode": 0
        }
      }
  29. Examples
      $ curl -s -X PUT -H 'Content-type:application/json' -d '{
          "comment": "Test table",
          "columns": [
            { "name": "id", "type": "bigint" },
            { "name": "price", "type": "float", "comment": "The unit price" } ],
          "partitionedBy": [
            { "name": "country", "type": "string" } ],
          "format": { "storedAs": "rcfile" } }' \
        'http://tb080:50111/templeton/v1/ddl/database/default/table/test_table?user.name=nexr'

      hive> show tables;
      OK
      emp
      test_table
      Time taken: 0.477 seconds
      hive> describe extended test_table;
      OK
      id bigint
      price float The unit price
      country string
      Detailed Table Information Table(tableName:test_table, dbName:default, owner:nexr, createTime:1342578059, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:bigint, comment:null), FieldSchema(name:price, type:float, comment:The unit price), FieldSchema(name:country, type:string,
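The JSON body sent with the PUT request above can be generated rather than typed by hand, which avoids shell quoting mistakes. The sketch below uses only the keys that appear on the slide (`comment`, `columns`, `partitionedBy`, `format.storedAs`); `create_table_body` is a hypothetical helper, and whether the service accepts other keys is not covered here.

```python
import json


def create_table_body(columns, partitioned_by=(), stored_as=None, comment=None):
    """Serialize a table definition using the keys shown on the slide."""
    body = {}
    if comment is not None:
        body["comment"] = comment
    body["columns"] = [dict(column) for column in columns]
    if partitioned_by:
        body["partitionedBy"] = [dict(column) for column in partitioned_by]
    if stored_as is not None:
        body["format"] = {"storedAs": stored_as}
    return json.dumps(body, indent=2)


if __name__ == "__main__":
    # Same definition as the test_table example on the slide.
    print(create_table_body(
        columns=[
            {"name": "id", "type": "bigint"},
            {"name": "price", "type": "float", "comment": "The unit price"},
        ],
        partitioned_by=[{"name": "country", "type": "string"}],
        stored_as="rcfile",
        comment="Test table",
    ))
```

The printed string can be passed to curl with `-d @file` or sent by any HTTP client, together with the `Content-type:application/json` header used on the slide.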
  30. Future of Templeton
      • webhcat
      • Java API based on the REST API
      • Integrate or replace existing web interfaces, e.g., WebHDFS
  31. References
      • Apache HCatalog (Incubating), http://incubator.apache.org/hcatalog/
      • HCatalog, http://www.slideshare.net/ydn/jan-2012-hug-hcatalog
      • Future of HCatalog, http://www.slideshare.net/hortonworks/future-of-hcatalog-hadoop-summit-2012
      • Introduction to HCatalog, http://geekdani.wordpress.com/2012/07/11/introduction-to-hcatalog/
      • Installing HCatalog and integrating Hive & Pig schemas using HCatalog, http://mixellaneous.tistory.com/1123