Building Hadoop Data Applications with Kite

By Tom White, Software Engineer at Cloudera

Video at: https://www.youtube.com/watch?v=ibgoMdca5mQ&list=PL5OOLwV_m9vaoNt0wM9BVjd_gWyseq0IR&index=1

  1. Building Hadoop Data Applications with Kite
     Tom White @tom_e_white
     Hadoop Users Group UK, London
     17 June 2014
  2. About me
     • Engineer at Cloudera working on Core Hadoop and Kite
     • Apache Hadoop Committer, PMC Member, Apache Member
     • Author of "Hadoop: The Definitive Guide"
  3. Hadoop 0.1
     % cat bigdata.txt | hadoop fs -put - in
     % hadoop MyJob in out
     % hadoop fs -get out
  4. Characteristics
     • Batch applications only
     • Low-level coding
     • File format
     • Serialization
     • Partitioning scheme
  5. A Hadoop stack
  6. Common Data, Many Tools
     # tools >> # file formats >> # file systems
  7. Glossary
     • Apache Avro – cross-language data serialization library
     • Apache Parquet (incubating) – column-oriented storage format for nested data
     • Apache Hive – data warehouse (SQL and metastore)
     • Apache Flume – streaming log capture and delivery system
     • Apache Oozie – workflow scheduler system
     • Apache Crunch – Java API for writing data pipelines
     • Impala – interactive SQL on Hadoop
  8. Outline
     • A Typical Application
     • Kite SDK
     • An Example
     • Advanced Kite
  9. A typical application (zoom 100:1)
  10. A typical application (zoom 10:1)
  11. A typical pipeline (zoom 5:1)
  12. Kite SDK
  13. Kite Codifies Best Practice as APIs, Tools, Docs and Examples
  14. Kite
     • A client-side library for writing Hadoop Data Applications
     • First release was in April 2013 as CDK
     • 0.14.1 released last month
     • Open source, Apache 2 license, kitesdk.org
     • Modular
       • Data module (HDFS, Flume, Crunch, Hive, HBase)
       • Morphlines transformation module
       • Maven plugin
  15. An Example
  16. Kite Data Module
     • Dataset – a collection of entities
     • DatasetRepository – physical storage location for datasets
     • DatasetDescriptor – holds dataset metadata (schema, format)
     • DatasetWriter – write entities to a dataset in a stream
     • DatasetReader – read entities from a dataset (see the sketch after this slide)
     • http://kitesdk.org/docs/current/apidocs/index.html
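     The writer and reader classes can also be used directly, without the Flume path shown later. A minimal sketch of the write/read round trip, assuming the "events" dataset, repo, and Event entity defined on the following slides; error handling is omitted:

     // Load the dataset and stream entities in and out.
     Dataset<Event> events = repo.load("events");

     DatasetWriter<Event> writer = events.newWriter();
     writer.open();
     writer.write(new Event());   // populate fields as on slide 21
     writer.close();

     DatasetReader<Event> reader = events.newReader();
     reader.open();
     for (Event e : reader) {     // entities are read back as a stream
       System.out.println(e);
     }
     reader.close();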
  17. 1. Define the Event Entity
     public class Event {
       private long id;
       private long timestamp;
       private String source;
       // getters and setters
     }
  18. 2. Create the Events Dataset
     DatasetRepository repo =
         DatasetRepositories.open("repo:hive");
     DatasetDescriptor descriptor =
         new DatasetDescriptor.Builder()
             .schema(Event.class).build();
     repo.create("events", descriptor);
  19. (2. or with the Maven plugin)
     $ mvn kite:create-dataset \
         -Dkite.repositoryUri='repo:hive' \
         -Dkite.datasetName=events \
         -Dkite.avroSchemaReflectClass=com.example.Event
  20. A peek at the Avro schema
     $ hive -e "DESCRIBE EXTENDED events"
     ...
     {
       "type" : "record",
       "name" : "Event",
       "namespace" : "com.example",
       "fields" : [
         { "name" : "id", "type" : "long" },
         { "name" : "timestamp", "type" : "long" },
         { "name" : "source", "type" : "string" }
       ]
     }
  21. 3. Write Events
     Logger logger = Logger.getLogger(...);
     Event event = new Event();
     event.setId(id);
     event.setTimestamp(System.currentTimeMillis());
     event.setSource(source);
     logger.info(event);
  22. Log4j configuration
     log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
     log4j.appender.flume.Hostname = localhost
     log4j.appender.flume.Port = 41415
     log4j.appender.flume.DatasetRepositoryUri = repo:hive
     log4j.appender.flume.DatasetName = events
  23. The resulting file layout
     /user
       /hive
         /warehouse
           /events
             /FlumeData.1375659013795   (Avro files)
             /FlumeData.1375659013796
  24. 4. Generate Summaries with Crunch
     PCollection<Event> events =
         read(asSource(repo.load("events"), Event.class));
     PCollection<Summary> summaries = events
         .by(new GetTimeBucket(), // minute of day, source
             Avros.pairs(Avros.longs(), Avros.strings()))
         .groupByKey()
         .parallelDo(new MakeSummary(),
             Avros.reflects(Summary.class));
     write(summaries, asTarget(repo.load("summaries")));
  25. … and run using Maven
     $ mvn kite:create-dataset -Dkite.datasetName=summaries ...
     <plugin>
       <groupId>org.kitesdk</groupId>
       <artifactId>kite-maven-plugin</artifactId>
       <configuration>
         <toolClass>com.example.GenerateSummaries</toolClass>
       </configuration>
     </plugin>
     $ mvn kite:run-tool
  26. 5. Query with Impala
     $ impala-shell -q 'DESCRIBE events'
     +-----------+--------+-------------------+
     | name      | type   | comment           |
     +-----------+--------+-------------------+
     | id        | bigint | from deserializer |
     | timestamp | bigint | from deserializer |
     | source    | string | from deserializer |
     +-----------+--------+-------------------+
  27. … Ad Hoc Queries
     $ impala-shell -q 'SELECT source, COUNT(1) AS cnt FROM events GROUP BY source'
     +--------------------------------------+-----+
     | source                               | cnt |
     +--------------------------------------+-----+
     | 018dc1b6-e6b0-489e-bce3-115917e00632 | 38  |
     | bc80040e-09d1-4ad2-8bd8-82afd1b8431a | 85  |
     +--------------------------------------+-----+
     Returned 2 row(s) in 0.56s
  28. Advanced Kite
  29. Unified Storage Interface
     • Dataset – streaming access, HDFS storage
     • RandomAccessDataset – random access, HBase storage
     • PartitionStrategy – defines how to map an entity to partitions in HDFS or row keys in HBase
  30. Filesystem Partitions
     PartitionStrategy p = new PartitionStrategy.Builder()
         .year("timestamp")
         .month("timestamp")
         .day("timestamp").build();

     /user/hive/warehouse/events
       /year=2014/month=02/day=08
         /FlumeData.1375659013795
         /FlumeData.1375659013796
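     The strategy takes effect when it is attached to the dataset's descriptor. A brief sketch, assuming the partitionStrategy(...) method on DatasetDescriptor.Builder in the Kite data API:

     // Attach the partition strategy at dataset-creation time, so
     // writes land in the year/month/day directories shown above.
     DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
         .schema(Event.class)
         .partitionStrategy(p)
         .build();
     repo.create("events", descriptor);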
  31. HBase Keys: Defined in Avro
     {
       "name": "username",
       "type": "string",
       "mapping": { "type": "key", "value": "0" }
     },
     {
       "name": "favoriteColor",
       "type": "string",
       "mapping": { "type": "column", "value": "meta:fc" }
     }
  32. Random Access Dataset: Creation
     RandomAccessDatasetRepository repo =
         DatasetRepositories.openRandomAccess(
             "repo:hbase:localhost");
     RandomAccessDataset<User> users = repo.load("users");
     users.put(new User("bill", "green"));
     users.put(new User("alice", "blue"));
  33. Random Access Dataset: Retrieval
     Key key = new Key.Builder(users)
         .add("username", "bill").build();
     User bill = users.get(key);
  34. Views
     View<User> view = users.from("username", "bill");
     DatasetReader<User> reader = view.newReader();
     reader.open();
     for (User user : reader) {
       System.out.println(user);
     }
     reader.close();
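     Refinements compose, so a view can be bounded at both ends. A hedged sketch, assuming the to(...) method that pairs with from(...) in Kite's view API:

     // Assumed API: from(...)/to(...) give an inclusive key range,
     // here all users with usernames between "alice" and "bill".
     View<User> range = users.from("username", "alice")
                             .to("username", "bill");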
  35. Parallel Processing
     • Goal is for Hadoop processing frameworks to "just work"
     • Support Formats, Partitions, Views
     • Native Kite components, e.g. DatasetOutputFormat for MR (see the sketch after the table)

                  HDFS Dataset   HBase Dataset
     Crunch       Yes            Yes
     MapReduce    Yes            Yes
     Hive         Yes            Planned
     Impala       Yes            Planned
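     A sketch of that MapReduce wiring. The configure/readFrom/writeTo calls follow the kite-data-mapreduce module's DatasetKeyInputFormat and DatasetKeyOutputFormat; the slide's DatasetOutputFormat is the same idea under an earlier name, so treat the exact class names as approximate for this Kite version. SummaryMapper is a hypothetical job class:

     // Point a MapReduce job at Kite datasets for input and output.
     Job job = Job.getInstance(new Configuration(), "generate-summaries");
     DatasetKeyInputFormat.configure(job).readFrom(repo.load("events"));
     DatasetKeyOutputFormat.configure(job).writeTo(repo.load("summaries"));
     job.setMapperClass(SummaryMapper.class);  // hypothetical mapper
     job.waitForCompletion(true);              // remaining job setup omitted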
  36. Schema Evolution
     public class Event {
       private long id;
       private long timestamp;
       private String source;
       @Nullable private String ipAddress;
     }

     $ mvn kite:update-dataset \
         -Dkite.datasetName=events \
         -Dkite.avroSchemaReflectClass=com.example.Event
  37. Searchable Datasets
     • Use Flume Solr Sink (in addition to HDFS Sink) – a sample sink configuration follows
     • Morphlines library to define fields to index
     • SolrCloud runs on the cluster from indexes in HDFS
     • Future support in Kite to index selected fields automatically
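     A sketch of the Flume side of that setup. The sink type and the morphlineFile/morphlineId properties belong to Flume's MorphlineSolrSink; the agent, channel, and file path names are illustrative:

     # Fan events out to the Solr sink alongside the HDFS sink.
     agent.sinks = hdfsSink solrSink
     agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
     agent.sinks.solrSink.channel = solrChannel
     agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
     agent.sinks.solrSink.morphlineId = morphline1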
  38. Conclusion
  39. Kite makes it easy to get data into Hadoop, with a flexible schema model that is storage agnostic, in a format that can be processed with a wide range of Hadoop tools
  40. Getting Started With Kite
     • Examples at github.com/kite-sdk/kite-examples
       • Working with streaming and random-access datasets
       • Logging events to datasets from a webapp
       • Running a periodic job
       • Migrating data from CSV to a Kite dataset
       • Converting an Avro dataset to a Parquet dataset
       • Writing and configuring Morphlines
       • Using Morphlines to write JSON records to a dataset
  41. Questions?
     kitesdk.org
     @tom_e_white
     tom@cloudera.com
  42.
  43. Applications
     • [Batch] Analyze an archive of songs [1]
     • [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
     • [Search] Searching email traffic in near-real-time [3]
     • [ML] Detecting fraudulent transactions using clustering [4]

     [1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
     [2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
     [3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
     [4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
  44. … or use JDBC
     Class.forName("org.apache.hive.jdbc.HiveDriver");
     Connection connection = DriverManager.getConnection(
         "jdbc:hive2://localhost:21050/;auth=noSasl");
     Statement statement = connection.createStatement();
     ResultSet resultSet = statement.executeQuery(
         "SELECT * FROM summaries");
  45. Apps
     • App – a packaged Java program that runs on a Hadoop cluster
     • cdk:package-app – create a package on the local filesystem
       • like an exploded WAR
       • Oozie format
     • cdk:deploy-app – copy the packaged app to HDFS
     • cdk:run-app – execute the app (invocation sketch below)
       • Workflow app – runs once
       • Coordinator app – runs other apps (like cron)
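     Taken together, the lifecycle is three Maven goals. A sketch of invoking them, assuming the cdk goal names listed above are exposed by the CDK Maven plugin:

     $ mvn cdk:package-app   # build the Oozie-format package locally
     $ mvn cdk:deploy-app    # copy the packaged app to HDFS
     $ mvn cdk:run-app       # submit the app to the cluster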
  46. Morphlines Example
     morphlines : [
       {
         id : morphline1
         importCommands : ["com.cloudera.**", "org.apache.solr.**"]
         commands : [
           { readLine {} }
           {
             grok {
               dictionaryFiles : [/tmp/grok-dictionaries]
               expressions : {
                 message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
               }
             }
           }
           { loadSolr {} }
         ]
       }
     ]

     Example input:
     <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

     Output record:
     syslog_pri: 164
     syslog_timestamp: Feb  4 10:46:14
     syslog_hostname: syslog
     syslog_program: sshd
     syslog_pid: 607
     syslog_message: listening on 0.0.0.0 port 22