Building Hadoop Data Applications with Kite
By Tom White, Software Engineer at Cloudera

Video at: https://www.youtube.com/watch?v=ibgoMdca5mQ&list=PL5OOLwV_m9vaoNt0wM9BVjd_gWyseq0IR&index=1

Presentation Transcript

• Building Hadoop Data Applications with Kite
  Tom White @tom_e_white
  Hadoop Users Group UK, London, 17 June 2014
• About me
  • Engineer at Cloudera working on Core Hadoop and Kite
  • Apache Hadoop Committer, PMC Member, Apache Member
  • Author of "Hadoop: The Definitive Guide"
• Hadoop 0.1
  % cat bigdata.txt | hadoop fs -put - in
  % hadoop MyJob in out
  % hadoop fs -get out
• Characteristics
  • Batch applications only
  • Low-level coding
  • File format
  • Serialization
  • Partitioning scheme
• A Hadoop stack (diagram)
• Common Data, Many Tools
  # tools >> # file formats >> # file systems
• Glossary
  • Apache Avro – cross-language data serialization library
  • Apache Parquet (incubating) – column-oriented storage format for nested data
  • Apache Hive – data warehouse (SQL and metastore)
  • Apache Flume – streaming log capture and delivery system
  • Apache Oozie – workflow scheduler system
  • Apache Crunch – Java API for writing data pipelines
  • Impala – interactive SQL on Hadoop
• Outline
  • A Typical Application
  • Kite SDK
  • An Example
  • Advanced Kite
• A typical application (zoom 100:1) (diagram)
• A typical application (zoom 10:1) (diagram)
• A typical pipeline (zoom 5:1) (diagram)
• Kite SDK
• Kite Codifies Best Practice as APIs, Tools, Docs and Examples
• Kite
  • A client-side library for writing Hadoop Data Applications
  • First release was in April 2013 as CDK
  • 0.14.1 last month
  • Open source, Apache 2 license, kitesdk.org
  • Modular
    • Data module (HDFS, Flume, Crunch, Hive, HBase)
    • Morphlines transformation module
    • Maven plugin
• An Example
• Kite Data Module
  • Dataset – a collection of entities
  • DatasetRepository – physical storage location for datasets
  • DatasetDescriptor – holds dataset metadata (schema, format)
  • DatasetWriter – write entities to a dataset in a stream
  • DatasetReader – read entities from a dataset
  • http://kitesdk.org/docs/current/apidocs/index.html
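To make the roles of these interfaces concrete, here is a minimal write/read round trip. This is a sketch only, assuming the 0.14-era newWriter()/newReader() calls with explicit open()/close() (the pattern the Views slide below also uses); the Event class is defined on the next slide.

  import org.kitesdk.data.Dataset;
  import org.kitesdk.data.DatasetReader;
  import org.kitesdk.data.DatasetRepositories;
  import org.kitesdk.data.DatasetRepository;
  import org.kitesdk.data.DatasetWriter;

  public class DatasetRoundTrip {
    public static void main(String[] args) {
      DatasetRepository repo = DatasetRepositories.open("repo:hive");
      Dataset<Event> events = repo.load("events");

      // Stream one entity into the dataset.
      DatasetWriter<Event> writer = events.newWriter();
      writer.open();
      try {
        Event event = new Event();
        event.setId(1L);
        event.setTimestamp(System.currentTimeMillis());
        event.setSource("example-source");
        writer.write(event);
      } finally {
        writer.close();
      }

      // Stream all entities back out.
      DatasetReader<Event> reader = events.newReader();
      reader.open();
      try {
        for (Event e : reader) {
          System.out.println(e.getSource());
        }
      } finally {
        reader.close();
      }
    }
  }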
• 1. Define the Event Entity
  public class Event {
    private long id;
    private long timestamp;
    private String source;
    // getters and setters
  }
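The slide elides the accessors; a plausible complete version, assuming plain JavaBean conventions (which the Avro reflect mapping on the following slides relies on):

  public class Event {
    private long id;
    private long timestamp;
    private String source;

    // JavaBean accessors, assumed rather than shown on the slide.
    public long getId() { return id; }
    public void setId(long id) { this.id = id; }
    public long getTimestamp() { return timestamp; }
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
    public String getSource() { return source; }
    public void setSource(String source) { this.source = source; }
  }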
• 2. Create the Events Dataset
  DatasetRepository repo = DatasetRepositories.open("repo:hive");
  DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
      .schema(Event.class).build();
  repo.create("events", descriptor);
• (2. or with the Maven plugin)
  $ mvn kite:create-dataset \
      -Dkite.repositoryUri='repo:hive' \
      -Dkite.datasetName=events \
      -Dkite.avroSchemaReflectClass=com.example.Event
• A peek at the Avro schema
  $ hive -e "DESCRIBE EXTENDED events"
  ...
  {
    "type" : "record",
    "name" : "Event",
    "namespace" : "com.example",
    "fields" : [
      { "name" : "id", "type" : "long" },
      { "name" : "timestamp", "type" : "long" },
      { "name" : "source", "type" : "string" }
    ]
  }
• 3. Write Events
  Logger logger = Logger.getLogger(...);
  Event event = new Event();
  event.setId(id);
  event.setTimestamp(System.currentTimeMillis());
  event.setSource(source);
  logger.info(event);
• Log4j configuration
  log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
  log4j.appender.flume.Hostname = localhost
  log4j.appender.flume.Port = 41415
  log4j.appender.flume.DatasetRepositoryUri = repo:hive
  log4j.appender.flume.DatasetName = events
• The resulting file layout (Avro files)
  /user
    /hive
      /warehouse
        /events
          /FlumeData.1375659013795
          /FlumeData.1375659013796
• 4. Generate Summaries with Crunch
  PCollection<Event> events =
      read(asSource(repo.load("events"), Event.class));
  PCollection<Summary> summaries = events
      .by(new GetTimeBucket(), // minute of day, source
          Avros.pairs(Avros.longs(), Avros.strings()))
      .groupByKey()
      .parallelDo(new MakeSummary(),
          Avros.reflects(Summary.class));
  write(summaries, asTarget(repo.load("summaries")));
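The deck names GetTimeBucket and MakeSummary without showing them. One plausible shape, sketched here; the Summary class (with bucket, source, and count fields) and the bucketing arithmetic are assumptions, not from the deck:

  import org.apache.crunch.MapFn;
  import org.apache.crunch.Pair;

  // Keys each event by (minute of day, source).
  class GetTimeBucket extends MapFn<Event, Pair<Long, String>> {
    @Override
    public Pair<Long, String> map(Event event) {
      long minuteOfDay = (event.getTimestamp() / 60000L) % (24 * 60);
      return Pair.of(minuteOfDay, event.getSource());
    }
  }

  // Collapses each (bucket, source) group into a single Summary.
  class MakeSummary
      extends MapFn<Pair<Pair<Long, String>, Iterable<Event>>, Summary> {
    @Override
    public Summary map(Pair<Pair<Long, String>, Iterable<Event>> grouped) {
      int count = 0;
      for (Event ignored : grouped.second()) {
        count++;
      }
      Summary summary = new Summary(); // hypothetical entity class
      summary.setBucket(grouped.first().first());
      summary.setSource(grouped.first().second());
      summary.setCount(count);
      return summary;
    }
  }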
• … and run using Maven
  $ mvn kite:create-dataset -Dkite.datasetName=summaries ...
  <plugin>
    <groupId>org.kitesdk</groupId>
    <artifactId>kite-maven-plugin</artifactId>
    <configuration>
      <toolClass>com.example.GenerateSummaries</toolClass>
    </configuration>
  </plugin>
  $ mvn kite:run-tool
• 5. Query with Impala
  $ impala-shell -q 'DESCRIBE events'
  +-----------+--------+-------------------+
  | name      | type   | comment           |
  +-----------+--------+-------------------+
  | id        | bigint | from deserializer |
  | timestamp | bigint | from deserializer |
  | source    | string | from deserializer |
  +-----------+--------+-------------------+
• … Ad Hoc Queries
  $ impala-shell -q 'SELECT source, COUNT(1) AS cnt FROM events GROUP BY source'
  +--------------------------------------+-----+
  | source                               | cnt |
  +--------------------------------------+-----+
  | 018dc1b6-e6b0-489e-bce3-115917e00632 | 38  |
  | bc80040e-09d1-4ad2-8bd8-82afd1b8431a | 85  |
  +--------------------------------------+-----+
  Returned 2 row(s) in 0.56s
• Advanced Kite
• Unified Storage Interface
  • Dataset – streaming access, HDFS storage
  • RandomAccessDataset – random access, HBase storage
  • PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase
• Filesystem Partitions
  PartitionStrategy p = new PartitionStrategy.Builder()
      .year("timestamp")
      .month("timestamp")
      .day("timestamp").build();

  /user/hive/warehouse/events
    /year=2014/month=02/day=08
      /FlumeData.1375659013795
      /FlumeData.1375659013796
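The slide shows the strategy and the resulting layout but not the wiring between them. A minimal sketch, assuming the strategy is attached when the dataset is created via its DatasetDescriptor:

  import org.kitesdk.data.DatasetDescriptor;
  import org.kitesdk.data.DatasetRepositories;
  import org.kitesdk.data.DatasetRepository;
  import org.kitesdk.data.PartitionStrategy;

  public class CreatePartitionedEvents {
    public static void main(String[] args) {
      PartitionStrategy p = new PartitionStrategy.Builder()
          .year("timestamp")
          .month("timestamp")
          .day("timestamp").build();

      // Attach the partition strategy at creation time; writers then
      // route each entity into a partition by its timestamp field.
      DatasetRepository repo = DatasetRepositories.open("repo:hive");
      DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
          .schema(Event.class)
          .partitionStrategy(p)
          .build();
      repo.create("events", descriptor);
    }
  }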
• HBase Keys: Defined in Avro
  {
    "name": "username",
    "type": "string",
    "mapping": { "type": "key", "value": "0" }
  },
  {
    "name": "favoriteColor",
    "type": "string",
    "mapping": { "type": "column", "value": "meta:fc" }
  }
• Random Access Dataset: Creation
  RandomAccessDatasetRepository repo =
      DatasetRepositories.openRandomAccess("repo:hbase:localhost");
  RandomAccessDataset<User> users = repo.load("users");
  users.put(new User("bill", "green"));
  users.put(new User("alice", "blue"));
• Random Access Dataset: Retrieval
  Key key = new Key.Builder(users)
      .add("username", "bill").build();
  User bill = users.get(key);
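The same Key can address point deletes. A one-line sketch; the delete(Key) method is assumed from the random-access API of this era and is not shown in the deck:

  // Hedged sketch: remove the entity addressed by the key built above.
  users.delete(key);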
• Views
  View<User> view = users.from("username", "bill");
  DatasetReader<User> reader = view.newReader();
  reader.open();
  for (User user : reader) {
    System.out.println(user);
  }
  reader.close();
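View refinements compose. A hedged sketch of a bounded range, assuming a to(...) refinement complements the from(...) shown above (bound semantics assumed inclusive here):

  // Users from "alice" up to "bill".
  View<User> range = users
      .from("username", "alice")
      .to("username", "bill");
  DatasetReader<User> rangeReader = range.newReader();
  rangeReader.open();
  try {
    for (User user : rangeReader) {
      System.out.println(user);
    }
  } finally {
    rangeReader.close();
  }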
• Parallel Processing
  • Goal is for Hadoop processing frameworks to "just work"
  • Support Formats, Partitions, Views
  • Native Kite components, e.g. DatasetOutputFormat for MR

               HDFS Dataset   HBase Dataset
  Crunch       Yes            Yes
  MapReduce    Yes            Yes
  Hive         Yes            Planned
  Impala       Yes            Planned
• Schema Evolution
  public class Event {
    private long id;
    private long timestamp;
    private String source;
    @Nullable private String ipAddress;
  }

  $ mvn kite:update-dataset \
      -Dkite.datasetName=events \
      -Dkite.avroSchemaReflectClass=com.example.Event
• Searchable Datasets
  • Use Flume Solr Sink (in addition to HDFS Sink)
  • Morphlines library to define fields to index
  • SolrCloud runs on cluster from indexes in HDFS
  • Future support in Kite to index selected fields automatically
• Conclusion
• Kite makes it easy to get data into Hadoop with a flexible schema model that is storage agnostic, in a format that can be processed with a wide range of Hadoop tools
• Getting Started With Kite
  • Examples at github.com/kite-sdk/kite-examples
    • Working with streaming and random-access datasets
    • Logging events to datasets from a webapp
    • Running a periodic job
    • Migrating data from CSV to a Kite dataset
    • Converting an Avro dataset to a Parquet dataset
    • Writing and configuring Morphlines
    • Using Morphlines to write JSON records to a dataset
• Questions?
  kitesdk.org
  @tom_e_white
  tom@cloudera.com
• Applications
  • [Batch] Analyze an archive of songs [1]
  • [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
  • [Search] Searching email traffic in near-real-time [3]
  • [ML] Detecting fraudulent transactions using clustering [4]

  [1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
  [2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
  [3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
  [4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
• … or use JDBC
  Class.forName("org.apache.hive.jdbc.HiveDriver");
  Connection connection = DriverManager.getConnection(
      "jdbc:hive2://localhost:21050/;auth=noSasl");
  Statement statement = connection.createStatement();
  ResultSet resultSet = statement.executeQuery(
      "SELECT * FROM summaries");
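The slide stops at executeQuery. A minimal, schema-agnostic continuation of the same snippet that prints each row and releases the resources; this is plain JDBC, nothing Kite-specific:

  // Requires: import java.sql.ResultSetMetaData;
  // Print every row, tab-separated, using column metadata so the
  // summaries schema does not need to be known here.
  ResultSetMetaData meta = resultSet.getMetaData();
  while (resultSet.next()) {
    StringBuilder row = new StringBuilder();
    for (int i = 1; i <= meta.getColumnCount(); i++) {
      if (i > 1) {
        row.append('\t');
      }
      row.append(resultSet.getString(i));
    }
    System.out.println(row);
  }
  resultSet.close();
  statement.close();
  connection.close();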
• Apps
  • App – a packaged Java program that runs on a Hadoop cluster
  • cdk:package-app – create a package on the local filesystem
    • like an exploded WAR
    • Oozie format
  • cdk:deploy-app – copy packaged app to HDFS
  • cdk:run-app – execute the app
  • Workflow app – runs once
  • Coordinator app – runs other apps (like cron)
• Morphlines Example
  morphlines : [
    {
      id : morphline1
      importCommands : ["com.cloudera.**", "org.apache.solr.**"]
      commands : [
        { readLine {} }
        {
          grok {
            dictionaryFiles : [/tmp/grok-dictionaries]
            expressions : {
              message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
            }
          }
        }
        { loadSolr {} }
      ]
    }
  ]

  Example input:
  <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

  Output record:
  syslog_pri:164
  syslog_timestamp:Feb  4 10:46:14
  syslog_hostname:syslog
  syslog_program:sshd
  syslog_pid:607
  syslog_message:listening on 0.0.0.0 port 22