Building Hadoop Data Applications with Kite

Tom White  @tom_e_white
Hadoop Users Group UK, London
17 June 2014
About me
•  Engineer at Cloudera working on Core Hadoop and Kite
•  Apache Hadoop Committer, PMC Member, Apache Member
•  Author of “Hadoop: The Definitive Guide”
2
Hadoop 0.1
% cat bigdata.txt | hadoop fs -put - in
% hadoop MyJob in out
% hadoop fs -get out
3
Characteristics
•  Batch applications only
•  Low-level coding
•  File format
•  Serialization
•  Partitioning scheme
4
A Hadoop stack
5
Common Data, Many Tools
# tools  >>  # file formats  >>  # file systems
6
Glossary
•  Apache Avro – cross-language data serialization library
•  Apache Parquet (incubating) – column-oriented storage format for nested data
•  Apache Hive – data warehouse (SQL and metastore)
•  Apache Flume – streaming log capture and delivery system
•  Apache Oozie – workflow scheduler system
•  Apache Crunch – Java API for writing data pipelines
•  Impala – interactive SQL on Hadoop
7
Outline
•  A Typical Application
•  Kite SDK
•  An Example
•  Advanced Kite
8
A typical application (zoom 100:1)
9

A typical application (zoom 10:1)
10

A typical pipeline (zoom 5:1)
11
Kite SDK
12
Kite Codifies Best Practice as APIs, Tools, Docs and Examples
13
Kite
•  A client-side library for writing Hadoop Data Applications
•  First release was in April 2013 as CDK
•  0.14.1 last month
•  Open source, Apache 2 license, kitesdk.org
•  Modular
•  Data module (HDFS, Flume, Crunch, Hive, HBase)
•  Morphlines transformation module
•  Maven plugin
14
An Example
15
Kite Data Module
•  Dataset – a collection of entities
•  DatasetRepository – physical storage location for datasets
•  DatasetDescriptor – holds dataset metadata (schema, format)
•  DatasetWriter – write entities to a dataset in a stream
•  DatasetReader – read entities from a dataset (writer/reader usage sketched below)
•  http://kitesdk.org/docs/current/apidocs/index.html
16
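The writer and reader in the list above are the streaming workhorses of the data module. As a minimal sketch, in the same fragment style as the other slides (imports omitted), here is how they pair up around the repository and Event class used in the example that follows; newWriter/newReader and open/write/close match the 0.14-era API shown on later slides, but treat the exact generics on load() as an assumption.

// Sketch: stream entities into and out of the "events" dataset.
// newWriter()/newReader() and open/write/close follow the 0.14-era API;
// the generic parameter on load() is assumed here for readability.
DatasetRepository repo = DatasetRepositories.open("repo:hive");
Dataset<Event> events = repo.load("events");

DatasetWriter<Event> writer = events.newWriter();
writer.open();
try {
  Event e = new Event();
  e.setId(1L);
  e.setTimestamp(System.currentTimeMillis());
  e.setSource("sketch");
  writer.write(e);            // one entity per call, appended to the stream
} finally {
  writer.close();
}

DatasetReader<Event> reader = events.newReader();
reader.open();
try {
  for (Event e : reader) {    // readers are Iterable over entities
    System.out.println(e);
  }
} finally {
  reader.close();
}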
1. Define the Event Entity
public class Event {
  private long id;
  private long timestamp;
  private String source;
  // getters and setters
}
17
2. Create the Events Dataset
DatasetRepository repo =
    DatasetRepositories.open("repo:hive");
DatasetDescriptor descriptor =
    new DatasetDescriptor.Builder()
        .schema(Event.class).build();
repo.create("events", descriptor);
18
(2. or with the Maven plugin)
$ mvn kite:create-dataset \
    -Dkite.repositoryUri='repo:hive' \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
19
A peek at the Avro schema
$ hive -e "DESCRIBE EXTENDED events"
...
{
  "type" : "record",
  "name" : "Event",
  "namespace" : "com.example",
  "fields" : [
    { "name" : "id", "type" : "long" },
    { "name" : "timestamp", "type" : "long" },
    { "name" : "source", "type" : "string" }
  ]
}
20
3. Write Events
Logger logger = Logger.getLogger(...);
Event event = new Event();
event.setId(id);
event.setTimestamp(System.currentTimeMillis());
event.setSource(source);
logger.info(event);
21
Log4j configuration
log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41415
log4j.appender.flume.DatasetRepositoryUri = repo:hive
log4j.appender.flume.DatasetName = events
22
The resulting file layout
/user
  /hive
    /warehouse
      /events
        /FlumeData.1375659013795
        /FlumeData.1375659013796
(Avro files)
23
4. Generate Summaries with Crunch
PCollection<Event> events =
    read(asSource(repo.load("events"), Event.class));
PCollection<Summary> summaries = events
    .by(new GetTimeBucket(), // minute of day, source
        Avros.pairs(Avros.longs(), Avros.strings()))
    .groupByKey()
    .parallelDo(new MakeSummary(),
        Avros.reflects(Summary.class));
write(summaries, asTarget(repo.load("summaries")));
24
… and run using Maven
$ mvn kite:create-dataset -Dkite.datasetName=summaries ...
<plugin>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-maven-plugin</artifactId>
  <configuration>
    <toolClass>com.example.GenerateSummaries</toolClass>
  </configuration>
</plugin>
$ mvn kite:run-tool
25
5. Query with Impala
$ impala-shell -q 'DESCRIBE events'
+-----------+--------+-------------------+
| name      | type   | comment           |
+-----------+--------+-------------------+
| id        | bigint | from deserializer |
| timestamp | bigint | from deserializer |
| source    | string | from deserializer |
+-----------+--------+-------------------+
26
… Ad Hoc Queries
$ impala-shell -q 'SELECT source, COUNT(1) AS cnt FROM events GROUP BY source'
+--------------------------------------+-----+
| source                               | cnt |
+--------------------------------------+-----+
| 018dc1b6-e6b0-489e-bce3-115917e00632 | 38  |
| bc80040e-09d1-4ad2-8bd8-82afd1b8431a | 85  |
+--------------------------------------+-----+
Returned 2 row(s) in 0.56s
27
Advanced Kite
28
Unified Storage Interface
•  Dataset – streaming access, HDFS storage
•  RandomAccessDataset – random access, HBase storage
•  PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase
29
Filesystem Partitions
PartitionStrategy p = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp").build();

/user/hive/warehouse/events
  /year=2014/month=02/day=08
    /FlumeData.1375659013795
    /FlumeData.1375659013796
30
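One wiring detail the slide leaves implicit: the strategy only takes effect if it is attached to the dataset's descriptor when the dataset is created. A minimal sketch, reusing the builder from slide 18; the partitionStrategy(...) builder method is assumed from the Kite descriptor API of this era and is worth checking against the 0.14 javadoc.

// Sketch: attach the PartitionStrategy above at dataset-creation time.
// partitionStrategy(...) is assumed from the DatasetDescriptor.Builder API.
DatasetDescriptor partitioned = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .partitionStrategy(p)
    .build();
repo.create("events", partitioned);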
HBase Keys: Defined in Avro
{
  "name": "username",
  "type": "string",
  "mapping": { "type": "key", "value": "0" }
},
{
  "name": "favoriteColor",
  "type": "string",
  "mapping": { "type": "column", "value": "meta:fc" }
}
31
Random Access Dataset: Creation
RandomAccessDatasetRepository repo =
    DatasetRepositories.openRandomAccess(
        "repo:hbase:localhost");
RandomAccessDataset<User> users = repo.load("users");
users.put(new User("bill", "green"));
users.put(new User("alice", "blue"));
32
Random Access Dataset: Retrieval
Key key = new Key.Builder(users)
    .add("username", "bill").build();
User bill = users.get(key);
33
Views
View<User> view = users.from("username", "bill");
DatasetReader<User> reader = view.newReader();
reader.open();
for (User user : reader) {
  System.out.println(user);
}
reader.close();
34
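A small robustness note on the loop above: if iteration throws, the reader is never closed. A defensive variant of the same read, using only the calls shown on this slide:

// Same read as above, but the reader is closed even if iteration fails.
DatasetReader<User> reader = view.newReader();
reader.open();
try {
  for (User user : reader) {
    System.out.println(user);
  }
} finally {
  reader.close();
}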
Parallel Processing
•  Goal is for Hadoop processing frameworks to “just work”
•  Support Formats, Partitions, Views
•  Native Kite components, e.g. DatasetOutputFormat for MR

            HDFS Dataset   HBase Dataset
Crunch      Yes            Yes
MapReduce   Yes            Yes
Hive        Yes            Planned
Impala      Yes            Planned
35
Schema Evolution
public class Event {
  private long id;
  private long timestamp;
  private String source;
  @Nullable private String ipAddress;
}
$ mvn kite:update-dataset \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
36
Searchable Datasets
•  Use Flume Solr Sink (in addition to HDFS Sink)
•  Morphlines library to define fields to index
•  SolrCloud runs on cluster from indexes in HDFS
•  Future support in Kite to index selected fields automatically
37
Conclusion
38
Kite makes it easy to get data into Hadoop, with a flexible schema model that is storage agnostic, in a format that can be processed with a wide range of Hadoop tools
39
Getting Started With Kite
•  Examples at github.com/kite-sdk/kite-examples
•  Working with streaming and random-access datasets
•  Logging events to datasets from a webapp
•  Running a periodic job
•  Migrating data from CSV to a Kite dataset
•  Converting an Avro dataset to a Parquet dataset
•  Writing and configuring Morphlines
•  Using Morphlines to write JSON records to a dataset
40
Questions?
kitesdk.org
@tom_e_white
tom@cloudera.com
41
42
Applications
•  [Batch] Analyze an archive of songs [1]
•  [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
•  [Search] Searching email traffic in near-real-time [3]
•  [ML] Detecting fraudulent transactions using clustering [4]
43
[1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
[2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
[3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
[4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
… or use JDBC
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection connection = DriverManager.getConnection(
    "jdbc:hive2://localhost:21050/;auth=noSasl");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
    "SELECT * FROM summaries");
44
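To round out the fragment above, the result set is walked and closed like any other JDBC query; column access by index here is illustrative only, since the Summary columns are not spelled out in the deck.

// Iterate the query result and release JDBC resources.
// getString(1) is illustrative; the Summary column names are not shown here.
while (resultSet.next()) {
  System.out.println(resultSet.getString(1));
}
resultSet.close();
statement.close();
connection.close();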
Apps
•  App – a packaged Java program that runs on a Hadoop cluster
•  cdk:package-app – create a package on the local filesystem
   •  like an exploded WAR
   •  Oozie format
•  cdk:deploy-app – copy packaged app to HDFS
•  cdk:run-app – execute the app
   •  Workflow app – runs once
   •  Coordinator app – runs other apps (like cron)
45
Morphlines Example
46
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]
Example Input
<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record
syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22
