Fast real-time approximations using Spark streaming (huguk)
By Kevin Schmidt (Head of Data Science at Mind Candy)
Luis Vicente (Senior Data Engineer at Mind Candy)
For mobile games, constant tweaks are the difference between success and failure. Data and analytics have to be available in real time, but calculating, for example, the uniqueness or newness of a data point requires a list of previously seen data points, which is both memory-intensive and tricky when using real-time stream processing like Spark Streaming. Probabilistic data structures allow approximation of these properties with a fixed-memory representation, and are very well suited to this kind of stream processing. Getting from the theory of approximation to a useful metric at a low error rate, even for many millions of users, is another story. In our talk we will look at practical ways of achieving this: which approximations we used for a selection of useful metrics, why we picked a specific probabilistic data structure, how we stored it in Cassandra as a time series, and how we implemented it in Spark Streaming.
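To make the fixed-memory idea concrete, here is a minimal, self-contained HyperLogLog-style counter in Java. It is an illustrative toy (it hashes via hashCode() and omits the small- and large-range corrections of real implementations), not the structure Mind Candy used, but it shows the key property: distinct-count estimates from about 1 KB of fixed state, no matter how many events are offered.

public class ToyHyperLogLog {
    private static final int B = 10;            // 2^10 = 1024 registers (~1 KB)
    private static final int M = 1 << B;
    private final byte[] registers = new byte[M];

    public void offer(Object item) {
        long h = mix(item.hashCode());          // toy: only 32 bits of input entropy
        int idx = (int) (h >>> (64 - B));       // top B bits choose a register
        int rank = Long.numberOfLeadingZeros(h << B) + 1; // first 1-bit in the rest
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    public long cardinality() {
        double sum = 0.0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
        }
        double alpha = 0.7213 / (1 + 1.079 / M); // standard HLL bias constant
        // Real implementations add small/large-range corrections here.
        return Math.round(alpha * M * M / sum);
    }

    private static long mix(long x) {            // MurmurHash3 64-bit finalizer
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }
}

Because the state is a fixed-size byte array, one sketch can be stored per time bucket (for example per-minute rows in a Cassandra time series), and two sketches merge by taking the element-wise maximum of their registers, which is what makes this shape of structure a good fit for the streaming and storage design the talk describes.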
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C... (Databricks)
It is common for consumer Internet companies to start off with popular third-party tools for analytics needs. Then, when the user base and the company grows, they end up building their own analytics data pipeline and query engine to cope with their data scale, satisfy custom data enrichment and reporting needs and achieve high quality of their data. That’s exactly the path that was taken at Grammarly, the popular online proofreading service.
In this session, Grammarly will share how they improved business and marketing analytics, previously done with Mixpanel, by building their own in-house analytics engine and application on top of Apache Spark. Chernetsov will touch upon several Spark tweaks and gotchas that they experienced along the way:
– Outputting data to several storages in a single Spark job
– Dealing with the Spark memory model, and building a custom spillable data structure for data traversal
– Implementing a custom query language with parser combinators on top of the Spark SQL parser
– A custom query optimizer and analyzer for when you want something other than plain SQL
– Flexible-schema storage and querying against multi-schema data with schema conflicts
– Custom aggregation functions in Spark SQL (a sketch of this last item follows below)
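The last bullet is the most self-contained, so here is a minimal sketch of a custom aggregation function using Spark's 2.x-era UserDefinedAggregateFunction API. The GeometricMean example and its column names are illustrative assumptions, not code from the Grammarly talk.

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Geometric mean as a UDAF: keeps a running count and sum of logs,
// which merges cleanly across partitions.
public class GeometricMean extends UserDefinedAggregateFunction {
    @Override public StructType inputSchema() {
        return new StructType().add("value", DataTypes.DoubleType);
    }
    @Override public StructType bufferSchema() {
        return new StructType()
            .add("count", DataTypes.LongType)
            .add("logSum", DataTypes.DoubleType);
    }
    @Override public DataType dataType() { return DataTypes.DoubleType; }
    @Override public boolean deterministic() { return true; }
    @Override public void initialize(MutableAggregationBuffer buffer) {
        buffer.update(0, 0L);
        buffer.update(1, 0.0);
    }
    @Override public void update(MutableAggregationBuffer buffer, Row input) {
        if (!input.isNullAt(0)) {
            buffer.update(0, buffer.getLong(0) + 1);
            buffer.update(1, buffer.getDouble(1) + Math.log(input.getDouble(0)));
        }
    }
    @Override public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
        buffer1.update(0, buffer1.getLong(0) + buffer2.getLong(0));
        buffer1.update(1, buffer1.getDouble(1) + buffer2.getDouble(1));
    }
    @Override public Object evaluate(Row buffer) {
        long n = buffer.getLong(0);
        return n == 0 ? null : Math.exp(buffer.getDouble(1) / n);
    }
}

Registered with spark.udf().register("gmean", new GeometricMean()), it becomes callable from SQL as SELECT gmean(value) FROM events.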
An overview of Apache Spark and AWS Glue.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines (MongoDB)
Presented by Eoin Brazil, Proactive Technical Services Engineer, MongoDB
Experience level: Advanced
MongoDB offers a flexible, scalable, and easy way to store your large data set. Python provides many useful data science tools (e.g. NumPy, SciPy, scikit-learn). This talk will discuss the concerns in creating operational data analytics pipelines, introduce Monary as an alternative for loading data into NumPy, give examples of accessing data with Monary, and show how to build scalable data analysis pipelines using these open source tools.
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203... (Amazon Web Services)
Want to learn how to build your own Google Analytics? Learn how to build a scalable architecture using node.js, Amazon DynamoDB, and Amazon EMR. This architecture is used by ScribbleLive to track billions of engagement minutes per month. In this session, we go over the code in node.js, how to store the data in Amazon DynamoDB, and how to roll up the data using Hadoop and Hive. Attend this session to learn how to move data quickly at any scale.
Data processing and analysis is where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics with AWS services such as Amazon Machine Learning, Amazon Elastic MapReduce (EMR), and Amazon Redshift.
Created by: Jason Morris, Solutions Architect
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy... (Databricks)
Building data products typically requires a Lambda Architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It has proven that Spark is impactful and useful in production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez (DataWorks Summit)
Last year at Yahoo, we put great effort into scaling and stabilizing Pig on Tez and making it production-ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez.
After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and to new optimization ideas that we came up with to make it go even faster. We will go over the new features and work done in Tez to make that happen, such as a custom YARN ShuffleHandler, reworked DAG scheduling order, and serialization changes.
We will also cover exciting new features that were added to Pig for performance, such as bloom join and bytecode generation. A distributed bloom join that can create multiple bloom filters in parallel was straightforward to implement with the flexibility of Tez DAGs. It vastly improved performance and reduced disk and network utilization for our large joins. Bytecode generation for projection and filtering of records is another big feature that we are targeting for Pig 0.17, which will speed up processing by reducing virtual function calls.
Data collection and storage is a primary challenge for any big data architecture. This session will focus on the different types of data that customers are handling to drive high-scale workloads on AWS. Our goal is to help you choose the best approach for your workload. We will dive into optimization techniques that improve performance and reduce the cost of data ingestion, covering AWS services including Amazon S3, DynamoDB, and Kinesis.
Created by: Mark Korver, Senior Solutions Architect
Keynote of HadoopCon 2014 Taiwan:
* Data analytics platform architecture & designs
* Lambda architecture overview
* Using SQL as DSL for stream processing
* Lambda architecture using SQL
The Data Pipeline team at Demonware (Activision) has to route large amounts of data from various sources to many destinations every day.
Our team always wanted to be able to query processed data for debugging and analytical purposes, but creating large data warehouses was never our priority, since that usually happens downstream.
AWS Athena is a completely serverless query service that doesn't require any infrastructure setup or complex provisioning. We just needed to save some of our data streams to AWS S3 and define a schema. Just a few simple steps, but in the end we were able to write complex SQL queries against gigabytes of data and get results in seconds.
In this presentation I will show multiple ways to stream your data to AWS S3, explain some of the underlying tech, show how to define a schema, and finally share some of the best practices we applied.
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC... (Alexey Kharlamov)
At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak and close to 200K QPS of all browser calls. We build analytics on these streams of data. There are two applications that require quite significant computational effort: 'sessionization' and fraud detection.
Sessionization means linking a series of requests from the same browser into a single record. There can be five or more requests spread over 15-30 minutes that we need to link to each other.
Fraud detection is a process that looks at various signals in browser requests, as well as substantial historical evidence data, to classify an ad impression as either legitimate or fraudulent.
We've been doing both (as well as all other analytics) in batch mode, once an hour at best. Both processes, and fraud detection in particular, are time-sensitive and much more meaningful if done in near-real-time.
This talk is about our experience migrating once-per-day offline batch processing of impression data using Hadoop to in-memory stream processing using Kafka, Storm and Cassandra. We will touch upon our choices and our reasoning for selecting the products used in this solution.
Hadoop is no longer the only, or always the preferred, option in the Big Data space. In-memory stream processing may be more effective for time-series data preparation and aggregation. The ability to scale at a significantly lower cost means more customers, better accuracy and better business practices: since only in-stream processing allows for low-latency data and insight delivery, it opens entirely new opportunities. However, transitioning non-trivial data pipelines raises a number of questions previously hidden within the offline nature of batch processing. How will you join several data feeds? How will you implement failure recovery? In addition to handling terabytes of data per day, our streaming system has to be guided by the following considerations:
• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limit on data loss
• Maintainability
The system produces complex cross-correlational analysis of several data feeds and aggregation for client analytics with input feed frequency of up to 100K msg/sec.
This presentation will benefit anyone interested in learning an alternative approach to big data analytics, especially the process of joining multiple streams in memory using Cassandra. The presentation will also highlight certain optimization patterns we used that can be useful in similar situations.
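As a hedged sketch of what this kind of pipeline looks like when wired together, here is a minimal Storm (0.9-era, pre-Apache package names) topology that reads impressions from Kafka, keys them by browser, and buffers them for sessionization. The topic names, session-key extraction, and both bolts are illustrative assumptions, not Integral's code.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ImpressionTopology {

    // Extracts a session key so requests from the same browser are grouped.
    // Assumes the browser id is the first comma-separated field of the message.
    public static class ParseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String raw = tuple.getStringByField("str"); // StringScheme emits field "str"
            collector.emit(new Values(raw.split(",", 2)[0], raw));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sessionKey", "raw"));
        }
    }

    // Buffers requests per session; a real implementation would flush sessions
    // to Cassandra after a 15-30 minute inactivity window (e.g. on tick tuples).
    public static class SessionizeBolt extends BaseBasicBolt {
        private transient Map<String, List<String>> openSessions;
        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            openSessions = new HashMap<String, List<String>>();
        }
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String key = tuple.getStringByField("sessionKey");
            if (!openSessions.containsKey(key)) {
                openSessions.put(key, new ArrayList<String>());
            }
            openSessions.get(key).add(tuple.getStringByField("raw"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) {
        SpoutConfig spoutConfig = new SpoutConfig(
            new ZkHosts("zk1:2181"), "impressions", "/kafka-offsets", "impression-spout");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka", new KafkaSpout(spoutConfig), 4);
        builder.setBolt("parse", new ParseBolt(), 4).shuffleGrouping("kafka");
        // fieldsGrouping routes all tuples with the same sessionKey to one task.
        builder.setBolt("sessionize", new SessionizeBolt(), 8)
               .fieldsGrouping("parse", new Fields("sessionKey"));

        new LocalCluster().submitTopology("impression-analytics", new Config(),
            builder.createTopology());
    }
}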
(BDT303) Running Spark and Presto on the Netflix Big Data Platform (Amazon Web Services)
In this session, we discuss how Spark and Presto complement the Netflix big data platform stack that started with Hadoop, and the use cases that Spark and Presto address. Also, we discuss how we run Spark and Presto on top of the Amazon EMR infrastructure; specifically, how we use Amazon S3 as our data warehouse and how we leverage Amazon EMR as a generic framework for data-processing cluster management.
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul... (Amazon Web Services)
Learn how to deploy a managed Presto environment to interactively query log data on AWS
Organizations often need to quickly analyze large amounts of data, such as logs, generated from a wide variety of sources and formats. However, traditional approaches require a lot of time and effort designing complex data transformation and loading processes and configuring data warehouses. Using AWS, you can start querying your datasets within minutes.
In this webinar you will learn how you can deploy a managed Presto environment in minutes to interactively query log data using plain ANSI SQL. Presto is a popular open source SQL engine for running interactive analytic queries against data sources of all sizes. We will talk about common use cases and best practices for running Presto on Amazon EMR.
Learning Objectives:
• Learn how to deploy a managed Presto environment running on Amazon EMR
• Understand best practices for running Presto on Amazon EMR, including use of Amazon EC2 Spot instances
• Learn how other customers are using Presto to analyze large data sets
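To give a flavor of the "plain ANSI SQL" interaction, here is a minimal sketch of querying Presto from Java over JDBC. The host, port (EMR typically exposes the Presto coordinator on 8889), catalog, schema and table names are placeholder assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoLogQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder coordinator endpoint, catalog and schema.
        String url = "jdbc:presto://emr-master:8889/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT status, count(*) AS hits FROM access_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString("status") + "\t" + rs.getLong("hits"));
            }
        }
    }
}

The same query runs unchanged whether the table holds megabytes or petabytes; only the cluster size behind the coordinator changes.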
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012 (Amazon Web Services)
In this talk, we dive into the Netflix Data Science & Engineering architecture. Not just the what, but also the why. Some key topics include the big data technologies we leverage (Cassandra, Hadoop, Pig + Python, and Hive), our use of Amazon S3 as our central data hub, our use of multiple persistent Amazon Elastic MapReduce (EMR) clusters, how we leverage the elasticity of AWS, our data science as a service approach, how we make our hybrid AWS / data center setup work well, and more.
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014 (Amazon Web Services)
As Netflix expands their services to more countries, devices, and content, they continue to evolve their big data analytics platform to accommodate the increasing needs of product and consumer insights. This year, Netflix re-innovated their big data platform: they upgraded to Hadoop 2, transitioned to the Parquet file format, experimented with Pig on Tez for the ETL workload, and adopted Presto as their interactive querying engine. In this session, Netflix discusses their latest architecture, how they built it on the Amazon EMR infrastructure, the contributions put into the open source community, as well as some performance numbers for running a big data warehouse with Amazon S3.
(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams... (Amazon Web Services)
Dynamo Streams provides a stream of all the updates made to your DynamoDB table. It is a simple but extremely powerful primitive that enables developers to easily build solutions like cross-region replication, or to host additional materialized views, for instance an Elasticsearch index, on top of DynamoDB tables. In this session we will dive deep into the details of Dynamo Streams and how customers can leverage them to build custom solutions and extend the functionality of DynamoDB. We will give a demo of an example application built on top of Dynamo Streams to demonstrate their power and simplicity.
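As a hedged sketch of what consuming the stream looks like, here is a shard-polling loop using the AWS SDK for Java (v1-era API). The stream ARN is a placeholder, and a production consumer would more likely use the DynamoDB Streams Kinesis adapter with the Kinesis Client Library than poll shards by hand.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreams;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreamsClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class StreamTail {
    public static void main(String[] args) {
        AmazonDynamoDBStreams streams = AmazonDynamoDBStreamsClientBuilder.defaultClient();
        String streamArn = "arn:aws:dynamodb:...";  // placeholder stream ARN

        DescribeStreamResult desc = streams.describeStream(
            new DescribeStreamRequest().withStreamArn(streamArn));
        for (Shard shard : desc.getStreamDescription().getShards()) {
            String iterator = streams.getShardIterator(new GetShardIteratorRequest()
                    .withStreamArn(streamArn)
                    .withShardId(shard.getShardId())
                    .withShardIteratorType(ShardIteratorType.TRIM_HORIZON))
                .getShardIterator();
            while (iterator != null) {
                GetRecordsResult result = streams.getRecords(
                    new GetRecordsRequest().withShardIterator(iterator));
                for (Record record : result.getRecords()) {
                    // Each record carries the old and/or new image of the item.
                    System.out.println(record.getEventName() + ": " + record.getDynamodb());
                }
                iterator = result.getNextShardIterator();
                if (result.getRecords().isEmpty()) break; // stop polling an idle shard in this sketch
            }
        }
    }
}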
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an... (Anton Kirillov)
This talk is about architecture designs for data processing platforms based on SMACK stack which stands for Spark, Mesos, Akka, Cassandra and Kafka. The main topics of the talk are:
- SMACK stack overview
- storage layer layout
- fixing NoSQL limitations (joins and group by)
- cluster resource management and dynamic allocation
- reliable scheduling and execution at scale
- different options for getting the data into your system
- preparing for failures with proper backup and patching strategies
SMACK Stack 1.0 has been Spark, Mesos, Akka, Cassandra and Kafka working together as cohesive systems delivering different solutions for different use cases. Haven't heard about it before? Oh man! Where have you been? https://www.google.com/search?q=smack+stack+1.0
With SMACK Stack 1.1 we go a step further: Streaming, Mesos, Analytics, Cassandra and Kafka. Joe Stein will walk through in detail some of the different viable options for Streaming and Analytics with Mesos, Kafka and Cassandra.
Data ingest is a deceptively hard problem. In the world of big data processing, it becomes exponentially more difficult. It's not sufficient to simply land data on a system; that data must be ready for processing and analysis. The Kite SDK is a data API designed to solve the issues related to data ingest and preparation. In this talk you'll see how Kite can be used for everything from simple tasks to production-ready data pipelines in minutes.
You've seen the basic two-stage example Spark programs, and now you're ready to move on to something larger. I'll go over lessons I've learned for writing efficient Spark programs, from design patterns to debugging tips.
The slides are largely just talking points for a live presentation, but hopefully you can still make sense of them for offline viewing as well.
"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
drupal 7 amfserver presentation: integrating flash and drupal (rolf vreijdenberger)
In this presentation there will be a full explanation of how to integrate flash and drupal 7 with the amfserver module. Including examples and best practices. Presentation by the author of the amfserver module held at the 2011 DrupalCamp Sweden in Stockholm
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative, in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark delivers lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
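A minimal sketch of the Kafka-to-Spark-Streaming wiring the workshop covers, counting clicks per URL in each batch. This uses the Spark 1.x receiver-based KafkaUtils.createStream API; the topic, ZooKeeper address and consumer group are placeholder assumptions.

import java.util.Collections;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

public class ClickStreamCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("clickstream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // One receiver thread on the "clicks" topic; names are placeholders.
        Map<String, Integer> topics = Collections.singletonMap("clicks", 1);
        JavaPairReceiverInputDStream<String, String> messages =
            KafkaUtils.createStream(jssc, "zk1:2181", "click-counters", topics);

        // The message value is assumed to be the clicked URL; count per batch.
        JavaPairDStream<String, Long> counts = messages
            .map(Tuple2::_2)
            .countByValue();
        counts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}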
Johnny Miller – Cassandra + Spark = Awesome - NoSQL matters Barcelona 2014 (NoSQLmatters)
Johnny Miller – Cassandra + Spark = Awesome
This talk will discuss how Cassandra and Spark can work together to deliver real-time analytics. This is a technical discussion that will introduce attendees to the basic principles of Cassandra and Spark, why they work well together, and example use cases.
Link to the full talk - https://youtu.be/2Rf5t2Eh6IQ
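One common shape of the Cassandra + Spark pairing is a Spark job scanning a Cassandra table through the DataStax Spark Cassandra Connector. A minimal hedged sketch with the connector's Java API follows; the keyspace, table and column names are made up for illustration.

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CassandraSparkExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("cassandra-analytics")
            .setMaster("local[2]")
            .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Full-table scan of a (hypothetical) events table, aggregated in Spark.
        long errorCount = javaFunctions(sc)
            .cassandraTable("analytics", "events")
            .filter(row -> "error".equals(row.getString("type")))
            .count();
        System.out.println("errors: " + errorCount);
        sc.stop();
    }
}

The connector pushes the table scan down to Cassandra token ranges so each Spark partition reads from a local replica where possible, which is the locality property that makes the pairing attractive.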
https://go.dok.community/slack
https://dok.community
ABSTRACT OF THE TALK
This talk will provide a high-level overview of Kubernetes, Helm charts and how they can be used to deploy Apache Druid clusters of any size.
We'll review how Kubernetes functionality enables resilience and self-healing, historical tiers through node-group affinity, and middle manager scaling through Kubernetes autoscaling to optimize ingestion capacity, along with some of the gotchas along the way.
BIO
Sergio Ferragut is a database veteran turned Developer Advocate at Imply. His experience includes 16 years at Teradata in professional services and engineering roles.
He has direct experience in building analytics applications spanning the retail, supply chain, pricing optimization and IoT spaces.
Sergio has worked at multiple technology start-ups including APL and Splice Machine where he helped guide product design and field messaging.
How to develop Big Data Pipelines for Hadoop, by Costin LeauCodemotion
Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline needs to be developed that incorporates and orchestrates many diverse technologies. In this session we will demonstrate how the open source Spring Batch, Spring Integration and Spring Hadoop projects can be used to build manageable and robust pipeline solutions to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, or Pig), but also encompass real-time data acquisition and analysis.
From R Script to Production Using rsparkling with Navdeep Gill (Databricks)
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Similar to Building Hadoop Data Applications with Kite (20)
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta (huguk)
As Hadoop became mainstream, the need to simplify and speed up analytics processes grew rapidly. Data wrangling emerged as a necessary step in any analytical pipeline, and is often considered to be its crux, taking as much as 80% of an analyst's time. In this presentation we will discuss how data wrangling solutions can be leveraged to streamline, strengthen and improve data analytics initiatives on Hadoop, including use cases from Trifacta customers.
Bio: Olivier is EMEA Solutions Lead at Trifacta. He has 7 years experience in analytics with prior roles as technical lead for business analytics at Splunk and quantitative analyst at Accenture and Aon.
Stephen Taylor is the community manager for Ether Camp. They provide an analysis tool for the Ethereum blockchain, ‘Block Explorer’, and an ‘Integrated Development Environment’ (IDE) that empowers developers to build, test and deploy applications in a sandbox environment. This November they are launching their second annual hackathon, hack.ether.camp, which aims to deliver a more sustained approach to the hackathon ideology by utilising blockchain technology.
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop (huguk)
At Google Cloud Platform, we're combining the Apache Spark and Hadoop ecosystem with our software and hardware innovations. We want to make these awesome tools easier, faster, and more cost-effective, from 3 to 30,000 cores. This presentation will showcase how Google Cloud Platform is innovating with the goal of bringing the Hadoop ecosystem to everyone.
Bio: "I love data because it surrounds us - everything is data. I also love open source software, because it shows what is possible when people come together to solve common problems with technology. While they are awesome on their own, I am passionate about combining the power of open source software with the potential unlimited uses of data. That's why I joined Google. I am a product manager for Google Cloud Platform and manage Cloud Dataproc and Apache Beam (incubating). I've previously spent time hanging out at Disney and Amazon. Beyond Google, love data, amateur radio, Disneyland, photography, running and Legos."
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox... (huguk)
This talk will describe his research into using Hadoop to query and manage big geographic datasets, specifically OpenStreetMap (OSM). OSM is an “open-source” map of the world, growing at a fast rate and currently around 5TB of data. The talk will introduce OSM, detail some aspects of the research, and also discuss his experiences with using the SpatialHadoop stack on Azure and Google Cloud.
Extracting maximum value from data while protecting consumer privacy. Jason ... (huguk)
Big organisations have a wealth of rich customer data which opens up huge new opportunities. However, they have the challenge of how to extract value from this data while protecting the privacy of their individual customers. He will talk about the risks organisations face, and what they should do about it. He will survey the techniques which can be used to make data safe for analysis, and talk briefly about how they are solving this problem at Privitar.
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson (huguk)
IBM is developing the Watson Ecosystem to leverage its Developer Cloud, APIs, Content Store and Talent Hub. This is part of IBM's recent announcement of the $1B investment in Watson as a new business unit including Silicon Alley NYC headquarters. For the first time, IBM will open up Watson as a development platform in the Cloud to spur innovation and fuel a new ecosystem of entrepreneurial software app providers who will bring forward a new generation of applications infused with Watson's cognitive computing intelligence.
In this talk about Apache Flink we will touch on three main things: an introductory look at Flink, a look under the hood, and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that performs analysis of the Wikipedia edit stream.
Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
Lambda architecture on Spark, Kafka for real-time large scale ML (huguk)
Sean Owen – Director of Data Science @Cloudera
Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means to build models, but also to ingest data and serve queries, at scale.
This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture: recommendation engines.
Today’s reality Hadoop with Spark - How to select the best Data Science approa... (huguk)
Martin Oberhuber and Eliano Marques, Senior Data Scientists @Think Big International
In this talk Think Big International Lead Data Scientists will discuss the options that exist today for engineering and data science teams aiming to use big data patterns to solve new business problems. With the enterprise adoption of the Hadoop ecosystem and the emerging momentum of open source projects like Spark it is becoming mandatory to have an approach that solves for business results but remains flexible to adapt and change with the open source market.
Signal Media: Real-Time Media & News Monitoring (huguk)
Startup pitch presented by CTO Wesley Hall. Signal Media is a real-time media and news monitoring platform that tracks media outlets. News items are analysed for brand & media monitoring as well as market intelligence.
Startup pitch presented by Aeneas Wiener. Cytora is a real-time geopolitical risk analysis platform that extracts events from open-source intelligence and evaluates these events on their geopolitical impact.
Startup pitch presented by co-founder and CEO Jaco Els. Cubitic offers a predictive analytics platform that allows developers to build custom solutions for analytics and visualisation on top of a machine learning engine.
Startup pitch presented by co-founder and CEO Corentin Guillo. Bird.i is building a platform for up-to-date earth observation data that will bring satellite imagery to the mass market. Providing fresh imagery together with analytics around the forecast of localised demand opens up innovative opportunities in sectors like construction, tourism, real-estate and remote facility monitoring.
Startup pitch presented by co-founders Laure Andrieux and Nic Greenway. Aiseedo applies real-time machine learning, where the model of the world is constantly updated, to build adaptive systems which can be applied to robotics, the Internet of Things and healthcare.
Secrets of Spark's success - Deenar Toraskar, Think Reactive (huguk)
This talk will cover the design and implementation decisions that have been key to the success of Apache Spark over competing cluster computing frameworks. It will delve into the whitepaper behind Spark and cover the design of Spark RDDs, the abstraction that enables the Spark execution engine to be extended to support a wide variety of use cases: Spark SQL, Spark Streaming, MLlib and GraphX. RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
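The multi-pass claim is easiest to see in code: an RDD can be materialized in memory once and then reused across passes, so only the first action pays the cost of scanning the input. A minimal sketch (the file path and filter predicates are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class MultiPass {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("multi-pass").setMaster("local[2]"));

        // Cache the filtered RDD so later passes hit memory, not disk:
        // reuse across passes is where the up-to-100x speedups come from.
        JavaRDD<String> errors = sc.textFile("hdfs:///logs/app.log")  // placeholder path
            .filter(line -> line.contains("ERROR"))
            .persist(StorageLevel.MEMORY_ONLY());

        long total = errors.count();                                       // pass 1: scans the file
        long timeouts = errors.filter(l -> l.contains("timeout")).count(); // pass 2: served from cache
        System.out.println(total + " errors, " + timeouts + " timeouts");
        sc.stop();
    }
}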
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal... (huguk)
Technical developments in the area of data warehousing have allowed companies to push their analysis a step further and, therefore, allowed data scientists to deliver more value to business areas. In this session, we will focus on the case of performance marketing at King and demonstrate how we use Hadoop capabilities to exploit user-level data efficiently. That approach results in a more holistic view in return-on-investment analysis of TV advertisement.
Hadoop - Looking to the Future By Arun Murthy (huguk)
Hadoop - Looking to the Future
By Arun Murthy (Founder of Hortonworks, Creator of YARN)
The Apache Hadoop ecosystem began as just HDFS & MapReduce nearly 10 years ago in 2006.
Very much like the Ship of Theseus (http://en.wikipedia.org/wiki/Ship_of_Theseus), Hadoop has undergone an incredible amount of transformation, from multi-purpose YARN to interactive SQL with Hive/Tez to machine learning with Spark.
Much more lies ahead: whether you want sub-second SQL with Hive, effective use of SSDs/memory in HDFS, or metadata-driven security policies in Ranger, the Hadoop ecosystem in the Apache Software Foundation continues to evolve to meet new challenges and use cases.
Arun C Murthy has been involved with Apache Hadoop since the beginning of the project - nearly 10 years now. In the beginning he led MapReduce, went on to create YARN and then drove Tez & the Stinger effort to get to interactive & sub-second Hive. Recently he has been very involved in the Metadata and Governance efforts. In between he founded Hortonworks, the first public Hadoop distribution company.
1. Building Hadoop Data Applications with Kite
Tom White (@tom_e_white)
Hadoop Users Group UK, London, 17 June 2014
2. About me
• Engineer at Cloudera working on Core Hadoop and Kite
• Apache Hadoop Committer, PMC Member, Apache Member
• Author of "Hadoop: The Definitive Guide"
7. Glossary
• Apache Avro – cross-language data serialization library
• Apache Parquet (incubating) – column-oriented storage format for nested data
• Apache Hive – data warehouse (SQL and metastore)
• Apache Flume – streaming log capture and delivery system
• Apache Oozie – workflow scheduler system
• Apache Crunch – Java API for writing data pipelines
• Impala – interactive SQL on Hadoop
8. Outline
• A Typical Application
• Kite SDK
• An Example
• Advanced Kite
14. Kite
• A client-side library for writing Hadoop Data Applications
• First release was in April 2013 as CDK
• 0.14.1 released last month
• Open source, Apache 2 license, kitesdk.org
• Modular
  • Data module (HDFS, Flume, Crunch, Hive, HBase)
  • Morphlines transformation module
  • Maven plugin
16. Kite Data Module
• Dataset – a collection of entities
• DatasetRepository – physical storage location for datasets
• DatasetDescriptor – holds dataset metadata (schema, format)
• DatasetWriter – write entities to a dataset in a stream
• DatasetReader – read entities from a dataset
• http://kitesdk.org/docs/current/apidocs/index.html
17. 1. Define the Event Entity

public class Event {
  private long id;
  private long timestamp;
  private String source;
  // getters and setters
}
18. 2. Create the Events Dataset

DatasetRepository repo = DatasetRepositories.open("repo:hive");
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .build();
repo.create("events", descriptor);
19. (2. or with the Maven plugin)

$ mvn kite:create-dataset \
    -Dkite.repositoryUri='repo:hive' \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
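The deck jumps from creating the dataset straight to partitioning, so for completeness here is a minimal sketch (not from the slides) of the DatasetWriter and DatasetReader abstractions listed on slide 16, assuming the 0.14-era API (imports from org.kitesdk.data) in which readers and writers were opened explicitly:

// Sketch only: write one Event to the "events" dataset, then read it back.
DatasetRepository repo = DatasetRepositories.open("repo:hive");
Dataset<Event> events = repo.load("events");

DatasetWriter<Event> writer = events.newWriter();
try {
  writer.open();                        // explicit open() in early Kite releases
  Event e = new Event();
  e.setId(1L);                          // setters exist per slide 17
  e.setTimestamp(System.currentTimeMillis());
  e.setSource("web");
  writer.write(e);
} finally {
  writer.close();
}

DatasetReader<Event> reader = events.newReader();
try {
  reader.open();
  while (reader.hasNext()) {
    System.out.println(reader.next().getSource());
  }
} finally {
  reader.close();
}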
29. Unified Storage Interface
• Dataset – streaming access, HDFS storage
• RandomAccessDataset – random access, HBase storage
• PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase
30. Filesystem Partitions

PartitionStrategy p = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp")
    .build();

/user/hive/warehouse/events
  /year=2014/month=02/day=08
    /FlumeData.1375659013795
    /FlumeData.1375659013796
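The slide shows the strategy and the resulting directory layout but not the wiring between them. As a sketch (assumed, not shown in the deck), the strategy is attached through the DatasetDescriptor when the dataset is created:

// Sketch: attach the partition strategy via the descriptor so that writes
// land in the year=/month=/day= directories shown above.
DatasetDescriptor partitioned = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .partitionStrategy(p)
    .build();
repo.create("events", partitioned);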
35. Parallel Processing
• Goal is for Hadoop processing frameworks to "just work"
• Support Formats, Partitions, Views
• Native Kite components, e.g. DatasetOutputFormat for MR

             HDFS Dataset    HBase Dataset
Crunch       Yes             Yes
MapReduce    Yes             Yes
Hive         Yes             Planned
Impala       Yes             Planned
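A heavily hedged sketch of the MapReduce path: the slide names DatasetOutputFormat, while later Kite releases ship this as DatasetKeyInputFormat/DatasetKeyOutputFormat with configure() builders, so treat the exact class names and calls below as version-dependent assumptions; SummaryMapper and SummaryReducer are hypothetical.

// Sketch only: wire Kite datasets into an MR job using the configure() style
// of later Kite releases ("events" in, "summaries" out).
Job job = Job.getInstance(new Configuration(), "event-summaries");
job.setMapperClass(SummaryMapper.class);
job.setReducerClass(SummaryReducer.class);
DatasetKeyInputFormat.configure(job).readFrom(events);
DatasetKeyOutputFormat.configure(job).writeTo(summaries);
job.waitForCompletion(true);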
36. Schema Evolution

public class Event {
  private long id;
  private long timestamp;
  private String source;
  @Nullable private String ipAddress;
}

$ mvn kite:update-dataset \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
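Presumably the same evolution can be driven through the API; a sketch, assuming DatasetRepository.update() accepts a descriptor rebuilt from the changed class:

// Sketch: rebuild the descriptor from the updated Event class (which now
// carries the @Nullable ipAddress field) and update the dataset in place.
DatasetDescriptor updated = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .build();
repo.update("events", updated);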
37. Searchable Datasets
• Use Flume Solr Sink (in addition to HDFS Sink)
• Morphlines library to define fields to index
• SolrCloud runs on cluster from indexes in HDFS
• Future support in Kite to index selected fields automatically
39. Kite makes it easy to get data into Hadoop, with a flexible schema model that is storage agnostic, in a format that can be processed with a wide range of Hadoop tools
40. Getting Started With Kite
• Examples at github.com/kite-sdk/kite-examples
  • Working with streaming and random-access datasets
  • Logging events to datasets from a webapp
  • Running a periodic job
  • Migrating data from CSV to a Kite dataset
  • Converting an Avro dataset to a Parquet dataset
  • Writing and configuring Morphlines
  • Using Morphlines to write JSON records to a dataset
43. Applications
• [Batch] Analyze an archive of songs [1]
• [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
• [Search] Searching email traffic in near-real-time [3]
• [ML] Detecting fraudulent transactions using clustering [4]

[1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
[2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
[3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
[4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
44. … or use JDBC

Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection connection = DriverManager.getConnection(
    "jdbc:hive2://localhost:21050/;auth=noSasl");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
    "SELECT * FROM summaries");
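The slide stops at executeQuery; a standard JDBC follow-through (nothing Kite-specific, added here for completeness) would iterate the rows and release resources:

// Iterate the result set, then close resources in reverse order of creation.
while (resultSet.next()) {
  System.out.println(resultSet.getString(1)); // first column of each row
}
resultSet.close();
statement.close();
connection.close();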
45. Apps
• App – a packaged Java program that runs on a Hadoop cluster
• cdk:package-app – create a package on the local filesystem
  • like an exploded WAR
  • Oozie format
• cdk:deploy-app – copy packaged app to HDFS
• cdk:run-app – execute the app
• Workflow app – runs once
• Coordinator app – runs other apps (like cron)
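Taken together, the lifecycle above maps onto three Maven invocations (a sketch using only the goal names from the slide; repository and app flags omitted):

$ mvn cdk:package-app   # build the Oozie-format package on the local filesystem
$ mvn cdk:deploy-app    # copy the packaged app to HDFS
$ mvn cdk:run-app       # execute the app on the cluster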
46. Morphlines Example

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22