Building Hadoop Data Applications with Kite by Tom White

4,988 views

Published on

With a such a large number of components in the Hadoop ecosystem, writing Hadoop applications can be a big challenge for newcomers. In this talk Tom looks at best practices for building data applications that run on Hadoop, and introduces the Kite SDK, an open source project created at Cloudera with the goal of simplifying Hadoop application development by codifying many of these best practices.

Meet with Tom White:
Tom White is one of the foremost experts on Hadoop. He has been an Apache Hadoop committer since February 2007, and is a Member of the Apache Software Foundation. Tom is a software engineer at Cloudera, where he has worked, since its foundation, on the core distributions from Apache and Cloudera. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O’Reilly, java.net and IBM’s developerWorks, and has spoken at many conferences, including ApacheCon and OSCON. Tom has a B.A. in mathematics from the University of Cambridge and an M.A. in philosophy of science from the University of Leeds, UK. He currently lives in Wales with his family.

Building Hadoop Data Applications with Kite by Tom White

  1. 1. Building(Hadoop(Data(Applica;ons(with(Kite( Headline(Goes(Here( Tom(White(@tom_e_white( Speaker(Name(or(Subhead(Goes(Here( The(Hive,(February(18,(2014( 1
  2. 2. Hadoop(0.1 ( % cat bigdata.txt | hadoop fs -put - in! % hadoop MyJob in out! % hadoop fs -get out! 2
  3. 3. Characteris;cs ( •  Batch(applica;ons(only( •  LowNlevel(coding( •  File(format( •  Serializa;on( •  Par;;oning(scheme( 3
  4. 4. A(Hadoop(Stack ( 4
  5. 5. Applica;ons ( •  [Batch](Analyze(an(archive(of(songs1( •  [Interac;ve(SQL](Ad(hoc(queries(on(recommenda;ons(from( social(media(applica;ons2( •  [Search](Searching(email(traffic(in(nearNreal;me3( •  [ML](Detec;ng(fraudulent(transac;ons(using(clustering4( 5 [1](hZp://blog.cloudera.com/blog/2012/08/processNaNmillionNsongsNwithNapacheNpig/(( [2](hZp://blog.cloudera.com/blog/2014/01/howNwajamNanswersNbusinessNques;onsNfasterNwithNhadoop/(( [3](hZp://blog.cloudera.com/blog/2013/09/emailNindexingNusingNclouderaNsearch/(( [4](hZp://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/((
  6. 6. Outline ( •  A(Typical(Applica;on( •  Kite(SDK( •  An(Example( •  Advanced(Kite( •  Conclusion( •  Ques;ons( 6
  7. 7. A(typical(applica;on((zoom(100:1) ( 7
  8. 8. A(typical(applica;on((zoom(10:1) ( 8
  9. 9. A(typical(pipeline((zoom(5:1) ( 9
  10. 10. Kite(SDK ( 10
  11. 11. Kite(Codifies(Best(Prac;ce(as(APIs,(Tools,(Docs( and(Examples ( 11
  12. 12. Kite ( •  A(clientNside(library(for(wri;ng(Hadoop(Data(Applica;ons( •  First(release(was(in(April(2013(as(CDK( •  0.11.0(earlier(this(month( •  Open(source,(Apache(2(license,(kitesdk.org( •  Modular( •  Data(module((HDFS,(Flume,(Crunch,(Hive,(HBase)( •  Morphlines(transforma;on(module( •  Maven(plugin( 12
  13. 13. An(Example ( 13
  14. 14. Kite(Data(Module ( •  Dataset(–(a(collec;on(of(en;;es( •  DatasetRepository(–(physical(storage(loca;on(for(datasets( •  DatasetDescriptor(–(holds(dataset(metadata((schema,(format)( •  DatasetWriter(–(write(en;;es(to(a(dataset(in(a(stream( •  DatasetReader(–(read(en;;es(from(a(dataset(( 14
  15. 15. 1.(Define(the(Event(En;ty ( public class Event {! private long id;! private long timestamp;! private String source;! // getters and setters! }! 15
  16. 16. 2.(Create(the(Events(Dataset ( DatasetRepository repo = DatasetRepositories.open("repo:hive");! DatasetDescriptor descriptor =! new DatasetDescriptor.Builder()! .schema(Event.class).build();! repo.create("events", descriptor);! 16
  17. 17. (2.(or(with(the(Maven(plugin) ( $ mvn kite:create-dataset ! -Dkite.repositoryUri='repo:hive' ! -Dkite.datasetName=events ! -Dkite.avroSchemaReflectClass=com.example.Event! 17
  18. 18. A(peek(at(the(Avro(schema ( $ hive -e "DESCRIBE EXTENDED events"! ...! {! "type" : "record",! "name" : "Event",! "namespace" : "com.example",! "fields" : [! { "name" : "id", "type" : "long" },! { "name" : "timestamp", "type" : "long" },! { "name" : "source", "type" : "string" }! ]! 18 }!
  19. 19. 3.(Write(Events ( Logger logger = Logger.getLogger(...);! Event event = new Event();! event.setId(id);! event.setTimestamp(System.currentTimeMillis());! event.setSource(source);! logger.info(event);! 19
  20. 20. Log4j(configura;on ( log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender! log4j.appender.flume.Hostname = localhost! log4j.appender.flume.Port = 41415! log4j.appender.flume.DatasetRepositoryUri = repo:hive ! log4j.appender.flume.DatasetName = events! 20
  21. 21. The(resul;ng(file(layout ( /user! /hive! /warehouse! /events! /FlumeData.1375659013795! /FlumeData.1375659013796! 21 Avro( files(
  22. 22. 4.(Generate(Summaries(with(Crunch ( PCollection<Event> events = read(asSource(repo.load("events"), Event.class));! PCollection<Summary> summaries = events! .by(new GetTimeBucket(), // minute of day, source ! Avros.pairs(Avros.longs(), Avros.strings()))! .groupByKey()! .parallelDo(new MakeSummary(),! Avros.reflects(Summary.class));! 22 write(summaries, asTarget(repo.load("summaries"))!
  23. 23. …(and(run(using(Maven ( $ mvn kite:create-dataset -Dkite.datasetName=summaries ...! <plugin>! <groupId>org.kitesdk</groupId>! <artifactId>kite-maven-plugin</artifactId>! <configuration>! <toolClass>com.example.GenerateSummaries</toolClass>! </configuration>! </plugin>! 23 $ mvn kite:run-tool!
  24. 24. 5.(Query(with(Impala ( $ impala-shell -q ’DESCRIBE events'! +-----------+--------+-------------------+! | name | type | comment |! +-----------+--------+-------------------+! | id | bigint | from deserializer |! | timestamp | bigint | from deserializer |! | source | string | from deserializer |! +-----------+--------+-------------------+! 24
  25. 25. …(Ad(Hoc(Queries ( $ impala-shell -q 'SELECT source, COUNT(1) AS cnt FROM events GROUP BY source'! +--------------------------------------+-----+! | source | cnt |! +--------------------------------------+-----+! | 018dc1b6-e6b0-489e-bce3-115917e00632 | 38 |! | bc80040e-09d1-4ad2-8bd8-82afd1b8431a | 85 |! +--------------------------------------+-----+! Returned 2 row(s) in 0.56s! 25
  26. 26. …(or(use(JDBC ( Class.forName("org.apache.hive.jdbc.HiveDriver");! Connection connection = DriverManager.getConnection(! "jdbc:hive2://localhost:21050/;auth=noSasl");! Statement statement = connection.createStatement();! ResultSet resultSet = statement.executeQuery(! "SELECT * FROM summaries");! 26
  27. 27. Advanced(Kite ( 27
  28. 28. Unified(Storage(Interface ( •  Dataset(–(streaming(access,(HDFS(storage( •  RandomAccessDataset(–(random(access,(HBase(storage( •  Par;;onStrategy(defines(how(to(map(an(en;ty(to(par;;ons(in( HDFS(or(row(keys(in(HBase( 28
  29. 29. Filesystem(Par;;ons ( PartitionStrategy p = new PartitionStrategy.Builder() ! .year("timestamp")! .month("timestamp")! .day("timestamp").build();! /user/hive/warehouse/events! /year=2014/month=02/day=08! /FlumeData.1375659013795! /FlumeData.1375659013796! 29
  30. 30. HBase(Keys:(Defined(in(Avro ( {! "name": "username",! "type": "string",! "mapping": { "type": "key", "value": "0" }! },! {! "name": "favoriteColor",! "type": "string",! "mapping": { "type": "column", "value": "meta:fc" } ! }! 30
  31. 31. Random(Access(Dataset:(Crea;on ( RandomAccessDatasetRepository repo = DatasetRepositories.openRandomAccess(! "repo:hbase:localhost");! RandomAccessDataset<User> users = repo.load("users"); ! users.put(new User("bill", "green"));! users.put(new User("alice", "blue"));! 31
  32. 32. Random(Access(Dataset:(Retrieval ( Key key = new Key.Builder(users)! .add("username", "bill").build();! User bill = users.get(key);! 32
  33. 33. Views ( View<User> view = users.from("username", "bill");! DatasetReader<User> reader = view.newReader();! reader.open();! for (User user : reader) {! System.out.println(user);! }! reader.close();! 33
  34. 34. Parallel(Processing ( •  Goal(is(for(Hadoop(processing(frameworks(to(“just(work”( •  Support(Formats,(Par;;ons,(Views( •  Na;ve(Kite(components,(e.g.(DatasetOutputFormat(for(MR( HDFS%Dataset% Crunch( MapReduce( Impala( 34 HBase%Dataset% Yes( 0.12.0( 0.12.0( 0.12.0( Yes( Planned(
  35. 35. Schema(Evolu;on ( public class Event {! private long id;! private long timestamp;! private String source;! @Nullable private String ipAddress;! }! $ mvn kite:update-dataset ! -Dkite.datasetName=events ! -Dkite.avroSchemaReflectClass=com.example.Event! 35
  36. 36. Searchable(Datasets ( •  Use(Flume(Solr(Sink((in( addi;on(to(HDFS(Sink)( •  Morphlines(library(to(define( fields(to(index( •  SolrCloud(runs(on(cluster(from( indexes(in(HDFS( •  Future(support(in(Kite(to(index( selected(fields(automa;cally( 36
  37. 37. Conclusion ( 37
  38. 38. Kite(makes(it(easy(to(get(data(into(Hadoop( with(a(flexible(schema(model(that(is(storage( agnos;c(in(a(format(that(can(be(processed( with(a(wide(range(of(Hadoop(tools ( 38
  39. 39. Gepng(Started(With(Kite ( •  Examples(at(github.com/kiteNsdk/kiteNexamples( •  Working(with(streaming(and(randomNaccess(datasets( •  Logging(events(to(datasets(from(a(webapp( •  Running(a(periodic(job( •  Migra;ng(data(from(CSV(to(a(Kite(dataset( •  Conver;ng(an(Avro(dataset(to(a(Parquet(dataset( •  Wri;ng(and(configuring(Morphlines( •  Using(Morphlines(to(write(JSON(records(to(a(dataset( 39
  40. 40. Ques;ons? ( kitesdk.org ( @tom_e_white ( tom@cloudera.com ( 40
  41. 41. 41
  42. 42. About(me ( •  Engineer(at(Cloudera(working( on(Core(Hadoop(and(Kite( •  Apache(Hadoop(CommiZer,( PMC(Member,(Apache(Member( •  Author(of(( “Hadoop:(The(Defini;ve(Guide”( 42
  43. 43. Morphlines(Example ( morphlines(:([( ({( (((id(:(morphline1( (((importCommands(:(["com.cloudera.**",("org.apache.solr.**"]( (((commands(:([( ((((({(readLine({}(}((((((((((((((((((((( ((((({(( (((((((grok({(( (((((((((dic;onaryFiles(:([/tmp/grokNdic;onaries](((((((((((((((((((((((((((((((( (((((((((expressions(:({(( (((((((((((message(:("""<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_;mestamp}(% {SYSLOGHOST:syslog_hostname}(%{DATA:syslog_program}(?:[%{POSINT:syslog_pid}])?:(% {GREEDYDATA:syslog_message}"""( Example Input! <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22! (((((((((}( Output Record! (((((((}( syslog_pri:164! (((((}( syslog_timestamp:Feb  4 10:46:14! ((((({(loadSolr({}(}(((((( syslog_hostname:syslog! syslog_program:sshd! ((((]( syslog_pid:607! (}( syslog_message:listening on 0.0.0.0 port 22.! 43 ](
  44. 44. Apps ( •  App(–(a(packaged(Java(program(that(runs(on(a(Hadoop(cluster( •  cdk:packageNapp(–(create(a(package(on(the(local(filesystem( •  like(an(exploded(WAR( •  Oozie(format( •  cdk:deployNapp(–(copy(packaged(app(to(HDFS( •  cdk:runNapp(–(execute(the(app( •  Workflow(app(–(runs(once( •  Coordinator(app(–(runs(other(apps((like(cron)( 44

×