Large Scale ETL for Hadoop and Cloudera Search using Morphlines

Cloudera Morphlines is a new, embeddable, open source Java framework that reduces the time and skills necessary to integrate and build Hadoop applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, analytic online dashboards, or other consumers. If you want to integrate, build, or facilitate streaming or batch transformation pipelines without programming and without MapReduce skills, and get the job done with a minimum amount of fuss and support costs, Morphlines is for you.

In this talk, you'll get an overview of Morphlines internals and explore sample use cases that can be widely applied.


1. Large Scale ETL for Hadoop and Cloudera Search using Morphlines
   Wolfgang Hoschek (@whoschek)
   Silicon Valley Java User Group Meetup, Sept 2013
2. Agenda
   • Hadoop, ETL and Search – setting the stage
   • Cloudera Morphlines Architecture
   • Component Deep Dive
   • Cloudera Search Use Cases
   • What’s next?
   Feel free to ask questions as we go!
3. Example ETL Use Case: Distributed Search on Hadoop
   [diagram: Flume, MapReduce, and HBase each run Index (ETL) steps on a Hadoop cluster with HDFS, feeding SolrCloud; Hue UI, custom UIs, and custom apps issue queries against Solr]
4. Cloudera Morphlines Architecture
   [diagram: logs, tweets, social media, HTML, images, PDF, text – anything you want to index – flows through Flume, the MR indexer, the HBase indexer, etc. (or your application), each embedding the Morphline library, into SolrCloud]
   Morphlines can be embedded in any application – including your app!
5. Cloudera Morphlines
   • Open source framework for simple ETL
   • Consume any kind of data from any kind of data source, process it, and load it into any app or storage system
   • Designed for near-real-time apps & batch apps
   • Ships as part of the Cloudera Developer Kit (CDK) and Cloudera Search
   • It’s a Java library
   • ASL licensed, on GitHub (https://…)
   • Similar to Unix pipelines, but more convenient & efficient
   • Configuration over coding (reduces time & skills)
   • Supports common file formats
     • Log files & text
     • Avro, SequenceFile
     • JSON, HTML & XML
     • Etc. (pluggable)
   • Extensible set of transformation commands
6. Extraction, Transformation and Loading
   • Chain of pipelined commands
   • Simple and flexible data mapping & transformation
   • Reusable across multiple index workloads
   • Over time, extend and re-use across platform workloads
   [diagram: syslog → Flume agent with Solr sink → Morphline library chaining the readLine, grok, and loadSolr commands, turning a Flume event into records and finally a Solr document]
7. Like a Unix Pipeline
   • Like Unix pipelines, but the data model is generalized to work with streams of generic records, including arbitrary binary payloads
   • Designed to be embedded into Hadoop components such as Search, Flume, MapReduce, Pig, Hive, Sqoop
8. Stdlib + Plugins
   • The framework ships with a set of frequently used high-level transformation and I/O commands that can be combined in application-specific ways
   • The plugin system allows adding new transformation and I/O commands, and integrates existing functionality and third-party systems in a straightforward manner
9. Flexible Data Model
   • A record is a set of named fields where each field has an ordered list of one or more Java objects (i.e. Guava’s ArrayListMultimap)
   • A field can have multiple values, and any two records need not use common field names
   • Corresponds exactly to the Solr/Lucene data model
   • Pass not only structured data, but also arbitrary binary data
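To make the data model concrete, here is a minimal, self-contained sketch in plain Java. This is illustrative only — the real Record class wraps Guava’s ArrayListMultimap rather than a map of lists:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the morphline record data model: a mapping from
// field names to an ordered list of one or more values. The real Record
// class wraps Guava's ArrayListMultimap; this stand-in only shows the shape.
public class RecordSketch {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // Append a value to a field; a field may hold multiple values.
    public void put(String field, Object value) {
        fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    // Return the ordered (possibly empty) list of values for a field.
    public List<Object> get(String field) {
        return fields.getOrDefault(field, Collections.emptyList());
    }
}
```

Two records built this way need not share any field names, matching the Solr/Lucene model of sparse, multi-valued documents.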
10. Passing Binary Data
    • _attachment_body field (optional)
      • an InputStream or Java byte[]
    • Optional fields assist with detecting & parsing the data type:
      • _attachment_mimetype field, e.g. "application/pdf"
      • _attachment_charset field, e.g. "UTF-8"
      • _attachment_name field, e.g. "cars.pdf"
    • Conceptually similar to email and HTTP headers/body
11. Processing Model
    • Morphline commands manipulate continuous or arbitrarily large streams of records
    • A command transforms a record into zero or more records
    • The output records of a command are passed to the next command in the chain
    • A command can contain nested commands
    • A morphline is a tree of commands, essentially a push-based data flow engine
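As a rough sketch of this push model (hypothetical names, not the real CDK API): each command does its own work and then calls process on its child, so a chain is just nested method calls.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the push-based processing model. Command and the
// factory methods below are hypothetical stand-ins, not the real CDK API.
public class PipelineSketch {

    // A command consumes a record and pushes zero or more records downstream.
    interface Command {
        boolean process(Map<String, Object> record);
    }

    // Example transformation: uppercase the "message" field, then push on.
    static Command uppercase(Command child) {
        return record -> {
            Object msg = record.get("message");
            if (msg != null) {
                record.put("message", msg.toString().toUpperCase());
            }
            return child.process(record); // pass to next command in chain
        };
    }

    // Terminal command: collect records into a list (instead of e.g. loadSolr).
    static Command collector(List<Map<String, Object>> sink) {
        return record -> sink.add(new HashMap<>(record));
    }
}
```

A morphline built from such commands runs entirely in the caller’s thread, which is why piping a record from one command to the next is just a method call.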
12. Processing Model Non-Goals
    • Designed to be embedded into multiple host systems, thus…
    • No notion of persistence, durability, distributed computing, or node failover
    • Basically just a chain of in-memory transformations in the current thread
    • No need to manage multiple nodes or threads – already covered by host systems such as MapReduce, Flume, Storm, Samza, etc.
    • However, a morphline does support passing notifications
      • E.g. BEGIN_TRANSACTION, COMMIT_TRANSACTION, ROLLBACK_TRANSACTION, SHUTDOWN
13. Performance and Scaling
    • The runtime compiles a morphline on the fly
    • The runtime processes all commands of a given morphline in the same thread
    • Piping a record from one command to another is fast
      • just a cheap Java method call
      • no queues, no handoffs among threads, no context switches, and no serialization between commands
    • For scalability, deploy many morphline instances on a cluster in many Flume agents and MapReduce tasks
14. Syntax
    • HOCON format (Human-Optimized Config Object Notation)
    • Basically JSON, slightly adjusted for the configuration file use case
    • Came out of the Typesafe config library
    • Also used by the Akka and Play frameworks
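A small (hypothetical) snippet illustrates the relaxations HOCON adds over strict JSON:

```hocon
# Comments are allowed (illustrative snippet only).
morphlines : [
  {
    id : morphline1          # unquoted keys and simple values
    commands : [             # no commas required between array elements
      { readLine { charset : UTF-8 } }
      { loadSolr {} }
    ]
  }
]
```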
15. Example: Indexing log4j with stack traces

```
juil. 25, 2012 10:49:40 AM hudson.triggers.SafeTimerTask run
ok
juil. 25, 2012 10:49:46 AM hudson.triggers.SafeTimerTask run
failed
com.amazonaws.AmazonClientException: Unable to calculate a request signature
    at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(…)
    at …
Caused by: com.amazonaws.AmazonClientException: Unable to calculate a request signature
    at com.amazonaws.auth.AbstractAWSSigner.sign(…)
    at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(…)
    ... 14 more
Caused by: java.lang.IllegalArgumentException: Empty key
    at javax.crypto.spec.SecretKeySpec.<init>(…)
    at com.amazonaws.auth.AbstractAWSSigner.sign(…)
    ... 15 more
juil. 25, 2012 10:49:54 AM hudson.slaves.SlaveComputer tryReconnect
```

(These log lines fold into Record 1, Record 2, and Record 3.)
16. Example: Indexing log4j with stack traces

```
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      {
        readMultiLine {
          regex : "(^.+Exception: .+)|(^\\s+at .+)|(^\\s+\\.\\.\\. \\d+ more)|(^\\s*Caused by:.+)"
          what : previous
          charset : UTF-8
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
      { loadSolr {} }
    ]
  }
]
```
17. Example: Escape to Java Code

```
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      {
        java {
          code: """
            List tags = record.get("tags");
            if (!tags.contains("hello")) {
              return false;
            }
            tags.add("world");
            return child.process(record);
            """
        }
      }
    ]
  }
]
```
18. Current Command Library
    • Integrate with and load into Apache Solr
    • Flexible log file analysis
      • Single-line records, multi-line records, CSV files
    • Regex-based pattern matching and extraction
    • Integration with Avro, JSON, XML, HTML
    • Integration with Apache Hadoop SequenceFiles
    • Integration with SolrCell and all Apache Tika parsers
    • Auto-detection of MIME types from binary data using Apache Tika
19. Current Command Library (cont’d)
    • Scripting support for dynamic Java code
    • Operations on fields for assignment and comparison
    • Operations on fields with list and set semantics
    • if-then-else conditionals
    • A small rules engine (tryRules)
    • String and timestamp conversions
    • slf4j logging
    • Yammer metrics and counters
    • Decompression and unpacking of arbitrarily nested container file formats
    • etc.
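As a sketch of how a few of these building blocks combine in one morphline (readCSV, addValues, and logDebug are stdlib command names; the column and field names here are made up):

```hocon
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**"]
    commands : [
      # parse each line as CSV into named fields (hypothetical columns)
      { readCSV { separator : ",", columns : [id, user_name, message], charset : UTF-8 } }
      # field assignment: tag every record with its source
      { addValues { data_source : "csv-import" } }
      # slf4j logging of the whole record
      { logDebug { format : "output record: {}", args : ["@{}"] } }
      { loadSolr {} }
    ]
  }
]
```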
20. Plugin Commands
    • Easy to add new I/O & transformation commands
    • Integrate existing functionality and third-party systems
    • Implement the Java interface Command or subclass AbstractCommand
    • Add it to the Java classpath
    • No registration or other administrative action required
21. Morphline Example – syslog with grok

```
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]
```

Example input:

```
<164>Feb  4 10:46:14 syslog sshd[607]: listening on port 22
```

Output record:

```
syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on port 22
```
22. Example Java Driver Program

```java
// Imports per the CDK 0.x package layout; the class name is added here
// for compilability (the slide showed only the main method).
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import com.cloudera.cdk.morphline.api.Command;
import com.cloudera.cdk.morphline.api.MorphlineContext;
import com.cloudera.cdk.morphline.api.Record;
import com.cloudera.cdk.morphline.base.Compiler;
import com.cloudera.cdk.morphline.base.Fields;
import com.cloudera.cdk.morphline.base.Notifications;

public class MorphlineDriver {
  /** Usage: java ... <morphline.conf> <dataFile1> ... <dataFileN> */
  public static void main(String[] args) throws Exception {
    // compile the morphline.conf file on the fly
    File conf = new File(args[0]);
    MorphlineContext ctx = new MorphlineContext.Builder().build();
    Command morphline = new Compiler().compile(conf, null, ctx, null);

    // process each input data file
    Notifications.notifyBeginTransaction(morphline);
    for (int i = 1; i < args.length; i++) {
      InputStream in = new FileInputStream(new File(args[i]));
      Record record = new Record();
      record.put(Fields.ATTACHMENT_BODY, in);
      morphline.process(record);
      in.close();
    }
    Notifications.notifyCommitTransaction(morphline);
  }
}
```
23. Potential New Plugin Commands
    • Extract, clean, transform, join, integrate, enrich and decorate records
    • Examples:
      • join records with external data sources such as relational databases, key-value stores, local files or IP geo lookup tables
      • perform DNS resolution, expand shortened URLs
      • fetch linked metadata from social networks
      • do sentiment analysis & annotate the record accordingly
      • continuously maintain stats over sliding windows
      • compute exact or approximate distinct values & quantiles
24. Example Command Implementation (1/2)

```java
public final class ToStringBuilder implements CommandBuilder {

  @Override
  public Collection<String> getNames() {
    return Collections.singletonList("toString");
  }

  @Override
  public Command build(Config config, Command parent, Command child, MorphlineContext context) {
    return new ToString(config, parent, child, context);
  }

  private static final class ToString extends AbstractCommand {
    @Override
    protected boolean doProcess(Record record) {
      // some custom processing goes here
      return super.doProcess(record); // pass to next command in chain
    }
  }
}
```
25. Example Command Implementation (2/2)

```java
private static final class ToString extends AbstractCommand {

  private final String fieldName;
  private final boolean trim;

  public ToString(Config config, Command parent, Command child, MorphlineContext context) {
    super(config, parent, child, context);
    this.fieldName = getConfigs().getString(config, "field");
    this.trim = getConfigs().getBoolean(config, "trim", false);
    validateArguments();
  }

  @Override
  protected boolean doProcess(Record record) {
    ListIterator iter = record.get(fieldName).listIterator();
    while (iter.hasNext()) {
      String str = iter.next().toString();
      iter.set(trim ? str.trim() : str);
    }
    return super.doProcess(record); // pass to next command in chain
  }
}
```
26. Use Case: Cloudera Search
    An integrated part of the Hadoop system:
    • One pool of data
    • One security framework
    • One set of system resources
    • One management interface
27. What is Cloudera Search?
    • Full-text, interactive search and faceted navigation
    • Batch, near-real-time, and on-demand indexing
    • Apache Solr integrated with CDH
      • Established, mature search with a vibrant community
      • Separate runtime, like MapReduce and Impala
      • Incorporated as part of the Hadoop ecosystem
    • Open source
      • 100% Apache, 100% Solr
      • Standard Solr APIs
28. ETL for Distributed Search on Apache Hadoop
    [diagram: same architecture as slide 3 – Flume, MapReduce, and HBase Index (ETL) pipelines on a Hadoop cluster feeding SolrCloud, queried by Hue UI, custom UIs, and custom apps]
29. Near Real Time ETL & Indexing with Flume
    Apache Solr and Apache Flume:
    • Data ingest at scale
    • Flexible extraction and mapping
    • Indexing at data ingest
    • Packaged as the Flume Morphline Solr Sink
    [diagram: log files flow through Flume agents, each running an indexer with a morphline, into HDFS and Solr]

    flume.conf:

```
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
```
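For context, a fuller (hypothetical) flume.conf wiring a source and channel into this sink might look like the sketch below; only the two solrSink lines come from the slide, the rest is standard Flume plumbing:

```properties
# Hypothetical sketch: tail a syslog file into the Morphline Solr Sink.
agent.sources = logSource
agent.channels = memCh
agent.sinks = solrSink

agent.sources.logSource.type = exec
agent.sources.logSource.command = tail -F /var/log/syslog
agent.sources.logSource.channels = memCh

agent.channels.memCh.type = memory

agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent.sinks.solrSink.channel = memCh
```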
30. Cloudera Manager Flume Morphline GUI
    [screenshot]
31. Scalable Batch ETL & Indexing
    Solr and MapReduce:
    • Flexible, scalable batch indexing
    • Start serving new indices with no downtime
    • On-demand indexing, cost-efficient re-indexing
    • Packaged as the MapReduceIndexerTool
    [diagram: files in HDFS are processed by indexers with morphlines into index shards served by Solr servers]

```
hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
```
32. MapReduceIndexerTool

```
hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
```

    [diagram: input files → extractors (mappers) → leaf shards (reducers, e.g. S0_0_0, S0_0_1, S0_1_0, S0_1_1, …) → root shards (mappers, e.g. S0, S1)]
    • The morphline runs inside the mapper
33. Near Real Time Indexing of Apache HBase
    [diagram: interactive load into HBase (backed by HDFS); the Lily HBase Indexer(s) with a morphline trigger on updates and index into a cluster of Solr servers for search]
    Large-scale tabular data with immediate access & updates + fast & flexible information discovery
34. Batch & Near Real Time ETL
    [diagram: tweets, log formats, social media, HTML, images, PDF flow through Flume (MorphlineSink, HdfsSink) into Solr and HDFS; the MapReduceIndexerTool (with a morphline) batch-indexes HDFS data; the Lily HBase Indexer (with a morphline) indexes HBase OLTP data; Hue UI, custom UIs, and custom apps query Solr; HDFS data is also consumed by Impala, HBase, Mahout, EDW, MR, etc.]
35. What’s Next
    • More work on Apache HBase integration
    • Integration into Apache Hive & Sqoop
    • Stream analytics
36. Conclusion
    • Cloudera Development Kit w/ Morphlines
      • Open source – ASL license
      • Version 0.7.0 shipping
      • Extensive documentation
      • Send your questions and feedback to the cdk-dev mailing list
    • Also ships integrated with Cloudera Search
    • Free QuickStart VM also available!