Full stack analytics with Hadoop 2

Recent developments in Hadoop version 2 are pushing the system from the traditional, batch-oriented computational model based on MapReduce towards becoming a multi-paradigm, general-purpose platform. In the first part of this talk we will review and contrast three popular processing frameworks. In the second part we will look at how the ecosystem (e.g. Hive, Mahout, Spark) is making use of these new advancements. Finally, we will illustrate use cases of batch, interactive and streaming architectures to power traditional and "advanced" analytics applications.


  1. Full stack analytics with Hadoop 2. Trento, 2014-09-11. GABRIELE MODENA, LEARNING HADOOP 2
  2. CS.ML. Data Scientist; ML & Data Mining; Academia & Industry. Learning Hadoop 2 for Packt Publishing (together with Garry Turkington). TBD.
  3. This talk is about tools
  4. Your mileage may vary
  5. I will avoid benchmarks
  7. HDFS: Google paper (2003); distributed storage; block ops. Diagram: a Name Node and several Data Nodes.
  8. MapReduce: Google paper (2004); divide-and-conquer functional model; concepts from database research; batch workloads; aggregation operations (e.g. GROUP BY).
  9. Two phases: Map, Reduce
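The two phases can be tried out without a cluster. The snippet below is a minimal single-process sketch of map, shuffle and reduce for a word count. It is not Hadoop's API: `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every token of every input record."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group values by key (the framework does this between phases)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: aggregate the values of each key, much like GROUP BY with COUNT."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three steps into one job."""
    return reduce_phase(shuffle(map_phase(lines)))
```

Because each input line is mapped independently, the map step can be split across machines; only the shuffle has to move data between them, which is where the computation vs. communication tradeoff comes from.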
  10. Programs are chains of jobs
  12. All in all: great when records (jobs) are independent; composability monsters; computation vs. communication tradeoff; low-level API; tuning required.
  13. Computation with MapReduce: Crunch
  14. Higher-level abstractions, still geared towards batch loads
  15. Dremel (Impala, Drill): Google paper (2010); access blocks directly from data nodes (partition the fs namespace); columnar store (optimized for OLAP); appeals to database / BI crowds; ridiculously fast (as long as you have memory).
  16. Computation beyond MapReduce: iterative workloads; low-latency queries; real-time computation; high-level abstractions.
  17. Hadoop 2 stack, top to bottom: applications (Hive, Pig, Crunch, Cascading, etc.); processing frameworks: batch (MapReduce), interactive (Tez), in memory (Spark), streaming (Storm, Spark, Samza), graph (Giraph), HPC (MPI); resource management (YARN); HDFS.
  18. Tez (Dryad): Microsoft paper (2007); generalization of MapReduce as dataflow; express dependencies, I/O pipelining; low-level API for building DAGs; mainly an execution engine (Hive-on-Tez, Pig-on-Tez).
  20. Tez DAG for WordCount:

        DAG dag = new DAG("WordCount");
        dag.addVertex(tokenizerVertex)
           .addVertex(summerVertex)
           .addEdge(new Edge(tokenizerVertex, summerVertex,
                             edgeConf.createDefaultEdgeProperty()));
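The idea behind this DAG (vertices that process data, edges that carry it downstream) can be modelled in a few lines. The sketch below is plain Python and not the Tez API: `run_dag`, `tokenizer` and `summer` are hypothetical names, and `graphlib` assumes Python 3.9 or later.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Tuple


def run_dag(vertices: Dict[str, Callable[[List[Any]], Any]],
            edges: List[Tuple[str, str]],
            source_input: Any) -> Dict[str, Any]:
    """Run each vertex after all of its upstream vertices have finished.

    A vertex receives the list of its upstream outputs; a vertex without
    upstream edges receives [source_input].
    """
    upstream: Dict[str, List[str]] = {name: [] for name in vertices}
    for src, dst in edges:
        upstream[dst].append(src)

    results: Dict[str, Any] = {}
    # TopologicalSorter takes a mapping of node -> predecessors.
    for name in TopologicalSorter(upstream).static_order():
        inputs = [results[u] for u in upstream[name]] or [source_input]
        results[name] = vertices[name](inputs)
    return results


def tokenizer(inputs: List[Any]) -> List[str]:
    # One input: a list of text lines; emit every token.
    return [word for line in inputs[0] for word in line.split()]


def summer(inputs: List[Any]) -> Dict[str, int]:
    # One input: the token list; count occurrences per word.
    return dict(Counter(inputs[0]))


results = run_dag(
    vertices={"tokenizer": tokenizer, "summer": summer},
    edges=[("tokenizer", "summer")],
    source_input=["to be or not to be"],
)
```

A MapReduce job is the special case with exactly two vertices and one edge; a dataflow engine lets longer chains run as one graph instead of a sequence of separate jobs.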
  21. The complete Tez WordCount example:

        package org.apache.tez.mapreduce.examples;

        import java.io.IOException;
        import java.util.Map;
        import java.util.StringTokenizer;
        import java.util.TreeMap;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileAlreadyExistsException;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
        import org.apache.hadoop.security.UserGroupInformation;
        import org.apache.hadoop.util.GenericOptionsParser;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;
        import org.apache.hadoop.yarn.api.records.LocalResource;
        import org.apache.tez.client.TezClient;
        import org.apache.tez.dag.api.DAG;
        import org.apache.tez.dag.api.Edge;
        import org.apache.tez.dag.api.InputDescriptor;
        import org.apache.tez.dag.api.OutputDescriptor;
        import org.apache.tez.dag.api.ProcessorDescriptor;
        import org.apache.tez.dag.api.TezConfiguration;
        import org.apache.tez.dag.api.Vertex;
        import org.apache.tez.dag.api.client.DAGClient;
        import org.apache.tez.dag.api.client.DAGStatus;
        import org.apache.tez.mapreduce.committer.MROutputCommitter;
        import org.apache.tez.mapreduce.common.MRInputAMSplitGenerator;
        import org.apache.tez.mapreduce.hadoop.MRHelpers;
        import org.apache.tez.mapreduce.input.MRInput;
        import org.apache.tez.mapreduce.output.MROutput;
        import org.apache.tez.mapreduce.processor.SimpleMRProcessor;
        import org.apache.tez.runtime.api.Output;
        import org.apache.tez.runtime.library.api.KeyValueReader;
        import org.apache.tez.runtime.library.api.KeyValueWriter;
        import org.apache.tez.runtime.library.api.KeyValuesReader;
        import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfigurer;
        import org.apache.tez.runtime.library.partitioner.HashPartitioner;

        import com.google.common.base.Preconditions;

        public class WordCount extends Configured implements Tool {

          // "Map" side: split every input line into tokens and emit (word, 1).
          public static class TokenProcessor extends SimpleMRProcessor {
            IntWritable one = new IntWritable(1);
            Text word = new Text();

            @Override
            public void run() throws Exception {
              Preconditions.checkArgument(getInputs().size() == 1);
              Preconditions.checkArgument(getOutputs().size() == 1);
              MRInput input = (MRInput) getInputs().values().iterator().next();
              KeyValueReader kvReader = input.getReader();
              Output output = getOutputs().values().iterator().next();
              KeyValueWriter kvWriter = (KeyValueWriter) output.getWriter();
              while (kvReader.next()) {
                StringTokenizer itr = new StringTokenizer(kvReader.getCurrentValue().toString());
                while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  kvWriter.write(word, one);
                }
              }
            }
          }

          // "Reduce" side: sum the counts grouped under each word.
          public static class SumProcessor extends SimpleMRProcessor {
            @Override
            public void run() throws Exception {
              Preconditions.checkArgument(getInputs().size() == 1);
              MROutput out = (MROutput) getOutputs().values().iterator().next();
              KeyValueWriter kvWriter = out.getWriter();
              KeyValuesReader kvReader = (KeyValuesReader) getInputs().values().iterator().next()
                  .getReader();
              while (kvReader.next()) {
                Text word = (Text) kvReader.getCurrentKey();
                int sum = 0;
                for (Object value : kvReader.getCurrentValues()) {
                  sum += ((IntWritable) value).get();
                }
                kvWriter.write(word, new IntWritable(sum));
              }
            }
          }

          private DAG createDAG(FileSystem fs, TezConfiguration tezConf,
              Map<String, LocalResource> localResources, Path stagingDir,
              String inputPath, String outputPath) throws IOException {

            Configuration inputConf = new Configuration(tezConf);
            inputConf.set(FileInputFormat.INPUT_DIR, inputPath);
            InputDescriptor id = new InputDescriptor(MRInput.class.getName())
                .setUserPayload(MRInput.createUserPayload(inputConf,
                    TextInputFormat.class.getName(), true, true));

            Configuration outputConf = new Configuration(tezConf);
            outputConf.set(FileOutputFormat.OUTDIR, outputPath);
            OutputDescriptor od = new OutputDescriptor(MROutput.class.getName())
                .setUserPayload(MROutput.createUserPayload(
                    outputConf, TextOutputFormat.class.getName(), true));

            Vertex tokenizerVertex = new Vertex("tokenizer", new ProcessorDescriptor(
                TokenProcessor.class.getName()), -1, MRHelpers.getMapResource(tezConf));
            tokenizerVertex.addInput("MRInput", id, MRInputAMSplitGenerator.class);

            Vertex summerVertex = new Vertex("summer",
                new ProcessorDescriptor(
                    SumProcessor.class.getName()), 1, MRHelpers.getReduceResource(tezConf));
            summerVertex.addOutput("MROutput", od, MROutputCommitter.class);

            OrderedPartitionedKVEdgeConfigurer edgeConf = OrderedPartitionedKVEdgeConfigurer
                .newBuilder(Text.class.getName(), IntWritable.class.getName(),
                    HashPartitioner.class.getName(), null).build();

            DAG dag = new DAG("WordCount");
            dag.addVertex(tokenizerVertex)
                .addVertex(summerVertex)
                .addEdge(
                    new Edge(tokenizerVertex, summerVertex, edgeConf.createDefaultEdgeProperty()));
            return dag;
          }

          private static void printUsage() {
            System.err.println("Usage: " + " wordcount <in1> <out1>");
            ToolRunner.printGenericCommandUsage(System.err);
          }

          public boolean run(String inputPath, String outputPath, Configuration conf) throws Exception {
            System.out.println("Running WordCount");
            // conf and UGI
            TezConfiguration tezConf;
            if (conf != null) {
              tezConf = new TezConfiguration(conf);
            } else {
              tezConf = new TezConfiguration();
            }
            UserGroupInformation.setConfiguration(tezConf);
            String user = UserGroupInformation.getCurrentUser().getShortUserName();

            // staging dir
            FileSystem fs = FileSystem.get(tezConf);
            String stagingDirStr = Path.SEPARATOR + "user" + Path.SEPARATOR
                + user + Path.SEPARATOR + ".staging" + Path.SEPARATOR
                + Path.SEPARATOR + Long.toString(System.currentTimeMillis());
            Path stagingDir = new Path(stagingDirStr);
            tezConf.set(TezConfiguration.TEZ_AM_STAGING_DIR, stagingDirStr);
            stagingDir = fs.makeQualified(stagingDir);

            // No need to add jar containing this class as assumed to be part of
            // the tez jars.

            // TEZ-674 Obtain tokens based on the Input / Output paths. For now assuming staging dir
            // is the same filesystem as the one used for Input/Output.
            TezClient tezSession = new TezClient("WordCountSession", tezConf);
            tezSession.start();

            DAGClient dagClient = null;

            try {
              if (fs.exists(new Path(outputPath))) {
                throw new FileAlreadyExistsException("Output directory "
                    + outputPath + " already exists");
              }

              Map<String, LocalResource> localResources =
                  new TreeMap<String, LocalResource>();

              DAG dag = createDAG(fs, tezConf, localResources,
                  stagingDir, inputPath, outputPath);

              tezSession.waitTillReady();
              dagClient = tezSession.submitDAG(dag);

              // monitoring
              DAGStatus dagStatus = dagClient.waitForCompletionWithAllStatusUpdates(null);
              if (dagStatus.getState() != DAGStatus.State.SUCCEEDED) {
                System.out.println("DAG diagnostics: " + dagStatus.getDiagnostics());
                return false;
              }
              return true;
            } finally {
              fs.delete(stagingDir, true);
              tezSession.stop();
            }
          }

          @Override
          public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

            if (otherArgs.length != 2) {
              printUsage();
              return 2;
            }
            WordCount job = new WordCount();
            job.run(otherArgs[0], otherArgs[1], conf);
            return 0;
          }

          public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new WordCount(), args);
            System.exit(res);
          }
        }
  22. Spark: AMPLab paper (2010), builds on Dryad; Resilient Distributed Datasets (RDDs); high-level API (and a REPL); also an execution engine (Hive-on-Spark, Pig-on-Spark).
  23. Word count with the Spark Java API:

        JavaRDD<String> file = spark.textFile("hdfs://infile.txt");

        JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
          public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
        });

        // mapToPair is the Spark 1.x name; earlier releases used map here.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
          public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
        });

        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer a, Integer b) { return a + b; }
        });

        counts.saveAsTextFile("hdfs://outfile.txt");
  24. Rule of thumb: avoid spill-to-disk; Spark and Tez don't mix well; join on 50+ TB = Hive+Tez, MapReduce; direct access to API (in memory) = Spark; OLAP = Hive+Tez, Cloudera Impala.
  25. Good stuff. So what?
  26. The data <adjective>. Diagram labels: S3, MySQL, NFS, …; HDFS; workflow coordination; ingestion; metadata; processing.
  27. Analytics on Hadoop 2: batch & interactive; data warehousing & computing; dataset size and velocity; integrations with existing tools; distributions will constrain your stack.
  28. Use cases: data warehousing; exploratory data analysis; stream processing; predictive analytics.
  29. Data warehousing: data ingestion; pipelines; transform and enrich (ETL) queries (batch); low-latency (presentation) queries (interactive); interoperable data formats and metadata; workflow orchestration.
  30. Collection and ingestion: $ hadoop distcp
  31. Once data is in HDFS
  32. Apache Hive: HiveQL; data stored on HDFS; metadata kept in MySQL (metastore); metadata exposed to third parties (HCatalog); suitable for both interactive and batch queries.
  33. set hive.execution.engine=tez
  34. set hive.execution.engine=mr
  35. The nature of Hive tables: CREATE TABLE (and LOAD DATA) produce metadata; schema based on the data "as it has already arrived"; data files underlying a Hive table are no different from any other file on HDFS; primitive types behave as in Java.
  36. Data formats: record oriented (Avro, text); column oriented (Parquet, ORC).
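The difference between the two layouts is easiest to see on a tiny example. The sketch below uses made-up rows (not data from the talk) and a hypothetical helper `to_columns`; it only illustrates the layouts and says nothing about how Avro, Parquet or ORC encode bytes on disk.

```python
from typing import Any, Dict, List

# Record-oriented layout: one complete record after another (as in Avro or text).
rows: List[Dict[str, Any]] = [
    {"tweet_id": "1", "user_id": "10", "retweeted": False},
    {"tweet_id": "2", "user_id": "11", "retweeted": True},
    {"tweet_id": "3", "user_id": "10", "retweeted": True},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Column-oriented layout: all values of a field are stored together."""
    return {name: [r[name] for r in records] for name in records[0]}


columns = to_columns(rows)

# An aggregate over one field has to walk every record in the row layout...
retweets_from_rows = sum(1 for r in rows if r["retweeted"])
# ...but reads a single column in the columnar layout.
retweets_from_columns = sum(columns["retweeted"])
```

Record-oriented files suit ingestion and whole-record processing; column-oriented files suit analytical queries that touch a few fields of many records.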
  37. Text (tab separated):

        create external table tweets (
          created_at string,
          tweet_id string,
          text string,
          in_reply_to string,
          retweeted boolean,
          user_id string,
          place_id string
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        STORED AS TEXTFILE
        LOCATION '$input';

        $ hadoop fs -cat /data/tweets.tsv
        2014-03-12T17:34:26.000Z	443802208698908672	Oh & I'm chuffed for @GeraintThomas86, doing Wales proud in yellow!! #ParisNice #Cymru	NULL	223224878	NULL
        2014-03-12T17:34:26.000Z	443802208706908160	Stalker48 Kembali Lagi Cek Disini 236	NULL	629845435	NULL
        2014-03-12T17:34:26.000Z	443802208728268800	@Piconn ou melhor, eu era :c mudei	NULL	255768055	NULL
        2014-03-12T17:34:26.000Z	443802208698912768	I swear Ryan's always in his own world. He's always like 4 hours behind everyone else.	NULL	2379282889	NULL
        2014-03-12T17:34:26.000Z	443802208702713856	@maggersforever0 lmfao you gotta see this, its awesome	NULL	355858832	NULL
        2014-03-12T17:34:26.000Z	443802208698896384	Crazy... G4QRMSKGkh	NULL	106439395	NULL
  38. SELECT COUNT(*) FROM tweets
  39. Apache Avro: record oriented; migrations (forward, backward); schema on write; interoperability.

        {
          "namespace": "com.mycompany.avrotables",
          "name": "tweets",
          "type": "record",
          "fields": [
            {"name": "created_at", "type": "string", "doc": "date_time of tweet"},
            {"name": "tweet_id_str", "type": "string"},
            {"name": "text", "type": "string"},
            {"name": "in_reply_to", "type": ["string", "null"]},
            {"name": "is_retweeted", "type": ["string", "null"]},
            {"name": "user_id", "type": "string"},
            {"name": "place_id", "type": ["string", "null"]}
          ]
        }

        CREATE TABLE tweets
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
        WITH SERDEPROPERTIES (
          'avro.schema.url'='hdfs:///schema/avro/tweets_avro.avsc'
        )
        STORED AS
          INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

        insert into table tweets select * from tweets_ext;
  40. Some thoughts on schemas: only make additive changes; think about schema distribution; manage schema versions explicitly.
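Why additive changes are the safe kind can be shown with a small sketch. The function below mimics the spirit of reader/writer schema resolution (a field missing from old data is filled from the new schema's default); it is not the Avro library, and `resolve`, `reader_v2` and the added `lang` field are hypothetical.

```python
from typing import Any, Dict, List


def resolve(record: Dict[str, Any], reader_fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Project a record written with an older schema onto a newer reader schema.

    Fields missing from the record are filled from the reader's default;
    a missing field without a default is an error.
    """
    resolved: Dict[str, Any] = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


# Version 2 of a hypothetical schema adds "lang" with a default: an additive change.
reader_v2 = [
    {"name": "tweet_id_str", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "lang", "type": ["null", "string"], "default": None},
]
old_record = {"tweet_id_str": "1", "text": "hello"}
upgraded = resolve(old_record, reader_v2)
```

Adding a field with a default keeps every old file readable. Adding one without a default, or renaming an existing field, makes `resolve` fail on old data, which is the kind of breakage that explicit schema versioning is meant to catch early.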
  41. Parquet: ad hoc use case; Cloudera Impala's default file format; execution engine agnostic; HIVE-5783; let it handle block size.

        create table tweets (
          created_at string,
          tweet_id string,
          text string,
          in_reply_to string,
          retweeted boolean,
          user_id string,
          place_id string
        )
        STORED AS PARQUET;

        insert into table tweets select * from tweets_ext;
  42. If possible, use both
  43. Table optimization: create tables with workloads in mind; partitions; bucketing; join strategies.
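Partitions and buckets both decide which file a row lands in. The sketch below is a simplified model under stated assumptions: partitioning by day and bucketing by `user_id` are choices made for this illustration, `NUM_BUCKETS` is arbitrary, and CRC32 stands in for a stable hash (it is not the hash function Hive uses).

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_BUCKETS = 4  # hypothetical bucket count


def partition_dir(created_at: str) -> str:
    """Partition by day: rows of one day share a directory, so a query
    filtering on that day can skip all the others."""
    return f"day={created_at[:10]}"


def bucket_of(user_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Bucket by a stable hash of the key (CRC32 is a stand-in)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def layout(rows: List[Tuple[str, str]]) -> Dict[Tuple[str, int], List[Tuple[str, str]]]:
    """Group (created_at, user_id) rows by (partition directory, bucket)."""
    files = defaultdict(list)
    for created_at, user_id in rows:
        key = (partition_dir(created_at), bucket_of(user_id))
        files[key].append((created_at, user_id))
    return dict(files)
```

Since rows with the same `user_id` always fall into the same bucket number, two tables bucketed the same way on the join key can be joined bucket by bucket rather than by shuffling everything.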
  44. Plenty of tunables:

        -- partitions
        SET hive.exec.dynamic.partition=true;
        SET hive.exec.dynamic.partition.mode=nonstrict;
        SET hive.exec.max.dynamic.partitions.pernode=10000;
        SET hive.exec.max.dynamic.partitions=100000;
        SET hive.exec.max.created.files=1000000;

        -- merge small files
        SET hive.merge.size.per.task=256000000;
        SET hive.merge.mapfiles=true;
        SET hive.merge.mapredfiles=true;
        SET hive.merge.smallfiles.avgsize=16000000;

        -- compression
        SET mapred.output.compress=true;
        SET mapred.output.compression.type=BLOCK;
        SET;
        SET;
  45. Apache Oozie: data pipelines; workflow execution and coordination; time and availability based execution; configuration over code; MapReduce centric; actions: Hive, Pig, fs, shell, Sqoop.

        <workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
          ...
          <action name="[NODE-NAME]">
            <hive xmlns="uri:oozie:hive-action:0.2">
              <job-tracker>[JOB-TRACKER]</job-tracker>
              <name-node>[NAME-NODE]</name-node>
              <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
              </prepare>
              <job-xml>[HIVE SETTINGS FILE]</job-xml>
              <configuration>
                <property>
                  <name>[PROPERTY-NAME]</name>
                  <value>[PROPERTY-VALUE]</value>
                </property>
                ...
              </configuration>
              <script>[HIVE-SCRIPT]</script>
              <param>[PARAM-VALUE]</param>
              ...
              <param>[PARAM-VALUE]</param>
              <file>[FILE-PATH]</file>
              ...
              <archive>[FILE-PATH]</archive>
              ...
            </hive>
            <ok to="[NODE-NAME]"/>
            <error to="[NODE-NAME]"/>
          </action>
          ...
        </workflow-app>
  46. EDA: luminosity in xkcd comics (courtesy of R-bloggers)
  47. Sample the dataset
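One way to sample a dataset that is too large to load is reservoir sampling, which keeps a uniform sample in a single pass. The talk does not name a sampling technique, so this is an assumption made for illustration; `reservoir_sample` is a hypothetical helper.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k / (i + 1)
    return sample
```

For example, `reservoir_sample(open("tweets.tsv"), 1000)` would keep 1000 lines of a local file while reading it once; the file name here is only a placeholder.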
  48. Use Hive-on-Tez, Impala
  49. Spark & IPython Notebook: works with Avro, Parquet etc.; move computation close to data; NumPy, scikit-learn, matplotlib; setup can be tedious.

        from pyspark import SparkContext

        sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
  50. Stream processing: statistics in real time; data feeds; machine generated (sensor data, logs); predictive analytics.
  51. Several niches: low latency (Storm, S4); persistency and resiliency (Samza); apply complex logic (Spark Streaming); type of message stream (Kafka).
  52. Apache Samza: Kafka for streaming; YARN for resource management and execution; Samza API for processing; sweet spot: seconds, minutes. Stack, top to bottom: Samza API, YARN, Kafka.
  53. public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator)
  54. public void window(MessageCollector collector, TaskCoordinator coordinator)
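The two callbacks divide the work: `process` runs once per message, `window` runs periodically. The sketch below mirrors that split in plain Python with a word count; it is not the Samza API, and `WordCountTask` and the `emit` callback are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List


class WordCountTask:
    """Counts words per window, mirroring the process()/window() split only."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def process(self, message: str) -> None:
        # Called once per incoming message: update in-memory state only.
        self.counts.update(message.split())

    def window(self, emit: Callable[[Dict[str, int]], None]) -> None:
        # Called periodically: emit the aggregate, then reset the state.
        emit(dict(self.counts))
        self.counts.clear()


emitted: List[Dict[str, int]] = []
task = WordCountTask()
for msg in ["a b", "b c"]:
    task.process(msg)
task.window(emitted.append)   # first window closes
task.process("c")
task.window(emitted.append)   # second window closes
```

Keeping `process` cheap and pushing output into `window` is what makes this style fit aggregates over seconds or minutes.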
  55. Bootstrap streams: Samza can consume messages from multiple streams; rewind on historical data does not preserve ordering; if a task has any bootstrap streams defined, it will read these streams until they are fully processed.
  56. Predictive modelling
  57. Learning from data: predictive model = statistical learning; simple = parallelizable; garbage in = garbage out.
  58. Couple of things we can do: 1. parameter tuning; 2. feature engineering; 3. learn on all data.
  59. Train against all data: ensemble methods (cooperative and competitive); avoid multi-pass / iterations; apply models to live data; keep models up to date.
  60. Off the shelf: Apache Mahout (MapReduce, Spark); MLlib (Spark); Cascading Pattern (MapReduce, Tez, Spark).
  61. Apache Mahout 0.9: once the default solution for ML with MapReduce; quality may vary; good components are really good; is it a library? A framework? A recommendation system?
  62. The good: the go-to if you need a recommendation system; SGD (optimization); Random Forest (classification/regression); SVD (feature engineering); ALS (collaborative filtering).
  63. The puzzling: SVM?; model updates are implementation specific; feature encoding and input format are often model specific.
  64. Apache Mahout trunk: moving away from MapReduce; Spark + Scala DSL = new classes of algorithms; major code cleanup.
  65. It needs major infrastructure work around it
  66. batch + streaming
  67. There's a buzzword for that
  68. Wrap up
  69. With Hadoop 2: cluster as an operating system; YARN, mostly; multi-paradigm, better interop; same system, different tools, multiple use cases; batch + interactive.
  70. This said: ops is where a lot of time goes; building clusters is hard; distro fragmentation; bleeding-edge rush; heavy lifting needed.
  71. That's all, folks
  72. Thanks for having me
  73. Let's discuss