Slide 1: Hatayama Hideharu (Hide) Conference Memo
Slide 2: About the conference
- Held annually by Hadoop User Group Japan since 2009 (so this is the 5th)
- More than 1,200 attendees
- Sponsored by Recruit Technologies, Cloudera, SAS Institute Japan, Treasure Data, IBM Japan, and MapR Technologies
Slide 3: Keynote
Slide 4: Future of Data
by Doug Cutting, Chief Architect of Cloudera; creator of Lucene, Nutch, and Hadoop
- "Hardware costs will keep decreasing, and the value of data will keep increasing, as before."
- "Hadoop's functionality will keep growing, so it may become possible to run transactional workloads on Hadoop."
Q: I'd be glad if the Apache version became the standard...
A: We'll keep folding the good points back into the Apache version.
Q: I have high expectations for near-real-time Hadoop...
A: We're working on that, and if it matures we won't need to use Storm or the like.
Q: Are you still working on Lucene?
A: Sorry, I'm not. I'm not sure about the current state of Lucene...
Slide 5: Future of Spark
by Patrick Wendell, a main developer of Spark, working at Databricks
- Databricks Cloud (PaaS Spark cluster)
Slide 6: Sessions
Slide 7: BigQuery and the world after MapReduce
- MapReduce -> BigQuery (Dremel)
- GFS -> Colossus
- Storage: $0.026 / GB / month
- Queries: $5 / TB
https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
Slide 8: BigQuery and the world after MapReduce
- Small JOINs are executed as a Broadcast JOIN
  - One table should be < 8 MB
  - That table is sent to every shard
- Big JOINs are executed with shuffling
  - Both tables can be > 8 MB
  - The shuffler doesn't sort, it just hash-partitions
https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
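The two JOIN strategies above can be sketched in plain Python (a toy illustration with hypothetical in-memory tables, not BigQuery code): a broadcast join copies the whole small table to every shard, while a shuffle join hash-partitions both sides on the key, with no sorting, exactly as the slide notes.

```python
# Toy sketch of the two JOIN strategies (hypothetical data, not BigQuery).

def broadcast_join(small, big_shards):
    """Broadcast JOIN: the small table is copied to every shard,
    and each shard joins its local rows against it."""
    lookup = {}
    for k, v in small:
        lookup.setdefault(k, []).append(v)
    out = []
    for shard in big_shards:
        for k, v in shard:
            for sv in lookup.get(k, []):
                out.append((k, v, sv))
    return out

def shuffle_join(left, right, num_shards=4):
    """Big JOIN: both tables are hash-partitioned on the key
    (no sort, just hashing), then each shard joins locally."""
    shards = [([], []) for _ in range(num_shards)]
    for k, v in left:
        shards[hash(k) % num_shards][0].append((k, v))
    for k, v in right:
        shards[hash(k) % num_shards][1].append((k, v))
    out = []
    for ls, rs in shards:
        out.extend(broadcast_join(ls, [rs]))  # local join per shard
    return out
```

Both strategies produce the same rows; the difference is only in how data moves, which is why BigQuery can pick the strategy based on table size.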
Slide 9: BigQuery and the world after MapReduce
- BigQuery + Hadoop, BigQuery Streaming, UDFs in JavaScript, etc.
https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
Slide 10: Batch processing and Stream processing by SQL
Batch processing
- Hadoop/Hive
- hourly to weekly
- highest throughput, largest latency
Short-batch processing
- Presto, Impala, Drill
- per-second to hourly
- normal throughput, small latency
Stream processing
- Storm, Kafka, Esper, Norikra, Fluentd, ...
- per-second to hourly
- normal throughput, smallest latency
- once a query is registered, it runs repeatedly
Norikra
- internally uses Esper, but with NO SCHEMA
- not distributed
- 10K events / sec on 2 CPUs (8 cores)
http://www.slideshare.net/tagomoris/hcj2014-sql
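The "register once, runs repeatedly" model that distinguishes stream processing from batch can be sketched in a few lines of plain Python (a hypothetical mini-engine, not Norikra's or Esper's API): a query is registered before any data arrives, and it fires on every future event.

```python
# Minimal sketch of continuous-query registration (hypothetical API).

class TinyStream:
    def __init__(self):
        self.queries = []  # (name, predicate, result_box)

    def register(self, name, predicate):
        """Register a continuous query; it fires on every future event
        and keeps appending matches, unlike a one-shot batch query."""
        box = []
        self.queries.append((name, predicate, box))
        return box

    def push(self, event):
        for _, pred, box in self.queries:
            if pred(event):
                box.append(event)

s = TinyStream()
errors = s.register("errors", lambda e: e["level"] == "ERROR")
for e in [{"level": "INFO"}, {"level": "ERROR"}, {"level": "ERROR"}]:
    s.push(e)
# errors now holds the two ERROR events
```

A batch engine would instead run the same predicate once over stored data; the hybrid of both is exactly the architecture discussed in the "my impression" slides below.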
Slide 11: Deeper Understanding of Spark's Internals
Spark execution model:
1. Create a DAG of RDDs to represent the computation
2. Create a logical execution plan for the DAG
3. Schedule and execute individual tasks
- RDD: Resilient Distributed Dataset
- DAG: directed acyclic graph
- task: data + computation
- all tasks within a stage execute before moving on to the next stage
http://www.slideshare.net/hadoopconf/japanese-spark-internalssummit20143
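The stage-at-a-time model above can be illustrated with a toy planner (plain Python, not Spark internals): a chain of transformations is cut into stages at shuffle boundaries, and every task of one stage must finish before the next stage starts.

```python
# Toy illustration of cutting a DAG into stages at shuffle boundaries.

def plan_stages(ops):
    """ops: list of (name, needs_shuffle) pairs along one lineage chain.
    Returns a list of stages; ops within a stage can be pipelined,
    and a shuffle forces a new stage."""
    stages, current = [], []
    for name, needs_shuffle in ops:
        if needs_shuffle and current:
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

ops = [("textFile", False), ("map", False),
       ("reduceByKey", True), ("filter", False), ("collect", False)]
# plan_stages(ops) → [['textFile', 'map'],
#                     ['reduceByKey', 'filter', 'collect']]
```

This is why minimizing shuffled data (next slide's checklist) matters: each shuffle is a stage boundary and a full barrier.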
Slide 12: Deeper Understanding of Spark's Internals
Common issue checklist:
- Ensure enough partitions for concurrency (at least 2x the number of cores in the cluster, and at least 100 ms per task; commonly between 100 and 10,000 partitions)
- Minimize memory consumption (sorting, large keys in group-by)
- Minimize the amount of data shuffled
- Know the standard library
Memory problems:
- Symptoms: inexplicably bad performance; inexplicable executor/machine failures
- Diagnosis: set spark.executor.extraJavaOptions to -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError
- Resolution: increase spark.executor.memory; increase the number of partitions; re-evaluate the program structure
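The settings from the checklist above map onto spark-submit flags roughly as follows (a hedged sketch; the values and the job script name `my_job.py` are illustrative placeholders, not recommendations from the talk):

```shell
# Illustrative spark-submit invocation applying the slide's checklist:
# GC diagnostics, more executor memory, and more partitions.
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError" \
  --conf spark.default.parallelism=200 \
  my_job.py
```

The parallelism value would follow the slide's rule of thumb (at least 2x the cluster's core count, commonly 100 to 10,000).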
Slide 13: Spark on large Hadoop cluster and evaluation from the viewpoint of an enterprise Hadoop user and developer
http://www.slideshare.net/hadoopxnttdata/apache-spark-nttdatahcj2014
Slide 15: Spark on large Hadoop cluster and evaluation from the viewpoint of an enterprise Hadoop user and developer
Please check the evaluation results & summary in the slides.
Environment: 4K cores, 10 TB+ RAM, 10G network, Spark 1.0.0 on HDFS (CDH 5.0)
1. WordCount: scales linearly; reduce was not a bottleneck
2. SparkHdfsLR (logistic regression): cached (data is small): very fast from the 2nd and 3rd iterations on; non-cached (data is big): same as Hadoop
3. GroupByTest (large shuffle): also linear
4. PoC of a certain project: the key is memory management! To make the most of caching, move from a rich data format with simple tasks to a simple data format with complicated tasks.
http://www.slideshare.net/hadoopxnttdata/apache-spark-nttdatahcj2014
Slide 16: Treasure Data on the YARN
YARN (Yet Another Resource Negotiator):
- JobTracker -> ResourceManager + ApplicationMaster + JobHistoryServer
- TaskTracker -> NodeManager
http://www.slideshare.net/ryukobayashi/treasure-data-on-the-yarn-hadoop-conference-japan-2014
Slide 17: Treasure Data on the YARN
Many configuration changes are required to move from MRv1 to YARN:
- copy the HDFS configuration directory from a CDH VM or HDP VM, or
- use Ambari or Cloudera Manager, or
- use the hdp-configuration-utils.py script (http://goo.gl/L2hxyq)
Don't use Apache Hadoop 2.2.0, 2.3.0, or HDP 2.0 (2.2.0-based): there is a scheduler bug (deadlock).
These are OK: Apache Hadoop 2.4.1, CDH 5.0.2 (2.3.0-based with a patch), HDP 2.1 (2.4.0-based).
Slide 18: Practical Machine Learning: Innovation in Recommendation with Mahout and Solr
- h: a user's behavior; Ah: the users who did h
- User-oriented recommendation (*) cannot be pre-processed -> not efficient, slow
- Item-oriented recommendation (*) can be pre-processed by a nightly offline batch -> a new user's h can then be processed in near real time
http://www.slideshare.net/MapR_Japan/mahoutsolr-hadoop-conference-japan-2014
Slide 19: Practical Machine Learning: Innovation in Recommendation with Mahout and Solr
Multi-modal recommendation (or cross-recommendation), e.g. for movies:
- A: search queries, B: watched videos
- AtA: query recommendation
- BtB: video recommendation
- BtA: video-by-query recommendation
Example query: "Paco de Lucia" (Spanish guitarist)
- normal result: "Hombres de Paco" (a Spanish TV show)
- BtA result: Spanish classical guitar, flamenco, a guitar riff by Van Halen
Dithering: vary the recommendation results with random noise.
Reference: Practical Machine Learning ebook, http://www.mapr.com/practical-machine-learning
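The dithering idea above can be sketched in plain Python (an assumed formulation commonly used for dithering, not necessarily the one shown in the talk): perturb each item's log-rank with Gaussian noise, so the displayed order varies between page loads while top items mostly stay near the top.

```python
# Sketch of recommendation dithering: re-rank by log(rank) + noise.
import math
import random

def dither(ranked_items, epsilon=1.5, seed=None):
    """Re-rank items by log(rank) + Gaussian noise with standard
    deviation sqrt(log(epsilon)). epsilon == 1.0 leaves the order
    unchanged; larger epsilon mixes deeper items toward the top."""
    rng = random.Random(seed)
    sd = math.sqrt(math.log(epsilon))
    scored = [(math.log(rank + 1) + rng.gauss(0, sd), item)
              for rank, item in enumerate(ranked_items)]
    return [item for _, item in sorted(scored)]

dither(["a", "b", "c", "d", "e"], epsilon=2.0, seed=42)
```

The payoff is the one the slide implies: by occasionally surfacing lower-ranked items, the system gathers behavior data about items it would otherwise never show.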
Slide 20: My impression
Slide 21: My impression
Last year: hybrid or mixed architecture
"Batch Indexing and Near Real Time, Keeping Things Fast"
http://www.slideshare.net/lucenerevolution/batch-indexing-near-real-time-keeping-things-fast
Slide 22: My impression
This year: Lambda architecture
http://lambda-architecture.net/
Slide 23: My impression
Stream processing can't be replayed or recovered -> hybrid processing for fault tolerance:
- stream processing executes the queries in the normal case
- batch processing executes the recovery queries
Batch processing has large latency, and stream processing might not be accurate -> hybrid processing for latency reduction & accuracy:
- stream processing produces prompt reports
- batch processing produces the fixed (final) reports
To keep QCD (quality, cost, delivery) good, the same query is used for both stream and batch processing.
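The hybrid serving pattern above can be sketched in plain Python (a generic lambda-style merge, assumed for illustration): the batch layer provides an accurate view up to a high-watermark timestamp, the stream layer covers only events after it, and a query merges the two so nothing is double-counted.

```python
# Minimal sketch of a lambda-style merge of batch and stream views.

def merge_views(batch_view, batch_high_watermark, recent_events):
    """batch_view: {key: count} from the slow, accurate batch job.
    recent_events: (timestamp, key) pairs from the stream layer; only
    events newer than the batch high-watermark are counted, so the
    batch result is never double-counted."""
    view = dict(batch_view)
    for ts, key in recent_events:
        if ts > batch_high_watermark:
            view[key] = view.get(key, 0) + 1
    return view

batch = {"clicks": 100}           # batch job covered events up to t=100
events = [(95, "clicks"), (101, "clicks"), (102, "clicks")]
merge_views(batch, 100, events)   # counts only the two post-batch events
```

When the next batch run lands, its view simply replaces the old one and the high-watermark advances, which is how the batch layer "recovers" anything the stream layer got wrong.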