K-means in Hadoop
              K-means && Spark && Plan
Outline

• K-means
• Spark
• Plan




K-means in Hadoop
• Programs:
   • Kmeans.py: k-means core algorithm
   • Wrapper.py: local controller for the k-means iterations
   • Generator.py: generates random data within a given range
   • Graph.py: plots the data




Flowchart




Kmeans.py
• Uses the “in-mapper combining” technique to implement combiner
  functionality inside every map task. Note: this is not the separate
  combiner phase.
• It makes a discrete combine step between map and reduce unnecessary.
  Normally it is not guaranteed that a combiner function will be called
  on every mapper, or that, if called, it will be called only once.
• With the in-mapper combining design pattern, combiner-like key
  aggregation is guaranteed to occur in every mapper, instead of
  optionally in some mappers (see the mapper sketch below).
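As a concrete illustration of the pattern, here is a minimal in-mapper combining mapper for k-means in the Hadoop Streaming style. It is only a sketch: the centroid file name, the input format (one comma-separated point per line) and the output format are assumptions, not the actual Kmeans.py interface.

```python
#!/usr/bin/env python
# Minimal sketch of an in-mapper combining mapper for k-means
# (illustrative; "centroids.txt" and the record formats are assumptions,
# not the real Kmeans.py interface).
import sys

def load_centroids(path="centroids.txt"):
    # One centroid per line: "id<TAB>x,y,..."
    centroids = {}
    with open(path) as f:
        for line in f:
            cid, coords = line.strip().split("\t")
            centroids[cid] = [float(v) for v in coords.split(",")]
    return centroids

def nearest(point, centroids):
    # Return the id of the closest centroid (squared Euclidean distance).
    return min(centroids,
               key=lambda cid: sum((p - c) ** 2
                                   for p, c in zip(point, centroids[cid])))

def main():
    centroids = load_centroids()
    sums = {}    # centroid id -> per-dimension partial sums
    counts = {}  # centroid id -> number of points assigned

    for line in sys.stdin:
        point = [float(v) for v in line.strip().split(",")]
        cid = nearest(point, centroids)
        if cid not in sums:
            sums[cid] = [0.0] * len(point)
            counts[cid] = 0
        for i, v in enumerate(point):
            sums[cid][i] += v
        counts[cid] += 1

    # Emit once per centroid after all input is consumed:
    # this is the "in-mapper combining" step.
    for cid in sums:
        print("%s\t%d\t%s" % (cid, counts[cid],
                              ",".join(str(v) for v in sums[cid])))

if __name__ == "__main__":
    main()
```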

Kmeans.py
• The aggregation is done entirely in memory, without touching disk,
  and it happens before any emit call is made.
• However, this gives no protection against running out of memory; the
  Python code itself has to keep the in-memory state under control.
• Results (3.6 GB test dataset)
   • Old: 30+ min
   • Current: 9+ min; the reduce phase now takes only 1~2 seconds
     (see the reducer sketch below), saving significant time.
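Because each mapper emits at most one pre-aggregated record per centroid, the reducer only has to merge a handful of partial sums and divide, which is why the reduce phase finishes in a second or two. A minimal reducer sketch matching the hypothetical mapper format above (again, not the actual Kmeans.py code):

```python
#!/usr/bin/env python
# Minimal reducer sketch for the mapper above (illustrative).
# Input records: "centroid_id<TAB>count<TAB>sum_x,sum_y,..."
import sys

totals = {}  # centroid id -> [count, per-dimension sums]

for line in sys.stdin:
    cid, count, coords = line.strip().split("\t")
    partial = [float(v) for v in coords.split(",")]
    if cid not in totals:
        totals[cid] = [0, [0.0] * len(partial)]
    totals[cid][0] += int(count)
    for i, v in enumerate(partial):
        totals[cid][1][i] += v

# New centroid = total sum / total count for each dimension.
for cid, (count, sums) in totals.items():
    new_centroid = [s / count for s in sums]
    print("%s\t%s" % (cid, ",".join(str(v) for v in new_centroid)))
```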
Generator.py
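A minimal sketch of a generator in the spirit of Generator.py, producing random points uniformly within a given range; the command-line interface and CSV output format are assumptions, not the real script.

```python
#!/usr/bin/env python
# Minimal sketch of a random-data generator (parameters are assumptions).
import random
import sys

def generate(n_points, low, high, dims=2, out=sys.stdout):
    # Write n_points random points, one CSV line per point,
    # each coordinate uniformly sampled from [low, high].
    for _ in range(n_points):
        point = [random.uniform(low, high) for _ in range(dims)]
        out.write(",".join("%f" % v for v in point) + "\n")

if __name__ == "__main__":
    # e.g.  python generator.py 1000000 0 100 > points.csv
    n, low, high = int(sys.argv[1]), float(sys.argv[2]), float(sys.argv[3])
    generate(n, low, high)
```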




Wrapper.py
• Main controller for the k-means iterations
• Functions:
   • Start the MapReduce job
   • Ship the base data and program files to the map phase
   • Check whether the iterations have converged (see the driver-loop
     sketch below)
• Results:
   • Source: 13 clusters
   • Target: 10 clusters -> 180+ iterations
   • Target: 13 clusters -> 7-8 iterations
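A minimal sketch of such a local driver loop, submitting one Hadoop Streaming job per iteration and stopping once the centroid shift is small. The streaming jar path, file names and the convergence threshold are all assumptions, not the real Wrapper.py interface.

```python
#!/usr/bin/env python
# Minimal sketch of a k-means driver loop in the spirit of Wrapper.py.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"
THRESHOLD = 0.0001

def run_iteration(it):
    # Submit one MapReduce iteration, shipping the current centroids
    # and the mapper/reducer scripts with the job (-file).
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", "points",
        "-output", "centroids_%d" % it,
        "-mapper", "kmeans_mapper.py",
        "-reducer", "kmeans_reducer.py",
        "-file", "kmeans_mapper.py",
        "-file", "kmeans_reducer.py",
        "-file", "centroids.txt",
    ])

def read_centroids(path):
    # centroid id -> coordinate list, in the "id<TAB>x,y,..." format.
    cents = {}
    with open(path) as f:
        for line in f:
            cid, coords = line.strip().split("\t")
            cents[cid] = [float(v) for v in coords.split(",")]
    return cents

def total_shift(old, new):
    # Sum of squared coordinate changes across all centroids, similar in
    # spirit to the per-iteration numbers on the "Processing" slide.
    return sum((a - b) ** 2
               for cid in new if cid in old
               for a, b in zip(old[cid], new[cid]))

def main():
    it = 0
    while True:
        old = read_centroids("centroids.txt")
        run_iteration(it)
        # Pull the reducer output from HDFS back into the local file.
        with open("centroids.txt", "w") as out:
            subprocess.check_call(
                ["hadoop", "fs", "-cat", "centroids_%d/part-*" % it],
                stdout=out)
        new = read_centroids("centroids.txt")
        shift = total_shift(old, new)
        print("iteration %d: shift = %f" % (it, shift))
        if shift < THRESHOLD:
            break
        it += 1

if __name__ == "__main__":
    main()
```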

Processing (13 clusters)

•   110331.286264 -> 43648.070121
•   43648.070121 -> 22167.351291
•   22167.351291 -> 5853.008014
•   5853.008014 -> 552.292067
•   552.292067 -> 8.202320
•   8.202320 -> 0.000000
•   0.000000 -> 0.000000



Spark
• In-memory, high performance, implemented in Scala
• Spark introduces in-memory distributed datasets; in addition to
  providing interactive queries, it can also optimize iterative
  workloads.
• Spark and Scala are tightly integrated, so Scala can manipulate
  distributed datasets as easily as local collection objects.
• Although Spark was created to support iterative jobs on distributed
  datasets, it is actually a complement to Hadoop and can run in
  parallel on the Hadoop file system.
•   Scala is a multi-paradigm language that supports, in a fluent and
    comfortable way, language features associated with imperative,
    functional, and object-oriented languages.



Spark
• Spark is designed for a specific type of cluster-computing workload:
  workloads that reuse a working dataset across parallel operations
  (for example, machine-learning algorithms).
• Spark introduces the concept of in-memory cluster computing, in which
  datasets can be cached in memory to reduce access latency (see the
  sketch below).
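To make the caching idea concrete, here is a minimal sketch of an iterative k-means loop over a cached dataset, using Spark's Python API rather than the Scala API mentioned in the slides; the file path, k, and the iteration budget are assumptions for illustration only.

```python
# Minimal sketch: cache the working set in memory and reuse it across
# iterations (illustrative; paths and parameters are assumptions).
from pyspark import SparkContext

sc = SparkContext(appName="kmeans-sketch")

# Parse the points once and keep them in memory for every iteration.
points = (sc.textFile("hdfs:///user/demo/points.csv")
            .map(lambda line: [float(v) for v in line.split(",")])
            .cache())

def closest(point, centers):
    # Index of the nearest center by squared Euclidean distance.
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))

k = 13
centers = points.take(k)          # naive initialization
for _ in range(20):               # fixed iteration budget for the sketch
    assigned = points.map(lambda p: (closest(p, centers), (p, 1)))
    sums = assigned.reduceByKey(
        lambda a, b: ([x + y for x, y in zip(a[0], b[0])], a[1] + b[1]))
    # New centers = per-cluster sums divided by per-cluster counts.
    centers = [[s / n for s in total]
               for _, (total, n) in sorted(sums.collect())]

sc.stop()
```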




Other big-data analysis frameworks
• GraphLab: focuses on parallel implementations of machine-learning
  algorithms
• Storm: a "Hadoop for real-time processing", mainly focused on stream
  processing and continuous computation (stream processing that can
  yield the results of a computation). Storm is written in Clojure (a
  dialect of Lisp), but it supports applications written in any
  language (such as Ruby and Python).




Plan
• Get all 27 PCs running properly in Hadoop
• Remote management: write shell scripts for power saving, job
  submission from any user, etc.
• Deploy Mesos, Spark, ZooKeeper, and HBase on our platform.




Thanks



