K-means in Hadoop


Transcript

  • 1. K-means in Hadoop: K-means, Spark, and Plan
  • 2. Outline: K-means, Spark, Plan
  • 3. K-means in Hadoop. Programs:
      • Kmeans.py: the k-means core algorithm
      • Wrapper.py: local driver that controls the k-means iterations
      • Generator.py: generates random data within a given range
      • Graph.py: plots the data
  • 4. Flowchart
  • 5. Kmeans.py: uses the "in-mapper combining" technique to implement combiner functionality inside every map task (note: this is a design pattern, not the separate combiner phase). It makes a discrete Combine step between Map and Reduce unnecessary. Normally it is not guaranteed that a combiner function will be called on every mapper, or that, if called, it will be called only once. With the in-mapper combiner design pattern, combiner-like key aggregation is guaranteed to occur in every mapper, instead of optionally in some mappers. A minimal sketch of the pattern follows.
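
The slide does not reproduce Kmeans.py itself, so the following is only a hedged illustration of the in-mapper combining pattern it describes. The input format, the centroids.txt file name, and the load_centroids helper are assumptions, not the original code.

```python
#!/usr/bin/env python
# Sketch of an in-mapper combining k-means mapper (Hadoop Streaming style).
# Assumes each input line is a comma-separated point and that the current
# centroids are shipped with the job (e.g. via -file centroids.txt).
import sys

def load_centroids(path="centroids.txt"):
    """Read one centroid per line: comma-separated coordinates."""
    with open(path) as f:
        return [tuple(float(x) for x in line.split(",")) for line in f if line.strip()]

def nearest(point, centroids):
    """Index of the closest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def main():
    centroids = load_centroids()
    # In-mapper combiner state: per-centroid partial sums and counts,
    # aggregated in memory instead of emitting one record per input point.
    sums = {i: [0.0] * len(centroids[0]) for i in range(len(centroids))}
    counts = {i: 0 for i in range(len(centroids))}

    for line in sys.stdin:
        if not line.strip():
            continue
        point = tuple(float(x) for x in line.split(","))
        idx = nearest(point, centroids)
        counts[idx] += 1
        for d, v in enumerate(point):
            sums[idx][d] += v

    # Emit one partial aggregate per centroid at the end of the map task.
    for idx in counts:
        if counts[idx]:
            print("%d\t%s\t%d" % (idx, ",".join(map(str, sums[idx])), counts[idx]))

if __name__ == "__main__":
    main()
```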
  • 6. Kmeans.py (continued): the aggregation is done entirely in memory, without touching disk, and it happens before any emit code is called. The trade-off is that the in-memory state is not automatically bounded, so the Python code has to keep its memory use under control (a bounded-buffer sketch follows). Results on a 3.6 GB test dataset:
      • Old version: 30+ min
      • Current version: 9+ min, with the reduce phase taking only 1-2 seconds, a significant saving.
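
One common way to keep the in-mapper combiner's memory bounded is to flush the partial aggregates once the buffer grows past a threshold. For k-means the key space is only the k centroid ids, so the guard rarely triggers, but the same idea applies to any in-mapper aggregation. This is a generic sketch; MAX_KEYS and the flush policy are assumptions, not the deck's code.

```python
# Generic sketch: bound in-mapper combiner memory by flushing partial
# sums/counts whenever the buffer holds too many distinct keys.
MAX_KEYS = 100000  # assumed tuning knob

def flush(buffer):
    """Emit and clear the buffered partial sums and counts."""
    for key, (vec, count) in buffer.items():
        print("%s\t%s\t%d" % (key, ",".join(map(str, vec)), count))
    buffer.clear()

def add(buffer, key, point):
    """Accumulate one point into the buffer, flushing if it gets too large."""
    vec, count = buffer.get(key, ([0.0] * len(point), 0))
    buffer[key] = ([a + b for a, b in zip(vec, point)], count + 1)
    if len(buffer) > MAX_KEYS:
        flush(buffer)
```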
  • 7. Generator.py
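
The Generator.py slide shows no code, so here is only a minimal sketch of what "generate random data within a range" could look like. The command-line interface, output format, and default range are assumptions.

```python
# Sketch of a Generator.py-style script: emit random 2-D points,
# one comma-separated point per line, within a given coordinate range.
import random
import sys

def generate(n_points, low, high, dims=2):
    """Yield n_points random points with coordinates in [low, high]."""
    for _ in range(n_points):
        yield [random.uniform(low, high) for _ in range(dims)]

if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 1000
    for point in generate(n, low=0.0, high=100.0):
        print(",".join("%f" % x for x in point))
```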
  • 8. Wrapper.py: the main controller for the k-means iterations. Functions:
      • Starts each map-reduce job
      • Ships the centroid data and program files to the map tasks
      • Checks whether the iterations have converged (a driver sketch follows this slide)
    Results:
      • Source data: 13 clusters
      • Target: 10 clusters -> 180+ iterations
      • Target: 13 clusters -> 7-8 iterations
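
Wrapper.py itself is not shown, so this is only a hedged sketch of such a driver: submit one Hadoop Streaming job per iteration and stop when the centroids stop moving. The streaming jar path, HDFS paths, file names, and threshold are all assumptions.

```python
# Sketch of a Wrapper.py-style driver for iterative k-means on Hadoop Streaming.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"  # assumed path
THRESHOLD = 1e-6  # assumed convergence threshold

def run_iteration(i):
    """Launch one Hadoop Streaming job for iteration i (paths are assumptions)."""
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", "kmeans/points",
        "-output", "kmeans/iter-%d" % i,
        "-mapper", "python Kmeans.py",
        "-reducer", "python Reduce.py",
        "-file", "Kmeans.py",
        "-file", "centroids.txt",
    ])

def shift(old, new):
    """Total squared movement of the centroids between two iterations."""
    return sum(sum((a - b) ** 2 for a, b in zip(o, n)) for o, n in zip(old, new))

# Main loop (fetching the new centroids back from HDFS is omitted here):
# for i in range(max_iters):
#     run_iteration(i)
#     new_centroids = fetch_centroids("kmeans/iter-%d" % i)
#     if shift(centroids, new_centroids) < THRESHOLD:
#         break
#     centroids = new_centroids
```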
  • 9. Processing (13 clusters), change per iteration: 110331.286264 -> 43648.070121 -> 22167.351291 -> 5853.008014 -> 552.292067 -> 8.202320 -> 0.000000 -> 0.000000
  • 10. Spark: in-memory, high performance, uses Scala.
      • Spark introduces in-memory distributed datasets; besides supporting interactive queries, it can also optimize iterative workloads.
      • Spark and Scala are tightly integrated: Scala can manipulate distributed datasets as easily as local collections.
      • Although Spark was created to support iterative jobs on distributed datasets, it is in practice a complement to Hadoop and can run in parallel over the Hadoop file system.
      • Scala is a multi-paradigm language that supports imperative, functional, and object-oriented language features in a fluent, comfortable way.
  • 11. Spark is designed for a specific type of cluster-computing workload: workloads that reuse a working dataset across parallel operations, such as machine-learning algorithms. Spark introduces the concept of in-memory cluster computing, in which datasets are cached in memory to cut access latency. A minimal caching sketch follows.
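
As a hedged illustration of the idea on this slide (and not part of the original deck), here is a minimal PySpark-style sketch of caching a working dataset in memory and reusing it across iterations, as iterative k-means does. It assumes a pyspark installation and an HDFS path chosen for illustration.

```python
# Minimal PySpark-style sketch: cache the points once, reuse them each iteration.
from pyspark import SparkContext

sc = SparkContext(appName="kmeans-sketch")

# Parse the points once and keep them in memory across iterations.
points = (sc.textFile("hdfs:///kmeans/points")
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())

# Each iteration then reuses the cached RDD instead of re-reading from disk,
# e.g. assignments = points.map(lambda p: (nearest(p, centroids), (p, 1)))
```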
  • 12. Other big-data analysis frameworks:
      • GraphLab: focuses on parallel implementations of machine-learning algorithms.
      • Storm: the "Hadoop of real-time processing", focused on stream processing and continuous computation (stream processing can emit results as it goes). Storm is written in Clojure (a Lisp dialect) but supports applications written in any language, such as Ruby and Python.
  • 13. Plan:
      • Get all 27 PCs running properly in the Hadoop cluster
      • Remote management: write shell scripts for power saving, task submission by anyone, etc.
      • Build Mesos, Spark, ZooKeeper, and HBase on our platform.
  • 14. Thanks