Hadoop Introduction II
               K-means && Python && Dumbo
Outline

• Dumbo
• K-means
• Python and Data Mining




Hadoop in Python
• Jython: Happy
• CPython:
     • Pydoop
           • Components (RecordReader, RecordWriter and Partitioner)
           • Get configuration, set counters and report status
           • CPython: can use any Python module
           • HDFS API
     • Hadoopy: another Cython-based wrapper
• Streaming:
     • Dumbo
     • Other small MapReduce wrappers

Hadoop in Python Extension

• Integration with Pipes (C++) + integration with libhdfs (C)
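
As an illustration of the libhdfs side, here is a minimal sketch using Pydoop's HDFS API (assuming Pydoop is installed and HDFS is running; the paths are placeholders):

    import pydoop.hdfs as hdfs

    # list a directory and read a file, both going through libhdfs
    for path in hdfs.ls("/user/hadoop"):
        print(path)

    with hdfs.open("/user/hadoop/input.txt") as f:
        print(f.read())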
Dumbo
•   Dumbo is a project that allows you to easily write and
    run Hadoop programs in Python. More generally, Dumbo can be
    considered a convenient Python API for writing MapReduce
    programs.
•   Advantages:
     • Easy: Dumbo strives to be as Pythonic as possible
     • Efficient: Dumbo programs communicate with Hadoop in a very
       efficient way by relying on typed bytes, a serialization
       mechanism that was added to Hadoop specifically with Dumbo
       in mind.
     • Flexible: we can extend it
     • Mature

Dumbo: Review WordCount




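The original slide showed the program as an image; this is the canonical Dumbo word count, reconstructed from the Dumbo documentation:

    def mapper(key, value):
        # value is one line of text; emit (word, 1) for every word
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        # values are all the counts emitted for this word
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer, combiner=reducer)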
Dumbo – Word Count




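A typical invocation, first to run the job and then to inspect its output (the Hadoop path and file names are placeholders):

    dumbo start wordcount.py -hadoop /usr/local/hadoop \
        -input wc_input.txt -output wc_output
    dumbo cat wc_output -hadoop /usr/local/hadoop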
Dumbo IP counts




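The code here was shown as an image; a sketch of counting IPs in Apache-style access logs (assuming the client IP is the first space-separated field of each line):

    def mapper(key, value):
        # the client IP is the first field of an access-log line
        yield value.split(" ")[0], 1

    def reducer(key, values):
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer, combiner=reducer)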
Dumbo IP counts




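A follow-up sketch that reads the job output back and prints the five most frequent IPs (dumbo.loadcode parses Dumbo's default text output format; the part file path is a placeholder):

    from dumbo import loadcode

    counts = dict(loadcode(open("ipcounts/part-00000")))
    for ip, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:5]:
        print(ip, n)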
K-means in Map-Reduce
•   Normal K-means:
     •     Inputs: a set of n d-dimensional points and a number of desired clusters k.

     •     Step 1: Randomly choose k of the n points as initial cluster centers.
     •     Step 2: Compute the distance from every point to each of the k centers
           and assign the point to the closest one.
     •     Step 3: Using this assignment of points to cluster centers, each cluster
           center is recalculated as the centroid of its member points.
     •     Step 4: This process is then iterated until convergence is reached.
     •     Final: points are reassigned to centers, and centroids recalculated, until
           the k cluster centers shift by less than some delta value.

•   K-means is a surprisingly parallelizable algorithm.

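For reference, the serial loop described above as a minimal NumPy sketch (not from the slides; empty clusters are not handled):

    import numpy as np

    def kmeans(points, k, delta=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: pick k random points as initial centers
        centers = points[rng.choice(len(points), size=k, replace=False)]
        while True:
            # Step 2: assign every point to its closest center
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each center as the centroid of its members
            new_centers = np.array([points[labels == i].mean(axis=0)
                                    for i in range(k)])
            # Step 4: iterate until the centers shift by less than delta
            if np.linalg.norm(new_centers - centers, axis=1).max() < delta:
                return new_centers, labels
            centers = new_centers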
K-means in Map-Reduce
•   Key points:
     • we want a scheme in which we can operate on each point
       in the data set independently.
     • only a small amount of data is shared (the cluster centers).
     • when we partition points among MapReduce nodes, we
       also distribute a copy of the cluster centers. This results
       in a small amount of data duplication, but it is minimal,
       and in this way each of the points can be operated on
       independently.




Hadoop Phase
• Map:
  • In: points in the data set
  • Output: a (ClusterID, Point) pair for each point,
    where ClusterID is the integer ID of the
    cluster whose center is closest to the point.




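A sketch of such a mapper in Dumbo (written as a class so the shared centers are loaded once per map task; the centers.txt name and its comma-separated format are assumptions):

    def dist2(a, b):
        # squared Euclidean distance between two points
        return sum((x - y) ** 2 for x, y in zip(a, b))

    class Mapper:
        def __init__(self):
            # shared data: each map task loads its own copy of the centers
            with open("centers.txt") as f:
                self.centers = [tuple(map(float, line.split(",")))
                                for line in f]

        def __call__(self, key, value):
            point = tuple(map(float, value.split(",")))
            cid = min(range(len(self.centers)),
                      key=lambda i: dist2(point, self.centers[i]))
            yield cid, point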
Hadoop Phase
• Reduce Phase:
   • In: (ClusterID, Point) pairs
• Operation:
   • the outputs of the map phase are grouped by
     ClusterID.
   • for each ClusterID, the centroid of the points
     associated with that ClusterID is calculated.
     • Output: (ClusterID, Centroid) pairs, which represent the
       newly calculated cluster centers.

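A matching reducer sketch; it streams over the grouped points, keeping only a running component-wise sum and a count:

    def reducer(key, points):
        total, n = None, 0
        for p in points:
            if total is None:
                total = list(p)
            else:
                for i, x in enumerate(p):
                    total[i] += x
            n += 1
        # the new center is the centroid (component-wise mean) of the members
        yield key, tuple(x / n for x in total)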
External Program
•   Each iteration of the algorithm is structured as a single
    MapReduce job.

•   After each pass, a driver program reads the output, determines
    whether convergence has been reached by calculating how far
    the cluster centers have moved, and, if not, runs
    another MapReduce job.




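A sketch of that driver loop; run_job and read_centers are hypothetical stand-ins for submitting the Dumbo job and parsing its (ClusterID, Centroid) output, and dist2 is the helper from the mapper sketch:

    def max_shift(old, new):
        # largest distance any center moved in this iteration
        return max(dist2(a, b) ** 0.5 for a, b in zip(old, new))

    def drive(centers, delta=1e-3, max_iters=20):
        for _ in range(max_iters):
            run_job(centers)              # hypothetical: one MapReduce pass
            new_centers = read_centers()  # hypothetical: read the new centers
            if max_shift(centers, new_centers) < delta:
                break
            centers = new_centers
        return centers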
Write in Dumbo
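
The three code slides here were images; the glue below turns the mapper and reducer sketches above into a runnable Dumbo program, with a typical submit command (paths are placeholders, and -file is assumed to be forwarded to Hadoop Streaming so centers.txt reaches every node):

    if __name__ == "__main__":
        import dumbo
        dumbo.run(Mapper, reducer)

    # dumbo start kmeans.py -hadoop /usr/local/hadoop \
    #     -input points.txt -output centers_out -file centers.txt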
Results




Next
• Write an n-iteration wrapper
• Optimize K-means
• Result Visualization with Python




Optimize
•   In the version above the mapping is one to one: for
    every point input, our mapper outputs a single
    point, which must be sorted and transferred to a
    reducer.

•   Instead, partial centroids for the clusters can be computed
    on the map nodes themselves (local calculation in the
    mapper!), and a weighted average of these partial centroids
    is then taken later by the reducer.

• We can use a Combiner!

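A sketch of that optimization: the mapper now emits (ClusterID, (point, 1)) pairs, a combiner folds each map task's points into a partial (sum, count), and the reducer takes the weighted average. Both aggregation steps share one helper; the nearest-center lookup is as in the mapper sketch above:

    def aggregate(values):
        # fold (vector, count) pairs into a single component-wise (sum, count)
        total, n = None, 0
        for p, c in values:
            if total is None:
                total = list(p)
            else:
                for i, x in enumerate(p):
                    total[i] += x
            n += c
        return total, n

    def combiner(key, values):
        # runs locally on the map node: one partial centroid per cluster
        total, n = aggregate(values)
        yield key, (tuple(total), n)

    def reducer(key, values):
        # weighted average of the partial centroids
        total, n = aggregate(values)
        yield key, tuple(x / n for x in total)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(Mapper, reducer, combiner=combiner)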
Dumbo Usage
•   Very easy to use
•   You can write your own code for Dumbo
•   Easy to debug
•   Simple commands




Python and Data Mining
• Books:
   • Scientific Computing with Python
   • Programming Collective Intelligence
   • Mining the Social Web
   • Natural Language Processing with Python
   • Think Stats: Python and Data Analysis




Python and Data Mining
• Tools
   • NumPy
   • SciPy
   • Orange (association rule mining with Orange)




Thanks




