Transcript

  • 1. Hadoop Introduction II: K-means && Python && Dumbo
  • 2. Outline
    • Dumbo
    • K-means
    • Python and Data Mining
  • 3. Hadoop in Python
    • Jython: Happy
    • CPython (can use any C extension module):
      • Pydoop: pluggable components (RecordReader, RecordWriter and Partitioner); get the configuration, set counters and report status; HDFS API
      • Hadoopy: another Cython-based wrapper
    • Streaming:
      • Dumbo
      • other small MapReduce wrappers
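
    For contrast with the wrappers listed above, a bare Hadoop Streaming mapper looks like the following minimal sketch: Streaming pipes records through stdin/stdout as tab-separated lines, and this protocol is what Dumbo and the other wrappers hide from you.

        #!/usr/bin/env python
        # Bare Hadoop Streaming mapper: read raw lines on stdin, emit
        # tab-separated (key, value) records on stdout.
        import sys

        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word, 1))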
  • 4. Hadoop in Python (diagram slide; image not preserved in the transcript)
  • 5. Hadoop in Python extensions: integration with Pipes (C++) and integration with libhdfs (C)
  • 6. Dumbo
    • Dumbo is a project that allows you to easily write and run Hadoop programs in Python. More generally, Dumbo can be considered a convenient Python API for writing MapReduce programs.
    • Advantages:
      • Easy: Dumbo strives to be as Pythonic as possible.
      • Efficient: Dumbo programs communicate with Hadoop very efficiently by relying on typed bytes, a nifty serialisation mechanism that was added to Hadoop specifically with Dumbo in mind.
      • Flexible: we can extend it.
      • Mature
  • 7. Dumbo: Review WordCount (code slide; image not preserved)
  • 8. Dumbo – Word Count (code slide; image not preserved)
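
    The word-count code on slides 7-8 did not survive the transcript. The canonical Dumbo word count, as given in the Dumbo documentation, is:

        def mapper(key, value):
            # value is one line of input text
            for word in value.split():
                yield word, 1

        def reducer(key, values):
            # values iterates over every count emitted for this word
            yield key, sum(values)

        if __name__ == "__main__":
            import dumbo
            dumbo.run(mapper, reducer, combiner=reducer)

    It is submitted with a command along the lines of "dumbo start wordcount.py -hadoop /path/to/hadoop -input in.txt -output out" (the paths here are placeholders).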
  • 9. Dumbo IP counts (code slide; image not preserved)
  • 10. Dumbo IP counts (code slide; image not preserved)
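
    The IP-count code on slides 9-10 is likewise missing. A sketch of its plausible shape, assuming access-log input lines whose first whitespace-separated field is the client IP:

        def mapper(key, value):
            # value is one access-log line; field 0 is assumed to be the client IP
            ip = value.split()[0]
            yield ip, 1

        def reducer(key, values):
            yield key, sum(values)

        if __name__ == "__main__":
            import dumbo
            dumbo.run(mapper, reducer, combiner=reducer)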
  • 11. K-means in Map-Reduce
    • Normal K-means:
      • Inputs: a set of n d-dimensional points and a number of desired clusters k.
      • Step 1: randomly choose k of the n points as the initial cluster centers.
      • Step 2: compute the distance from every point to each of the k centers and assign each point to its closest center.
      • Step 3: using this assignment of points to cluster centers, recalculate each center as the centroid of its member points.
      • Step 4: iterate Steps 2 and 3 until convergence is reached.
      • Final: points are reassigned and centroids recalculated until the k cluster centers shift by less than some delta value (see the single-machine sketch below).
    • K-means is a surprisingly parallelizable algorithm.
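
    As a reference point before the MapReduce version, here is a minimal single-machine sketch of exactly these four steps (NumPy-based, not from the original slides; it does not guard against empty clusters):

        import numpy as np

        def kmeans(points, k, delta=1e-4, seed=0):
            rng = np.random.default_rng(seed)
            # Step 1: pick k of the n points as the initial centers
            centers = points[rng.choice(len(points), size=k, replace=False)]
            while True:
                # Step 2: assign every point to its closest center
                dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
                labels = dists.argmin(axis=1)
                # Step 3: recompute each center as the centroid of its members
                new_centers = np.array([points[labels == j].mean(axis=0)
                                        for j in range(k)])
                # Step 4: stop once the centers shift by less than delta
                if np.linalg.norm(new_centers - centers) < delta:
                    return new_centers, labels
                centers = new_centers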
  • 12. K-means in Map-Reduce
    • Key points:
      • We want a scheme in which each point in the data set can be operated on independently.
      • Only a small amount of data is shared: the cluster centers.
      • When we partition points among MapReduce nodes, we also distribute a copy of the cluster centers to each node. This duplicates a small, very minimal amount of data; in exchange, each point can be processed independently.
  • 13. Hadoop Phase
    • Map:
      • In: the points in the data set.
      • Out: a (ClusterID, Point) pair for each point, where ClusterID is the integer ID of the cluster center closest to the point.
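
    The Dumbo code for this mapper only appears as an image in the original deck (slides 16-18). A hedged sketch, assuming the current centers are shipped to every node as a side file named centers.txt (for example via Dumbo's -file option), one center per line as space-separated coordinates:

        class Mapper:
            def __init__(self):
                # centers.txt: the shared copy of the current k cluster centers
                self.centers = [[float(x) for x in line.split()]
                                for line in open("centers.txt")]

            def __call__(self, key, value):
                point = [float(x) for x in value.split()]
                # squared Euclidean distance suffices for picking the closest center
                sqdist = lambda c: sum((a - b) ** 2 for a, b in zip(point, c))
                cluster_id = min(range(len(self.centers)),
                                 key=lambda i: sqdist(self.centers[i]))
                yield cluster_id, point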
  • 14. Hadoop Phase
    • Reduce phase:
      • In: (ClusterID, Point) pairs.
      • Operation: the outputs of the map phase are grouped by ClusterID, and for each ClusterID the centroid of its associated points is calculated.
      • Out: (ClusterID, Centroid) pairs, which represent the newly calculated cluster centers.
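
    A matching reducer sketch; it emits each centroid as space-separated text so that the driver sketched after the next slide can parse it straight back:

        def reducer(cluster_id, points):
            points = list(points)
            n = len(points)
            # the centroid is the component-wise mean of the member points
            centroid = [sum(coords) / n for coords in zip(*points)]
            yield cluster_id, " ".join(str(x) for x in centroid)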
  • 15. External Program
    • Each iteration of the algorithm is structured as a single MapReduce job.
    • After each iteration, our driver library reads the output, determines whether convergence has been reached by calculating how far the cluster centers have moved, and then runs another MapReduce job if needed.
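
    The driver library itself is not shown in the transcript. A sketch of its shape, assuming the mapper and reducer sketched above live in kmeans.py, the reducer emits centroids as space-separated text, and "dumbo cat" returns the output ordered by ClusterID (all file names and paths are placeholders):

        import subprocess

        DELTA = 1e-3
        HADOOP = "/usr/lib/hadoop"  # placeholder Hadoop home

        def load_centers(lines):
            return [[float(x) for x in line.split()] for line in lines if line.strip()]

        def max_shift(old, new):
            return max(sum((a - b) ** 2 for a, b in zip(o, c)) ** 0.5
                       for o, c in zip(old, new))

        centers = load_centers(open("centers.txt"))
        for it in range(20):  # safety cap on the number of iterations
            subprocess.call(["dumbo", "start", "kmeans.py",
                             "-input", "points.txt", "-output", "out%d" % it,
                             "-file", "centers.txt", "-hadoop", HADOOP])
            # 'dumbo cat' prints reducer output as "ClusterID<TAB>centroid" lines
            out = subprocess.check_output(["dumbo", "cat", "out%d" % it,
                                           "-hadoop", HADOOP]).decode()
            new_centers = load_centers(line.split("\t", 1)[1]
                                       for line in out.splitlines() if "\t" in line)
            if max_shift(centers, new_centers) < DELTA:
                break
            centers = new_centers
            with open("centers.txt", "w") as f:
                f.write("\n".join(" ".join(map(str, c)) for c in centers))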
  • 16. Write in Dumbo (code slide; image not preserved)
  • 17. Write in Dumbo (code slide; image not preserved)
  • 18. Write in Dumbo (code slide; image not preserved; hedged sketches of the mapper, reducer and driver appear after slides 13-15 above)
  • 19. Results (figure slide; image not preserved)
  • 20. Next
    • Write an n-iteration wrapper
    • Optimize K-means
    • Result visualization with Python
  • 21. Optimize
    • In the version above the mapping is one to one: for every point read, the mapper outputs a single (ClusterID, Point) pair that must be sorted and transferred to a reducer.
    • If instead partial centroids are computed on the map nodes themselves (mapper-local aggregation), the reducer only has to take a weighted average of those partial centroids.
    • We can use a combiner! (A sketch follows this slide.)
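
    A hedged sketch of that optimisation (the names are placeholders, not the authors' code): the mapper now emits (ClusterID, (point, 1)) partials instead of bare points, so the same merging logic can run both as a combiner on the map nodes and in the reducer, and registration becomes dumbo.run(mapper, reducer, combiner=combiner).

        def merge(partials):
            # fold (coordinate-sums, count) partials for one cluster together
            total, count = None, 0
            for sums, n in partials:
                total = sums if total is None else [a + b for a, b in zip(total, sums)]
                count += n
            return total, count

        def combiner(cluster_id, partials):
            # mapper-local aggregation: one merged partial per cluster
            # leaves each map node instead of one record per point
            yield cluster_id, merge(partials)

        def reducer(cluster_id, partials):
            # a weighted average of the partial sums gives the exact centroid
            total, count = merge(partials)
            yield cluster_id, " ".join(str(s / count) for s in total)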
  • 22. Dumbo Usage
    • Very easy
    • You can write your own code for Dumbo
    • Easy to debug
    • Simple command line
  • 23. Python and Data Mining
    • Books:
      • Scientific Computing with Python
      • Programming Collective Intelligence
      • Mining the Social Web
      • Natural Language Processing with Python
      • Think Stats: Python and Data Analysis
  • 24. Python and Data Mining
    • Tools:
      • NumPy
      • SciPy
      • Orange (for association rule mining)
  • 25. Thanks