
Hadoop introduction 2

  • Transcript

    • 1. Hadoop Introduction II: K-means && Python && Dumbo
    • 2. Outline
      • Dumbo
      • K-means
      • Python and Data Mining
    • 3. Hadoop in Python
      • Jython: Happy
      • Cython:
        • Pydoop
          • components (RecordReader, RecordWriter and Partitioner)
          • get configuration, set counters and report status
          • CPython: use any module
          • HDFS API
        • Hadoopy: another Cython-based wrapper
      • Streaming:
        • Dumbo
        • other small Map-Reduce wrappers
    • 4. Hadoop in Python
    • 5. Hadoop in Python: Extension
      • Integration with Pipes (C++) + integration with libhdfs (C)
    • 6. Dumbo
      • Dumbo is a project that allows you to easily write and run Hadoop programs in Python. More generally, Dumbo can be considered a convenient Python API for writing MapReduce programs.
      • Advantages:
        • Easy: Dumbo strives to be as Pythonic as possible
        • Efficient: Dumbo programs communicate with Hadoop in a very efficient way by relying on typed bytes, a nifty serialisation mechanism that was added to Hadoop specifically with Dumbo in mind
        • Flexible: we can extend it
        • Mature
    • 7. Dumbo: Review WordCount
    • 8. Dumbo – Word Count
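The word-count slides are screenshots in the original deck, so the exact code is not recoverable. A minimal sketch of a Dumbo word count, modelled on the standard Dumbo example:

```python
# Minimal Dumbo word count (standard Dumbo example style).
# Run with: dumbo start wordcount.py -hadoop /path/to/hadoop -input in.txt -output out

def mapper(key, value):
    # value is one line of input text
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # key is a word, values are its partial counts
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```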
    • 9. Dumbo IP counts
    • 10. Dumbo IP counts
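Slides 9–10 are also screenshots. A hedged sketch of an IP-count job, assuming Apache-style access logs where the client IP is the first whitespace-separated field:

```python
# Hypothetical Dumbo IP-count job; the log format is an assumption.
def mapper(key, value):
    fields = value.split()
    if fields:
        yield fields[0], 1          # emit (ip, 1)

def reducer(key, values):
    yield key, sum(values)          # total requests per IP

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```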
    • 11. K-means in Map-Reduce
      • Normal K-means:
        • Inputs: a set of n d-dimensional points and a number of desired clusters k
        • Step 1: randomly choose k of the n points as initial centers
        • Step 2: compute the distance from every point to each of the k centers and assign the point to the closest one
        • Step 3: using this assignment of points to cluster centers, recalculate each cluster center as the centroid of its member points
        • Step 4: iterate this process until convergence is reached
        • Final: points are reassigned to centers, and centroids recalculated, until the k cluster centers shift by less than some delta value
      • K-means is a surprisingly parallelizable algorithm.
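For reference, one K-means iteration of the kind described above, sketched with NumPy outside of Hadoop (shapes and names are illustrative; it assumes every cluster keeps at least one point):

```python
import numpy as np

def kmeans_step(points, centers):
    # points: (n, d) array, centers: (k, d) array
    # distances: (n, k) matrix of point-to-center distances
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)            # closest center per point
    new_centers = np.array([points[assignment == j].mean(axis=0)
                            for j in range(len(centers))])
    return assignment, new_centers
```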
    • 12. K-means in Map-Reduce
      • Key points:
        • we want a scheme where we can operate on each point in the data set independently
        • only a small amount of data (the cluster centers) is shared
        • when we partition points among MapReduce nodes, we also distribute a copy of the cluster centers. This results in a small amount of data duplication, but it is minimal, and each point can then be operated on independently.
    • 13. Hadoop Phase
      • Map:
        • In: points in the data set
        • Out: a (ClusterID, Point) pair for each point, where ClusterID is the integer id of the cluster whose center is closest to the point
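A sketch of such a mapper. The "centers.npy" side file and the comma-separated input format are assumptions; the deck's own code is shown only as images:

```python
import numpy as np

# Shared copy of the k current centers, shipped to every node (e.g. via -file).
CENTERS = np.load("centers.npy")

def mapper(key, value):
    # value is one comma-separated d-dimensional point
    point = np.array([float(x) for x in value.split(",")])
    distances = np.linalg.norm(CENTERS - point, axis=1)
    yield int(distances.argmin()), point.tolist()    # (ClusterID, Point)
```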
    • 14. Hadoop Phase
      • Reduce:
        • In: (ClusterID, Point) pairs
        • Operation:
          • the outputs of the map phase are grouped by ClusterID
          • for each ClusterID, the centroid of the points associated with that ClusterID is calculated
        • Out: (ClusterID, Centroid) pairs, which represent the newly calculated cluster centers
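A matching reducer sketch, assuming the mapper emits each point as a plain list of floats:

```python
import numpy as np

def reducer(cluster_id, points):
    # points arrive already grouped by ClusterID; the new center is their mean
    pts = np.array(list(points), dtype=float)
    yield cluster_id, pts.mean(axis=0).tolist()      # (ClusterID, Centroid)
```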
    • 15. External Program
      • Each iteration of the algorithm is structured as a single MapReduce job.
      • After each iteration, our library reads the output, determines whether convergence has been reached by calculating how far the cluster centers have moved, and if not, runs another MapReduce job.
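A sketch of that outer loop. run_kmeans_job() is a hypothetical placeholder for launching the Dumbo job with the current centers and reading the new centers back from HDFS; it is not a real API:

```python
import numpy as np

def kmeans_driver(initial_centers, delta=1e-4, max_iter=20):
    centers = np.asarray(initial_centers, dtype=float)
    for _ in range(max_iter):
        new_centers = run_kmeans_job(centers)        # one MapReduce job (placeholder)
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < delta:                            # centers moved less than delta
            break
    return centers
```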
    • 16. Write in Dumbo
    • 17. Write in Dumbo
    • 18. Write in Dumbo
    • 19. Results
    • 20. Next
      • write an n-iteration wrapper
      • optimize K-means
      • result visualization with Python
    • 21. Optimize
      • In the basic version the mapping was one to one: for every input point, our mapper output a single point that had to be sorted and transferred to a reducer.
      • Instead, partial centroids for the clusters can be computed on the map nodes themselves (local calculation in the mapper), and a weighted average of the partial centroids is taken later by the reducer.
      • We can use a Combiner!
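A hedged sketch of the combiner idea. It assumes the mapper now emits (cluster_id, (point, 1)) values, so that raw mapper output and combined output share the same (vector_sum, count) format, which matters because Hadoop may run the combiner zero or more times:

```python
import numpy as np

def _accumulate(values):
    # sum the partial (vector_sum, count) pairs for one ClusterID
    total, count = None, 0
    for vec, n in values:
        vec = np.asarray(vec, dtype=float)
        total = vec if total is None else total + vec
        count += n
    return total, count

def combiner(cluster_id, values):
    total, count = _accumulate(values)
    yield cluster_id, (total.tolist(), count)        # partial centroid data

def reducer(cluster_id, values):
    total, count = _accumulate(values)
    yield cluster_id, (total / count).tolist()       # weighted average = new center
```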
    • 22. Dumbo Usage
      • very easy
      • you can write your own code for Dumbo
      • easy to debug
      • simple commands
    • 23. Python and Data Mining
      • Books:
        • Scientific Computing with Python
        • Programming Collective Intelligence
        • Mining the Social Web
        • Natural Language Processing with Python
        • Think Stats: Python and Data Analysis
    • 24. Python and Data Mining
      • Tools:
        • NumPy
        • SciPy
        • Orange (association rule mining with Orange)
    • 25. Thanks