Machine learning with Apache Hama
Transcript

  • 1. Machine Learning with Apache Hama. Tommaso Teofili, tommaso [at] apache [dot] org
  • 2. About me: ASF member having fun with Lucene / Solr, Hama, UIMA, Stanbol … some others. SW engineer @ Adobe R&D.
  • 3. Agenda: Apache Hama and BSP; why machine learning on BSP; some examples; benchmarks.
  • 4. Apache Hama: a Bulk Synchronous Parallel computing framework on top of HDFS for massive scientific computations. A TLP since May 2012; the 0.6.0 release is out soon; growing community.
  • 5. BSP supersteps: a BSP algorithm is composed of a sequence of “supersteps”.
  • 6. BSP supersteps: in each superstep (1, 2, …, N) every task does some computation, communicates with the other tasks, and synchronizes.
  • 7. Why BSP: a simple programming model whose superstep semantics are easy to grasp; it preserves data locality, improves performance, and is well suited for iterative algorithms.
  • 8. Apache Hama architecture: BSP program execution flow (diagram).
  • 9. Apache Hama architecture (diagram).
  • 10. Apache Hama features: BSP API; M/R-like I/O API; Graph API; job management / monitoring; checkpoint recovery; local & (pseudo-)distributed run modes; pluggable message transfer architecture; YARN supported; runs in Apache Whirr.
  • 11. Apache Hama BSP API: public abstract class BSP<K1, V1, K2, V2, M extends Writable> … K1, V1 are the key/value types for inputs; K2, V2 are the key/value types for outputs; M is the type of the messages used for task communication.
  • 12. Apache Hama BSP API: public void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws …; public void setup(BSPPeer<K1, V1, K2, V2, M> peer) throws …; public void cleanup(BSPPeer<K1, V1, K2, V2, M> peer) throws …
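To make the API concrete, here is a minimal sketch of a complete BSP class, assuming the 0.6.x-era API shown above; the class name, the input/output types, and the constant being broadcast are illustrative only:

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hama.bsp.BSP;
    import org.apache.hama.bsp.BSPPeer;
    import org.apache.hama.bsp.sync.SyncException;

    // Hypothetical example: each task broadcasts a local value to all peers,
    // synchronizes (ending the first superstep), then sums what it received.
    public class SumBSP extends
        BSP<LongWritable, DoubleWritable, NullWritable, DoubleWritable, DoubleWritable> {

      @Override
      public void bsp(
          BSPPeer<LongWritable, DoubleWritable, NullWritable, DoubleWritable, DoubleWritable> peer)
          throws IOException, SyncException, InterruptedException {
        // superstep 1: do some computation, then communicate
        double local = 1.0; // stands in for any locally computed value
        for (String other : peer.getAllPeerNames()) {
          peer.send(other, new DoubleWritable(local));
        }
        peer.sync(); // barrier: all messages are delivered after this point

        // superstep 2: aggregate the received messages and write the result
        double sum = 0;
        DoubleWritable msg;
        while ((msg = peer.getCurrentMessage()) != null) {
          sum += msg.get();
        }
        peer.write(NullWritable.get(), new DoubleWritable(sum));
      }
    }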
  • 13. Machine learning on BSP: lots (most?) of ML algorithms are inherently iterative. The Hama ML module currently includes collaborative filtering, clustering, and gradient descent.
  • 14. Benchmarking architecture (diagram: a multi-node cluster running Hama and Mahout on HDFS, alongside Solr, Lucene, and a DBMS).
  • 15. Collaborative filtering: given user preferences on movies, we want to find users “near” to some specific user, so that the user can “follow” them and/or see what they like (which he/she could like too).
  • 16. Collaborative filtering BSP: given a specific user, iteratively (for each task): superstep 1*i: read a new user preference row and find how near that user is to the current user, i.e. how near their preferences are; since preferences are given as vectors, we may use vector distance measures such as Euclidean or cosine distance; then broadcast the measure output to the other peers. Superstep 2*i: aggregate the measure outputs and update the most relevant users. Still to be committed (HAMA-612); a sketch follows.
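Since HAMA-612 had not been committed, the following is only an illustrative sketch of the two supersteps, not the actual patch; the class name, the input encoding (one comma-separated preference row per user), the hard-coded target vector, and the master-peer aggregation are all assumptions, and the per-row iteration is collapsed into a single pass for brevity:

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hama.bsp.BSP;
    import org.apache.hama.bsp.BSPPeer;
    import org.apache.hama.bsp.sync.SyncException;

    public class NearestUsersBSP extends BSP<Text, Text, Text, DoubleWritable, Text> {

      // the given user's preference vector ("paula" in the next slide's data)
      private static final double[] TARGET = { 7, 3, 8, 2, 8.5, 0, 0 };

      @Override
      public void bsp(BSPPeer<Text, Text, Text, DoubleWritable, Text> peer)
          throws IOException, SyncException, InterruptedException {
        // superstep 1: read preference rows, measure how near each user is to
        // the target (Euclidean distance), broadcast the measures to a master
        String master = peer.getPeerName(0);
        Text user = new Text();
        Text prefs = new Text();
        while (peer.readNext(user, prefs)) {
          String[] scores = prefs.toString().split(",");
          double sum = 0;
          for (int i = 0; i < scores.length; i++) {
            double d = Double.parseDouble(scores[i].trim()) - TARGET[i];
            sum += d * d;
          }
          peer.send(master, new Text(user + ":" + Math.sqrt(sum)));
        }
        peer.sync();

        // superstep 2: the master aggregates the measures; keeping only the
        // k smallest distances would yield the "most relevant" users
        if (peer.getPeerName().equals(master)) {
          Text msg;
          while ((msg = peer.getCurrentMessage()) != null) {
            String[] kv = msg.toString().split(":");
            peer.write(new Text(kv[0]), new DoubleWritable(Double.parseDouble(kv[1])));
          }
        }
      }
    }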
  • 17. Collaborative filtering BSP: given user ratings about movies:
    "john"    -> 0, 0, 0, 9.5, 4.5, 9.5, 8
    "paula"   -> 7, 3, 8, 2, 8.5, 0, 0
    "jim"     -> 4, 5, 0, 5, 8, 0, 1.5
    "tom"     -> 9, 4, 9, 1, 5, 0, 8
    "timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0
    we ask for the 2 nearest users to “paula” and get “timothy” and “tom” (user recommendation); we can then extract highly rated movies from “timothy” and “tom” that “paula” didn’t see (item recommendation).
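To make the result concrete, here are the Euclidean distances from “paula”, worked out by hand from the vectors above (my arithmetic, not from the slides):

    d(\text{paula}, \text{timothy}) = \sqrt{0^2 + 0^2 + 2.5^2 + 2^2 + 1^2 + 6.5^2 + 0^2} = \sqrt{53.5} \approx 7.3
    d(\text{paula}, \text{tom}) = \sqrt{83.25} \approx 9.1
    d(\text{paula}, \text{jim}) = \sqrt{88.5} \approx 9.4
    d(\text{paula}, \text{john}) = \sqrt{348.5} \approx 18.7

so “timothy” and “tom” are indeed the two nearest users.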
  • 18. Benchmarks: a fairly simple, highly iterative algorithm. Compared to Apache Mahout, it behaves better than ALS-WR and similarly to RecommenderJob and ItemSimilarityJob.
  • 19. K-Means clustering: we have a bunch of data (e.g. documents) and want to group those docs into k homogeneous clusters. Iteratively: assign each doc to the cluster with the nearest center, then recalculate each cluster’s center.
  • 20. K-Means clustering (diagram).
  • 21. K-Means clustering BSP: iteratively, superstep 1*i (assignment phase): read the vector splits, sum up temporary centers with the assigned vectors, and broadcast the sum and the count of ingested vectors; superstep 2*i (update phase): calculate the total sum and average over all received messages, replace the old centers with the new ones, and check for convergence (sketch below).
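A condensed, framework-free sketch of the two phases; in Hama each phase would end with peer.sync() and the partial sums would travel as messages, but here the class, the method names, and the in-memory arrays are illustrative assumptions:

    // One k-means iteration as two BSP-style phases.
    public class KMeansPhases {

      // index of the center nearest to v (squared Euclidean distance)
      static int nearestCenter(double[] v, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
          double sum = 0;
          for (int i = 0; i < v.length; i++) {
            double d = v[i] - centers[c][i];
            sum += d * d;
          }
          if (sum < bestDist) { bestDist = sum; best = c; }
        }
        return best;
      }

      // assignment phase (superstep 1*i): sum this peer's vectors per center;
      // (partialSums, counts) is what would be broadcast before sync()
      static void assign(double[][] split, double[][] centers,
                         double[][] partialSums, int[] counts) {
        for (double[] v : split) {
          int c = nearestCenter(v, centers);
          counts[c]++;
          for (int i = 0; i < v.length; i++) partialSums[c][i] += v[i];
        }
      }

      // update phase (superstep 2*i): average the aggregated sums into new
      // centers and report whether they moved less than epsilon (convergence)
      static boolean update(double[][] centers, double[][] totalSums,
                            int[] totalCounts, double epsilon) {
        double moved = 0;
        for (int c = 0; c < centers.length; c++) {
          for (int i = 0; i < centers[c].length; i++) {
            double next = totalSums[c][i] / Math.max(1, totalCounts[c]);
            moved += Math.abs(next - centers[c][i]);
            centers[c][i] = next;
          }
        }
        return moved < epsilon;
      }
    }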
  • 22. Benchmarks: a one-rack cluster (16 nodes, 256 cores) with a 10G network; on average faster than Mahout’s implementation.
  • 23. Gradient descent: an optimization algorithm to find a (local) minimum of some function. Used for solving linear systems, solving non-linear systems, and in machine learning tasks: linear regression, logistic regression, neural network backpropagation, …
  • 24. Gradient descent: minimize a given (cost) function: give the function a starting point (a set of parameters), iteratively change the parameters in order to minimize the function, and stop at the (local) minimum. There’s some math, but intuitively: evaluate the derivatives at the current point in order to choose where to “go” next (update rule below).
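The math the slide alludes to is the standard update rule: for a cost function J(θ) and a learning rate α, each parameter moves a small step against its partial derivative,

    \theta_j \leftarrow \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j} \qquad \text{for all parameters } j,

repeated until J(θ) stops decreasing.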
  • 25. Gradient descent BSP: iteratively, superstep 1*i: each task calculates and broadcasts its portion of the cost function with the current parameters; superstep 2*i: aggregate and update the cost function, then check the aggregated cost and the iteration count (the cost should always decrease); superstep 3*i: each task calculates and broadcasts its portions of the (partial) derivatives; superstep 4*i: aggregate and update the parameters (sketch below).
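A framework-free sketch of the per-task work in those supersteps, for the linear regression case of the next slides; in Hama the localCost and localGradient results would be broadcast and aggregated between sync() barriers. All names here are illustrative, and each x vector is assumed to carry a leading bias feature of 1:

    public class GradientDescentSketch {

      // superstep 1*i: this task's portion of the squared-error cost
      static double localCost(double[][] x, double[] y, double[] theta) {
        double cost = 0;
        for (int n = 0; n < x.length; n++) {
          double err = predict(x[n], theta) - y[n];
          cost += err * err;
        }
        return cost / (2.0 * x.length);
      }

      // superstep 3*i: this task's portion of the partial derivatives
      static double[] localGradient(double[][] x, double[] y, double[] theta) {
        double[] grad = new double[theta.length];
        for (int n = 0; n < x.length; n++) {
          double err = predict(x[n], theta) - y[n];
          for (int j = 0; j < theta.length; j++) {
            grad[j] += err * x[n][j] / x.length;
          }
        }
        return grad;
      }

      // supersteps 2*i and 4*i: after aggregating the broadcast portions,
      // step the parameters against the gradient (cost checking elided)
      static void update(double[] theta, double[] aggregatedGradient, double alpha) {
        for (int j = 0; j < theta.length; j++) {
          theta[j] -= alpha * aggregatedGradient[j];
        }
      }

      // linear hypothesis: dot product of features and parameters
      static double predict(double[] features, double[] theta) {
        double h = 0;
        for (int j = 0; j < theta.length; j++) h += theta[j] * features[j];
        return h;
      }
    }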
  • 26. Gradient descent BSP, a simplistic example: linear regression. Given a real estate market dataset, estimate new houses’ prices from known houses’ sizes, geographic regions, and prices. Expected output: the actual parameters for the (linear) prediction function.
  • 27. Gradient descent BSP: generate a different model for each region. House item vectors map price -> size (e.g. 150k -> 80), so each point lives in a 2-dimensional space; the dataset holds ~1.3M vectors (model below).
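With a single feature this is the textbook setting; assuming the usual notation, with x a house’s size and y its price, the model being fitted and the cost being minimized are

    h_\theta(x) = \theta_0 + \theta_1 x, \qquad
    J(\theta) = \frac{1}{2m} \sum_{n=1}^{m} \left( h_\theta(x^{(n)}) - y^{(n)} \right)^2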
  • 28. Gradient descent BSP: dataset and model fit (plot).
  • 29. Gradient descent BSP: cost checking (plot).
  • 30. Gradient descent BSP, classification: logistic regression with gradient descent, on the real estate market dataset. We want to find which estate listings belong to agencies, to avoid buying from them. Same algorithm, with a different cost function and features (see below): existing items are tagged or not as “belonging to agency”, and vectors are created from the items’ text. Sample vector: 1 -> 1 3 0 0 5 3 4 1.
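The “different cost function” is the standard logistic one; assuming the usual notation, the hypothesis becomes a sigmoid and the cost its negative log-likelihood:

    h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}, \qquad
    J(\theta) = -\frac{1}{m} \sum_{n=1}^{m} \left[ y^{(n)} \log h_\theta(x^{(n)}) + (1 - y^{(n)}) \log\left(1 - h_\theta(x^{(n)})\right) \right]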
  • 31. Gradient descent BSP: classification (plot).
  • 32. Benchmarks: not directly comparable to Mahout’s regression algorithms, as both SGD and CGD are inherently better than plain GD; but Hama’s GD had on average the same performance as Mahout’s SGD / CGD. The next step is implementing SGD / CGD on top of Hama.
  • 33. Wrap up: even if the ML module is still “young” / a work in progress, and tools like Apache Mahout have better “coverage”, Apache Hama can be particularly useful in certain “highly iterative” use cases. Interesting benchmarks.
  • 34. Thanks!
