Published in:
Data & Analytics

License: CC Attribution-ShareAlike License


- 1. Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in Spark. Lorand Dali (@lorserker) #EUds9
- 2. You don’t have to implement your own optimization algorithm* *unless you want to play around and learn a lot of new stuff
- 3. Use a representation that is suited for distributed implementation
- 4. Logistic regression definition: prediction p = σ(w·x) from the weights w and the feature vector x, with σ(z) = 1/(1 + e^(−z)); loss L = −(y·log p + (1 − y)·log(1 − p)); derivative of the loss (the gradient) ∂L/∂w = (p − y)·x; weight update w ← w − α·(p − y)·x for learning rate α
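These standard definitions can be sketched in plain Python (the talk's implementation is in Scala on Spark; the names below are illustrative, not the talk's code):

```python
import math

def sigmoid(z):
    # numerically stable logistic function sigma(z) = 1 / (1 + e^(-z))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    # prediction p = sigmoid(w . x) for one dense example
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def log_loss(p, y):
    # L = -(y log p + (1 - y) log(1 - p))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def gradient(w, x, y):
    # dL/dw = (p - y) * x
    p = predict(w, x)
    return [(p - y) * xi for xi in x]

def sgd_step(w, x, y, lr=0.1):
    # weight update: w <- w - lr * (p - y) * x
    g = gradient(w, x, y)
    return [wi - lr * gi for wi, gi in zip(w, g)]
```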
- 5. Logistic regression vectorized: stack the examples as rows of a feature matrix, so the dot products of the rows with the weight vector yield all predictions at once
- 6. How to compute the gradient vector
- 7. Computing dot products and predictions
- 8. Computing the gradient
- 9. Representation: weights as Array[Double] (or Map[Int, Double]); examples as RDD[(Long, Seq[(Int, Double)])], pairing a row index with a sparse vector of (column index, feature value) entries, spread across partitions; predictions as RDD[(Long, Double)]
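A minimal Python stand-in for this sparse representation (the real code operates on Spark RDDs; here a dict keyed by row index plays that role, and all names are illustrative):

```python
def sparse_dot(weights, row):
    # weights: dense list of weights (Array[Double] in the talk)
    # row: sparse example, a list of (column index, feature value) pairs
    return sum(weights[i] * v for i, v in row)

def predict_all(weights, examples):
    # examples: {row index: sparse row}, mirroring RDD[(Long, Seq[(Int, Double)])]
    # returns {row index: dot product}, mirroring RDD[(Long, Double)]
    return {rid: sparse_dot(weights, row) for rid, row in examples.items()}
```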
- 10. Gradient: the gradient (Array[Double]) is the transposed data matrix (RDD[(Long, Seq[(Int, Double)])]) multiplied by the residual vector, prediction minus label (RDD[(Long, Double)])
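A hedged local sketch of that product: each sparse entry (j, v) of row i contributes residual_i · v to gradient component j, so the transpose never has to be materialized (illustrative Python; the real code works on RDDs):

```python
def sparse_gradient(examples, residuals, dim):
    # examples:  {row id: [(column index, feature value), ...]}
    # residuals: {row id: prediction - label}
    # computes grad = X^T (p - y) by scattering each sparse entry's
    # contribution into the dense gradient accumulator
    grad = [0.0] * dim
    for rid, row in examples.items():
        r = residuals[rid]
        for j, v in row:
            grad[j] += r * v
    return grad
```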
- 11. Experimental dataset - avazu click prediction dataset (sites) - 20 million examples - 1 million dimensions - we just want to try it out https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#avazu
- 12. Learning curve
- 13. time per iteration (AWS EMR cluster, 5 nodes of m4.2xlarge)
- 14. Use a partitioner to avoid shuffles
- 15. We have two joins in our code
- 16. Why is the join expensive: joining RDDs that are partitioned differently forces a shuffle; when both sides are partitioned the same way, the join needs no shuffle
- 17. Using a custom partitioner
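One way to picture the effect of a shared partitioner, simulated locally in Python: when both keyed datasets are bucketed by the same hash function, every pair of records a join needs already sits in the same bucket, so buckets can be joined independently with no data movement (a local simulation of the idea, not Spark API code):

```python
def partition(key, num_partitions):
    # hash partitioner: the same key always lands in the same bucket
    return hash(key) % num_partitions

def partition_by(pairs, num_partitions):
    # bucket a keyed dataset, mimicking RDD.partitionBy
    buckets = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        buckets[partition(k, num_partitions)].append((k, v))
    return buckets

def local_join(a_buckets, b_buckets):
    # both sides used the same partitioner, so matching keys share a
    # bucket index and each bucket joins on its own - no "shuffle"
    out = []
    for a, b in zip(a_buckets, b_buckets):
        lookup = dict(b)
        out.extend((k, (v, lookup[k])) for k, v in a if k in lookup)
    return out
```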
- 18. time per iteration
- 19. Try to avoid joins altogether
- 20. Gradient descent without joins dimension
- 21. time per iteration
- 22. Use aggregate and treeAggregate
- 23. treeAggregate over the examples: the seq op folds each example's features into a partition-local partial gradient; the comb op merges partial gradients pairwise up a tree
- 24. Seq Op
- 25. Comb Op
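The seq op and comb op can be simulated locally, including the pairwise tree merge that treeAggregate performs over partitions (an illustrative Python sketch of the mechanism, not the talk's Scala code):

```python
def seq_op(grad, example):
    # fold one example of a partition into the partial gradient;
    # example = (residual, sparse features) with residual = p - y
    residual, row = example
    for j, v in row:
        grad[j] += residual * v
    return grad

def comb_op(g1, g2):
    # merge two partial gradients element-wise
    return [a + b for a, b in zip(g1, g2)]

def tree_aggregate(partitions, dim):
    # per-partition fold with seq_op, then pairwise tree merge with comb_op
    partials = []
    for part in partitions:
        g = [0.0] * dim
        for ex in part:
            g = seq_op(g, ex)
        partials.append(g)
    while len(partials) > 1:
        merged = [comb_op(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            merged.append(partials[-1])
        partials = merged
    return partials[0]
```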
- 26. time per iteration
- 27. If you can’t decrease the time per iteration, make the iteration smaller
- 28. Mini batch gradient descent
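A local Python sketch of mini-batch gradient descent on the sparse representation: each iteration samples only a batch of rows, so the per-iteration cost scales with the batch size instead of the full dataset (names, defaults, and the fixed seed are illustrative assumptions):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_gradient_descent(examples, labels, dim, batch_size,
                               lr=0.1, iterations=100, seed=0):
    # examples: {row id: [(column index, feature value), ...]} (sparse rows)
    # labels:   {row id: 0.0 or 1.0}
    rng = random.Random(seed)
    w = [0.0] * dim
    ids = list(examples)
    for _ in range(iterations):
        batch = rng.sample(ids, batch_size)
        grad = [0.0] * dim
        for rid in batch:
            p = sigmoid(sum(w[j] * v for j, v in examples[rid]))
            r = p - labels[rid]            # residual: prediction minus label
            for j, v in examples[rid]:
                grad[j] += r * v
        w = [wj - lr * gj / batch_size for wj, gj in zip(w, grad)]
    return w
```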
- 29. Learning curve still OK
- 30. time per iteration
- 31. time per iteration
- 32. If time per iteration is minimal, try to have fewer iterations
- 33. Find a good initialization for the bias - Usually we initialize weights randomly (or to zero) - But a careful initialization of the bias can help (especially in very unbalanced datasets) - We start the gradient descent from a better point and can save several iterations
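One common way to do this initialization is to set the bias to the log-odds of the base rate, so the model's initial prediction already matches the overall positive rate; a small Python sketch, assuming binary 0/1 labels:

```python
import math

def init_bias(labels):
    # choose b so that sigmoid(b) equals the positive rate of the data:
    # b = log(p / (1 - p)), the log-odds of the base rate; on a very
    # unbalanced dataset this starts gradient descent much closer to
    # the optimum than b = 0 would
    p = sum(labels) / len(labels)
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))
```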
- 34. Learning curve before bias init
- 35. Learning curve after bias init
- 36. Try a better optimization algorithm to converge faster
- 37. ADAM - converges faster - combines ideas from gradient descent, momentum, and RMSProp - keeps exponential moving averages of the gradient and of its square, and takes larger steps when gradients are consistent or small - useful for making better progress on plateaus
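A single ADAM update step can be sketched as follows (illustrative Python; hyperparameter defaults follow the commonly published values, and the step counter t starts at 1 for the bias correction):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: exponential moving average of the gradient (momentum term)
    # v: exponential moving average of the squared gradient (RMSProp term)
    # dividing by sqrt(v_hat) makes steps larger when gradients are
    # small but consistent, which helps progress on plateaus
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * gi
        vi = beta2 * vi + (1 - beta2) * gi * gi
        m_hat = mi / (1 - beta1 ** t)      # bias correction, needs t >= 1
        v_hat = vi / (1 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v
```

For example, iterating this step on f(w) = w² (gradient 2w) drives w toward 0.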
- 38. Learning curve ADAM
- 39. time per iteration
- 40. Conclusion - we implemented logistic regression from scratch - the first version was very slow - but we managed to improve the iteration time 40x - and also made it converge faster
- 41. Thank you! - Questions, but only simple ones please :) - Looking forward to discussing offline - Or write me an email Lorand@Lorand.me - Play with the code - And come work with me at http://bit.ly/slogreg
