
Apache Spark Usage in the Open Source Ecosystem

Apache Spark is an active member of the broad open source community beyond the Apache Software Foundation. Every day, thousands of users combine the capabilities of Spark with other open source software to get their jobs done. This is not by chance: Spark has been designed to interoperate well with existing ecosystems. For example, PySpark is designed to work well with pandas, NumPy, and other Python packages. In this talk we will present an analysis of libraries and open source tools that are commonly used along with Spark in the JVM, Python, and R ecosystems. Our quantitative results are based on the usage of thousands of Spark users. We will show Spark Summit attendees what the rest of their community finds useful to complement the power of Spark, and which parts of the Spark API are used in conjunction with the most popular open source libraries.
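As a hedged illustration of the pandas/NumPy interoperability mentioned above (not part of the original abstract), here is a minimal PySpark sketch. It assumes a Spark 2.x or later SparkSession; the data and column names are invented for the example.

```python
# Minimal sketch (assumes Spark 2.x+); the data and column names are made up.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-interop").getOrCreate()

# A local pandas DataFrame moves into Spark...
pdf = pd.DataFrame({"user": ["a", "b", "c"], "clicks": [3, 7, 1]})
df = spark.createDataFrame(pdf)

# ...Spark does the distributed filtering and sorting...
top = df.filter(df.clicks > 2).orderBy(df.clicks.desc())

# ...and a small result comes back as pandas for local analysis.
print(top.toPandas())
```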

  1. Apache Spark Usage in the Open Source Ecosystem. Hossein Falaki (@mhfalaki)
  2. About me • Software Engineer / part-time Data Scientist at Databricks • I have been using Apache Spark since version 0.6 • Developed the first version of the Apache Spark CSV data source • Worked on SparkR and R notebooks at Databricks
  3. Stack Overflow 2016 trending tech
  4. Apache Spark Philosophy: (1) Unified engine: support end-to-end applications. (2) High-level APIs: easy to use, rich optimizations. (3) Integrate broadly: storage systems, libraries, etc. (SQL, Streaming, ML, Graph, ...)
  5. Databricks Community Edition • In February Databricks launched a free version of its cloud-based platform in beta • Since then more than 8,000 users registered • Users created over 61,000 notebooks in different languages • This is an analysis of third-party libraries that our beta users imported to complement Apache Spark in Scala, Python, and R
  6. What % of users use other libraries? Python: 75% of users import external libs (average 9, median 2). Scala: 55% (average 3, median 1). R: 57% (average 6, median 1).
  7. Installing libraries is easy
  8. Python Packages
  9. Most popular Python packages
  10. What is test_helper?
  11. What are these? ETL: re, datetime, pandas, json, csv, string, math / operator, urllib / urllib2. Visualization: matplotlib, ggplot, seaborn. Advanced analytics: numpy, sklearn, graphframes, tensorflow, scipy. Other: test_helper, os, md5. (A sketch of this kind of mixing follows the transcript.)
  12. Python package categories
  13. What packages go together?
  14. Scala Packages
  15. Most popular Scala libraries
  16. What are these? ETL: java/scala util, scala.collection, scala.math, java.{io, nio}, java.text, o.a.commons, kafka, twitter4j. Visualization: ? Advanced analytics: spark.ml, graphframes. Other: java.net, scala.sys.
  17. Scala package categories
  18. What libraries go together?
  19. R Packages
  20. Most popular R packages
  21. What are these? ETL: dplyr, plyr, reshape2, jsonlite, tidyr, lubridate, httr, data.table. Visualization: ggplot2, beanplot, plotly, ... Advanced analytics: sparkr, h2o, caret, e1071. Other: devtools, magrittr.
  22. R package categories
  23. Comparing Python, Scala & R
  24. Languages have unique features • 25% of users use multiple languages • 3% of notebooks mix different languages
  25. Summary • Spark users extensively mix it with other packages in different languages (one of the goals of the Spark project is working well with other projects) • ETL-related libraries are the most popular category (opportunities for new data sources) • Notebooks are being used for "small data" as well as "big data" • Languages and their ecosystems have diverse capabilities; users seem to be mixing languages to their advantage (Scala is missing visualization libraries)
  26. Try your favorite library in Databricks: http://databricks.com/ce. Try the latest version of Apache Spark and a preview of Spark 2.0
  27. Thank you!
  28. What packages are used together?
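
The package categories in slides 11 and 21 suggest a common workflow: Spark handles the ETL, and a local library handles visualization or modeling on the small aggregated result. The sketch below is not from the deck; the input path and column names are hypothetical.

```python
# Hypothetical sketch: Spark does the heavy ETL, pandas/matplotlib plot the
# small aggregate. The input path and column names are made up.
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-then-plot").getOrCreate()

daily = (spark.read.json("/data/events.json")          # hypothetical input
              .withColumn("day", F.to_date("timestamp"))
              .groupBy("day")
              .count()
              .orderBy("day"))

# Only the aggregate (a handful of rows) leaves the cluster.
pdf = daily.toPandas()
pdf.plot(x="day", y="count", kind="line", title="Events per day")
plt.show()
```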
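
graphframes appears in both the Python (slide 11) and Scala (slide 16) lists. A minimal, hedged Python sketch of combining it with Spark DataFrames follows; it assumes the graphframes Spark package is attached to the cluster, and the vertex and edge data are invented.

```python
# Assumes the graphframes package is installed on the cluster; toy data only.
from graphframes import GraphFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graphframes-sketch").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# PageRank over the toy graph; results stay in Spark DataFrames.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```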
