This presentation will be useful to those who would like to get acquainted with the Apache Spark architecture and its top features, and to see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Presented at the Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
4. Agenda
• Buzzwords
• Spark in a Nutshell
• Spark Concepts
• Spark Core
• Live demo session
• Spark SQL
• Live demo session
• Road to Production
• Spark Drawbacks
• Our Spark Integration
• Spark Is on the Rise
6. Big Data: a buzzword for large and complex data sets
that are difficult to process using on-hand database
management tools or traditional data processing applications
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
10. Not to Hadoop?
• Real-time, streaming
• Data structures that cannot be decomposed into key-value pairs
• Jobs/algorithms that do not fit the MapReduce programming model
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
11. Not to Hadoop?
• Subset of the Data Is Enough
Remove excessive complexity or shrink the data set via other processing techniques, e.g. hashing or clustering
• Random, Interactive Access to Data
Well-structured data; a bunch of scalable, mature (No)SQL DB solutions exist (HBase, Cassandra, columnar scalable DW engines)
• Sensitive Data
Security is still very challenging and immature
12. Why Spark?
As of mid-2014, Spark is the most active Big Data project
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
[Chart: contributors per month to Spark]
28. RDD Operations
• transformations are executed on the workers
• actions may transfer data from the workers to the driver
• collect() sends all the partitions to the single driver (see the sketch below)
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
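A minimal, self-contained sketch of the distinction, using the Spark 1.x Java API in local mode (the app name and sample data are placeholders):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddOperationsDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // map() is a transformation: lazy, executed on the workers
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

        // collect() is an action: it triggers execution and ships every
        // partition back to the single driver, so use it only on small results
        List<Integer> result = doubled.collect();
        System.out.println(result); // [2, 4, 6, 8, 10]

        sc.stop();
    }
}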
31. Requirements
Analytics about Morning@Lohika events:
• unique participants by company
• most loyal participants
• participants by position
• etc.
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
32. Data Format
Simple CSV files; all fields are optional:
First Name | Last Name    | Company      | Position          | Email                             | Present
Vladimir   | Tsukur       | GlobalLogic  | Tech/Team Lead    | flushdia@gmail.com                | 1
Mikalai    | Alimenkou    | XP Injection | Tech Lead         | mikalai.alimenkou@xpinjection.com | 1
Taras      | Matyashovsky | Lohika       | Software Engineer | taras.matyashovsky@gmail.com      | 0
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
33. Technologies
• Spring Boot 1.2.3.RELEASE
• Spark 1.3.1, released April 17, 2015
• 2 Spark jar dependencies
• Apache 2.0 license, i.e. free to use
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
34. Features
• simple HTTP-based API
• file system: local and HDFS
• data formats: CSV and Parquet
• 3 compatible implementations (sketched below) based on:
• RDD (Spark Core)
• DataFrame DSL (Spark SQL)
• DataFrame SQL (Spark SQL)
• serialization: default Java and Kryo
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
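To give the flavour of the three styles, here is a self-contained sketch of the "unique participants by company" query on Spark 1.3.1. The Participant bean, file name, and column choice are simplified assumptions for illustration, not the repository's exact code:

import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import scala.Tuple2;
import static org.apache.spark.sql.functions.*;

public class UniqueParticipantsByCompany {

    // minimal bean for SQLContext.createDataFrame(); the real model is richer
    public static class Participant implements java.io.Serializable {
        private String company;
        private String email;
        public Participant() {}
        public Participant(String company, String email) {
            this.company = company;
            this.email = email;
        }
        public String getCompany() { return company; }
        public void setCompany(String company) { this.company = company; }
        public String getEmail() { return email; }
        public void setEmail(String email) { this.email = email; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("by-company").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);

        // header and empty-field handling omitted for brevity
        JavaRDD<String[]> rows = sc.textFile("participants.csv")
                .map(line -> line.split(","));

        // 1. RDD (Spark Core): deduplicate (company, email) pairs, then count
        Map<String, Long> byCompanyRdd = rows
                .mapToPair(r -> new Tuple2<>(r[2], r[4]))
                .distinct()
                .mapValues(email -> 1L)
                .reduceByKey(Long::sum)
                .collectAsMap();
        System.out.println(byCompanyRdd);

        DataFrame df = sqlContext.createDataFrame(
                rows.map(r -> new Participant(r[2], r[4])), Participant.class);

        // 2. DataFrame DSL (Spark SQL)
        df.groupBy("company")
          .agg(countDistinct(col("email")).as("participants"))
          .show();

        // 3. DataFrame SQL (Spark SQL)
        df.registerTempTable("participants");
        sqlContext.sql("SELECT company, COUNT(DISTINCT email) AS participants "
                + "FROM participants GROUP BY company").show();

        sc.stop();
    }
}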
46. Persistence & Caching
• by default, stores the data in the JVM heap as deserialized objects
• possibility to store on disk as deserialized/serialized objects
• off-heap caching is experimental and uses Tachyon (see the sketch below)
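For example, against the Spark 1.x Java API (assuming the JavaSparkContext sc from the sketches above; a storage level can be assigned only once per RDD, so the alternatives are commented out):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

JavaRDD<String> lines = sc.textFile("participants.csv");

// default: deserialized objects on the JVM heap
lines.cache(); // shorthand for lines.persist(StorageLevel.MEMORY_ONLY())

// alternatives, one per RDD:
// lines.persist(StorageLevel.MEMORY_AND_DISK());     // spill deserialized objects to disk
// lines.persist(StorageLevel.MEMORY_AND_DISK_SER()); // serialized: cheaper on memory, costs CPU
// lines.persist(StorageLevel.OFF_HEAP());            // experimental, Tachyon-backed

lines.unpersist(); // evict the cached partitions when done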
55. Memory Management
Tune the executor memory fractions (see the sketch below):
• RDD storage (60%)
• shuffle and aggregation buffers (20%)
• user code (20%)
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
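In Spark 1.x these fractions are plain configuration settings; a sketch (the values are illustrative, not recommendations):

SparkConf conf = new SparkConf()
        .set("spark.storage.memoryFraction", "0.5")  // RDD storage share, 0.6 by default
        .set("spark.shuffle.memoryFraction", "0.3"); // shuffle/aggregation buffers, 0.2 by default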
56. Memory Management
Tune the storage level (see the sketch below):
• store in memory and/or on disk
• store as deserialized/serialized objects
• replicate each partition on 1 or 2 cluster nodes
• store in Tachyon
https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
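Replication and Tachyon are just further storage levels; continuing the persistence sketch above:

// keep each cached partition on 2 cluster nodes:
lines.persist(StorageLevel.MEMORY_ONLY_2());
// or serialized with a disk fallback, also replicated:
// lines.persist(StorageLevel.MEMORY_AND_DISK_SER_2());
// or off-heap in Tachyon (experimental in Spark 1.x):
// lines.persist(StorageLevel.OFF_HEAP());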
57. Level of Parallelism
• spark.task.cpus
• 1 task per partition, using 1 core to execute
• spark.default.parallelism
• can be controlled via (see the sketch after this list):
• repartition() and coalesce() functions
• degree of parallelism as an operation parameter
• storage system matters
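A sketch of those knobs on the Java API (the partition counts are illustrative):

// default number of partitions for shuffles when none is given explicitly
SparkConf conf = new SparkConf().set("spark.default.parallelism", "8");

JavaRDD<String> lines = sc.textFile("participants.csv");

// repartition() changes the partition count with a full shuffle
JavaRDD<String> wide = lines.repartition(16);

// coalesce() merges into fewer partitions and avoids a full shuffle
JavaRDD<String> narrow = wide.coalesce(4);

// many shuffle operations also take the degree of parallelism as a parameter,
// e.g. on a JavaPairRDD<String, Long> named pairs:
// pairs.reduceByKey(Long::sum, 8);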
58. Data Locality
• check data locality via the UI
• configure data locality settings if needed (see the sketch after this list):
• spark.locality.wait timeout
• execute certain jobs on the driver:
• spark.localExecution.enabled
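Both settings are plain configuration; a sketch against the Spark 1.x defaults:

SparkConf conf = new SparkConf()
        // milliseconds to wait for a data-local task slot before settling
        // for a less local one; 3000 is the default
        .set("spark.locality.wait", "3000")
        // let short actions such as first()/take() run directly on the driver
        .set("spark.localExecution.enabled", "true");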
60. Java API Drawbacks
• parts of the API are marked experimental or intended for development only
• the Spark Java API may lag behind the Scala API, which is the main focus
63. Use Cases
• supplement a Neo4j database used to store/query big dimensions
• supplement an RDBMS for querying high volumes of data
64. Use Cases
• represent an existing computational graph as a flow of Spark-based operations
• predictive analytics based on the Spark MLlib component
65. Lessons Learned
• Spark's simplicity is deceptive
• Each use case is unique
• Be really aware of:
• Databricks blog
• Mailing lists & JIRA
• Pull requests
Spark is kind of magic
68. Project Tungsten
• the largest change to Spark's execution engine since the project's inception
• focuses on substantially improving the efficiency of memory and CPU for Spark applications
• built on sun.misc.Unsafe (see the sketch below)
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
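Not Spark's actual code, but a tiny sketch of the sun.misc.Unsafe primitive that Tungsten builds on: raw off-heap memory the garbage collector never sees:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeDemo {
    public static void main(String[] args) throws Exception {
        // Unsafe is not meant for application code; grab it via reflection
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long address = unsafe.allocateMemory(8); // 8 raw off-heap bytes
        unsafe.putLong(address, 42L);
        System.out.println(unsafe.getLong(address)); // 42
        unsafe.freeMemory(address);                  // manual lifetime management
    }
}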
70. References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (early release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/
Editor's Notes
Cluster Manager: Standalone, Apache Mesos, Hadoop YARN
The cluster manager should be chosen and configured properly
Monitoring via web UI(s) and metrics
Web UI:
master web UI
worker web UI
driver web UI - available only during execution
history server - spark.eventLog.enabled = true
Metrics based on Coda Hale Metrics library. Can be reported via HTTP, JMX, and CSV files.
Serialization: default and Kryo