Are you a Java developer interested in big data processing who has never had the chance to work with Apache Spark? My presentation aims to help you get familiar with Spark concepts and start developing your own distributed processing application.
11. Different processing model
•More operations available
•Flexible way of composing operations
•Pluggable data sources
•Streaming capabilities built-in
•Pluggable algorithms
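A minimal sketch of what this flexible composition looks like in the Java API, assuming spark-core is on the classpath and a local master; the class name and sample data are illustrative. Several operations chain fluently into one pipeline, where classic MapReduce would force everything into map and reduce phases:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ComposeOps {
    // Chains filter, map, and reduce in one fluent pipeline.
    static int sumOfEvenSquares() {
        SparkConf conf = new SparkConf()
                .setAppName("compose-demo")
                .setMaster("local[*]"); // run locally, one thread per core
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
            return nums.filter(n -> n % 2 == 0) // keep even numbers: 2, 4, 6
                       .map(n -> n * n)         // square them: 4, 16, 36
                       .reduce(Integer::sum);   // aggregate: 4 + 16 + 36
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvenSquares()); // 56
    }
}
```

The same fluent style extends to the pluggable pieces listed above: a different data source only changes how the initial RDD is created, not the rest of the pipeline.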
18. Resilient Distributed Dataset (RDD)
•Can be cached in memory or persisted to storage
•Immutable
•Enables parallel operations on collections of elements
•Contains lineage information
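The immutability and lineage points can be seen directly from the API; a small sketch, again assuming spark-core on the classpath and a local master (class and method names are mine, not Spark's):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddLineage {
    static String lineageOf() {
        SparkConf conf = new SparkConf()
                .setAppName("rdd-demo")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> base = sc.parallelize(Arrays.asList(1, 2, 3, 4));
            // Transformations never mutate 'base'; each returns a new, immutable RDD
            JavaRDD<Integer> doubled = base.map(n -> n * 2);
            // Every RDD records how it was derived from its parents; toDebugString
            // prints that lineage, which Spark uses to recompute lost partitions
            return doubled.toDebugString();
        }
    }

    public static void main(String[] args) {
        System.out.println(lineageOf());
    }
}
```

The printed lineage shows the map-derived RDD on top and the original parallelized collection beneath it, which is exactly the information Spark replays for fault recovery.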
22. Spark terminology
•Job – the work required to compute an RDD
•Stage – a wave of work within a job, corresponding to one or more pipelined RDDs
•Task – a unit of work within a stage, corresponding to one RDD partition
•Shuffle – the transfer of data between stages
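A small word-count sketch shows where these terms land in real code, assuming spark-core on the classpath; the class name and data are illustrative. The pair-building step pipelines into one stage, `reduceByKey` introduces the shuffle boundary, and the final action submits the whole job:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class StagesDemo {
    static long distinctWordCount() {
        SparkConf conf = new SparkConf()
                .setAppName("stages-demo")
                .setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Two partitions, so the first stage runs as two tasks
            JavaRDD<String> words =
                sc.parallelize(Arrays.asList("a", "b", "a", "c", "b", "a"), 2);
            // mapToPair runs pipelined with its parent inside the first stage
            JavaPairRDD<String, Integer> pairs =
                words.mapToPair(w -> new Tuple2<>(w, 1));
            // reduceByKey needs all values for a key together, so Spark inserts
            // a shuffle here: the first stage ends, data is transferred across
            // partitions, and a second stage begins
            JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);
            // count() is an action: it submits the job, i.e. both stages
            return counts.count();
        }
    }

    public static void main(String[] args) {
        System.out.println(distinctWordCount()); // 3 distinct words
    }
}
```

Running this with the Spark UI open (http://localhost:4040 by default) makes the job/stage/task breakdown visible exactly as defined above.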
24. Conclusion
•Spark is:
•A complete, standalone solution for distributed processing
•A fluent API for composing operations
•Pluggable with other big data frameworks
•One of the most actively developed Apache projects