This Edureka Apache Spark Interview Questions and Answers tutorial helps you understand how to tackle questions in a Spark interview and gives you an idea of the questions that can be asked. The questions cover a wide range of Spark components. Below are the topics covered in this tutorial:
1. Basic Questions
2. Spark Core Questions
3. Spark Streaming Questions
4. Spark GraphX Questions
5. Spark MLlib Questions
6. Spark SQL Questions
Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica... - Edureka!
This Edureka "Apache Spark Training" tutorial will talk about how Apache Spark works practically. We have demonstrated a Movie Recommendation Project using Apache Spark in this tutorial. Below are the topics covered in this tutorial:
1) Use Cases Of Real Time Analytics
2) Movie Recommendation System Using Spark
3) What Is Spark?
4) Getting Movie Dataset
5) Spark Streaming
6) Collaborative Filtering
7) Spark MLlib
8) Fetching Results
9) Storing Results
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... - Edureka!
This Edureka Spark Tutorial will help you understand the basics of Apache Spark. It is ideal for both beginners and professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |... - Edureka!
This Edureka Spark Hadoop Tutorial will help you understand how to use Spark and Hadoop together. It is ideal for both beginners and professionals who want to learn or brush up on their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Spark Overview
2) Hadoop Overview
3) Spark vs Hadoop
4) Why Spark Hadoop?
5) Using Hadoop With Spark
6) Use Case - Sports Analytics (NBA)
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, analytics requirements are changing so rapidly that businesses must either keep up or be left behind.
Writing Apache Spark and Apache Flink Applications Using Apache Bahir - Luciano Resende
Big data is all about being able to access and process data in various formats and from various sources. Apache Bahir provides extensions to distributed analytics platforms, giving them access to different data sources. In this talk we introduce Apache Bahir and the connectors it provides for Apache Spark and Apache Flink. We also go over the details of how to build, test, and deploy a Spark application using the MQTT data source for the Apache Spark 2.0 Structured Streaming functionality.
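For a rough idea of what such an application looks like, here is a minimal, hypothetical sketch using Bahir's MQTT source for Structured Streaming; the broker URL and topic name are placeholders, and it assumes the spark-sql-streaming-mqtt artifact from Apache Bahir is on the classpath.

// A sketch of a Structured Streaming job reading from MQTT via Apache Bahir.
// Assumes org.apache.bahir:spark-sql-streaming-mqtt is on the classpath and
// a broker is reachable at tcp://localhost:1883; the topic is illustrative.
import org.apache.spark.sql.SparkSession

object MqttStreamExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MqttStreamExample").getOrCreate()

    val lines = spark.readStream
      .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
      .option("topic", "sensors/temperature") // illustrative topic name
      .load("tcp://localhost:1883")           // broker URL

    // Echo each incoming message to the console as it arrives.
    val query = lines.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}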
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu... - Edureka!
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Programming will give you complete insight into the fundamental concepts of PySpark. These concepts include the following:
1. PySpark
2. RDDs
3. DataFrames
4. PySpark SQL
5. PySpark Streaming
6. Machine Learning (MLlib)
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ... - Edureka!
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Training will help you learn the PySpark API. You will see how Python can be used with Apache Spark for big data analytics. Edureka's structured PySpark training will help you master the skills required to become a successful Spark developer using Python and prepare you for the Cloudera Hadoop and Spark Developer Certification Exam (CCA175).
Improving Python and Spark (PySpark) Performance and Interoperability - Wes McKinney
Slides from Spark Summit East 2017 — February 9, 2017 in Boston. Discusses ongoing development work to accelerate Python-on-Spark performance using Apache Arrow and other tools
Oracle REST Data Services Best Practices / Overview - Kris Rice
This slide deck goes over the basic architecture of Oracle REST Data Services. It also points out various features to enable in order to make the best use of the product and safely expose an Oracle Database for RESTful access.
A quick review of REST, and then how to make your Oracle tables and views available to REST applications using Oracle SQL Developer and Oracle REST Data Services.
The first part of the talk describes the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, it focuses on recent advancements in Oozie for dependency management among pipeline stages; incremental and partial processing; combinatorial, conditional, and optional processing; priority processing; late processing; and BCP management. The second part of the talk focuses on out-of-the-box support for Spark jobs.
Speakers:
Purshotam Shah is a senior software engineer with the Hadoop team at Yahoo, and an Apache Oozie PMC member and committer.
Satish Saley is a software engineer at Yahoo!. He contributes to Apache Oozie.
This is a version of a talk I presented at Spark Summit East 2016 with Rachel Warren. In this version, I also discuss memory management on the JVM with pictures from Alexey Grishchenko, Sandy Ryza, and Mark Grover.
How to use Impala query plan and profile to fix performance issues - Cloudera, Inc.
Apache Impala is an exceptional, best-of-breed massively parallel processing SQL query engine and a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses, explains how Impala optimizes queries, and shows how to identify performance bottlenecks through query plans and profiles, and how to drive Impala to its full potential.
ROCm and Distributed Deep Learning on Spark and TensorFlow - Databricks
ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning. In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with TensorFlow on ROCm, from Horovod to HopsML to Databricks' Project Hydrogen. We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the data scientists themselves), and we will investigate ways to get around these bottlenecks. The talk will include a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline written in a Jupyter notebook on Hopsworks with ROCm.
Apache Spark is an open-source framework developed by the AMPLab at the University of California, Berkeley, and subsequently donated to the Apache Software Foundation. Unlike Hadoop's two-level, disk-based MapReduce paradigm, Spark's in-memory primitives can deliver performance up to 100 times better.
99 Apache Spark interview questions for professionals - https://www.amazon.com/dp/B01N29H04T
Introduction
This guide will prepare you for an interview for an entry-level or a senior-level position as an Apache Spark developer. This book attempts to give the reader an understanding of both high-level concepts and technical details of Spark. The intention of this book is to present basic and advanced Spark-related information in the form of questions and answers.
This book also makes heavy use of diagrams to make the aforementioned concepts and details easier to understand. All resources used are referenced.
This book is useful for Apache Spark developers who would like to change jobs, or for managers and leads who are looking for a set of questions to hire new developers. It covers some of the questions which can only be answered by knowledgeable professionals. "What are the top challenges that developers face while writing Spark applications?" is a question that covers a wide array of scenarios which are faced by developers and architects while working with production systems.
Developers who are new to Apache Spark should find this book useful once they have taken some training (or self-study) and are ready to jump into Apache Spark. For data engineers who would like to leverage Hive using Spark, there are a few questions on Hive and Spark as well. Spark Machine Learning (MLlib) and Spark GraphX are covered, but not in depth, as the focus of the book is the Spark core engine.
I intend to update this text later to better accommodate beginners. For example, Apache Zeppelin or a Databricks notebook can be used initially to avoid setting up a complex environment.
You should not buy this book if you do not understand big data, Hadoop, or some kind of parallel-processing architecture. This book is mostly written with a focus on Apache Spark, but since HDFS is the most preferred storage with Spark, some proficiency is implied and there are a few questions on the topic. Thus, understanding YARN and HDFS is important if you plan to use Spark with the Hadoop ecosystem.
In order to get the most out of this book, make sure you can explain the answers in your own words. Interviewers test for both knowledge and depth. In some of the questions, various configurations are mentioned; you are not expected to know all the settings, but you are expected to have an idea of what they do and the problems they aim to solve.
The code included in this book is in Scala; however, the code can be written in R, Java, or Python with very similar syntax.
Join me in this exciting journey through Apache Spark!
Contents
1. What is the difference between Spark and Hadoop?
2. What are the differences between functional and imperative languages, and why is functional programming important?
3. What is a resilient distributed dataset (RDD)? Explain with diagrams.
4. Explain transformations and actions (in the context of RDDs).
5. What are the Spark use cases?
6. Why do we need transformations? What is lazy evaluation and why is it useful?
7. What is ParallelCollectionRDD?
8. Explain how reduceByKey and groupByKey work.
9. What is the common workflow of a Spark program?
10. Explain the Spark environment for the driver. Ref
11. What are the transformations and actions that you have used in Spark?
12. How can you minimize data transfers when working with Spark?
13. What is a lineage graph?
14. Describe the major libraries that constitute the Spark ecosystem.
15. What are the different file formats that can be used in Spark SQL?
16. What are pair RDDs?
17. What is the difference between persist() and cache()?
18. What are the various levels of persistence in Apache Spark? Ref
19. Which storage level to choose? Ref
20. Explain the advantages and drawbacks of RDDs.
21. Explain why Datasets are preferred over RDDs.
22. How to share data from a Spark RDD between two applications?
23. Does Apache Spark provide checkpointing?
24. Explain the internal workings of caching.
25. What is the function of the block manager?
26. Why does Spark SQL consider the support of indexes unimportant?
27. How to convert existing UDTFs in Hive to Scala functions and use them from Spark SQL? Explain with an example. Ref
28. Why use DataFrames and Datasets when we have RDDs? Ref Video
29. What is Catalyst and how does it work? Ref
30. What are the top challenges developers face while writing Spark applications? Ref Video
31. Explain the difference in implementation between DataFrames and Datasets.
32. How is memory handled in Datasets?
33. What are the limitations of Datasets?
34. What are the contentions with memory?
35. Show the command to run Spark in YARN client mode.
36. Show the command to run Spark in YARN cluster mode.
37. What are standalone and YARN modes?
38. Explain client mode and cluster mode in Spark.
39. Which cluster managers are supported by Spark?
40. What is executor memory?
41. What is a DStream, and what is the difference between a batch and a DStream in Spark Streaming?
42. How does Spark Streaming work?
43. What is the difference between map() and flatMap()?
44. What is the reduce() action? Is there any difference between reduce() and reduceByKey()?
45. What is the disadvantage of the reduce() action and how can we overcome this limitation?
46. What are accumulators and when are accumulators truly reliable?
47. What are broadcast variables and what advantage do they provide?
48. What is piping? Demonstrate with an example of a data pipeline.
49. What is a driver?
50. What does the Spark engine do?
51. What are the steps that occur when you run a Spark application on a cluster?
52. What is a schema RDD/DataFrame?
53. What are Row objects?
54. How does Spark achieve fault tolerance?
55. What parameter is set if cores need to be defined across executors?
56. Name a few Spark master system properties.
57. Define partitions in reference to the Spark implementation.
58. Differences between how Spark and MapReduce manage cluster resources under YARN. Ref
59. What is GraphX and what is PageRank? Ref
60. What does MLlib do? Ref
61. What is a Parquet file?
62. Why is Parquet used for Spark SQL? Ref
63. What is schema evolution and what is its disadvantage? Explain schema merging in reference to Parquet files. Ref
64. Will Spark replace MapReduce?
65. What is a Spark executor?
66. Name the different types of cluster managers in Spark.
67. In how many ways can we create RDDs? Show examples.
68. How do you flatten rows in Spark? Explain with an example. Ref
69. What is Hive on Spark?
70. Explain the Spark Streaming architecture.
71. What are the types of transformations on DStreams?
72. What is a receiver in Spark Streaming, and can you build custom receivers?
73. Explain the process of storing live-streamed DStream data to a database. Ref
74. How is Spark Streaming fault tolerant?
75. Explain the transform() method used on DStreams. Ref
76. What file systems does Spark support?
77. How is data security achieved in Spark?
78. Explain Kerberos security. Ref
79. Name the various types of distributions that Spark supports.
80. Show some example queries using the Scala DataFrame API. Ref
81. What are the conditions under which the Spark driver can parallelize datasets as RDDs?
82. Can the repartition() operation decrease the number of partitions? Ref
83. What is the drawback of the repartition() and coalesce() operations?
84. In a join operation, for example val joinVal = rddA.join(rddB), will it generate a partition?
85. Consider the following code in Spark; what is the final value in the fVal variable?
86. Scala pattern matching: show the various ways the code can be written.
87. What is the returned result when a query is executed using Spark SQL or Hive? Hint: RDD or DataFrame/Dataset?
88. If we want to display just the schema of a DataFrame/Dataset, what method is called?
89. Show various implementations for the following query in Spark.
90. What are the most important factors you want to consider when you start a machine learning project?
91. As a data scientist, which algorithm would you suggest if legal aspects and ease of explanation to non-technical people are the main criteria?
92. For a supervised learning algorithm, what percentage of data is split between the training and test datasets?
93. Compare the performance of the Avro and Parquet file formats and their usage (in the context of Spark).
94. … these web services?
95. When should you not use Spark?
96. Can you use Spark to access and analyze data stored in Cassandra databases?
97. With which mathematical properties can you achieve parallelism?
98. What are the various types of partitioning in Apache Spark?
99. How to set partitioning for data in Apache Spark?
1. What is the difference between Spark and Hadoop?
Feature comparison:

Inspiration
Spark: Hadoop MapReduce and the Scala programming language; developed by UC Berkeley's AMPLab in 2009; uses generalized computation instead of MapReduce; RDBMS-style query optimization; real-time processing capability.
Hadoop: Google's 2004 papers outlining MapReduce; no query optimization; batch processing.

Speed
Spark: up to 100x faster in memory and 10x faster on disk.
Hadoop: heavy, disk-read-I/O intensive.

Ease of use
Spark: easy to write applications in Java, Scala, Python, or R (functional programming style); interactive shell available for Scala and Python; high-level, simple map/reduce operations.
Hadoop: Java imperative programming style; no shell; complex map-reduce operations.

Iterative workflows
Spark: great at iterative workloads (machine learning, etc.).
Hadoop: not ideal for iterative work.

Tools
Spark: well-integrated tools (Spark SQL, Streaming, MLlib, and GraphX) for developing complex analytical applications.
Hadoop: loosely coupled large set of tools, but mature.

Deployment
Spark: Hadoop YARN, Mesos, Amazon EC2.
Hadoop: usually uses Oozie and Azkaban to create workflows.

Data sources
Spark: HDFS (Hadoop), HBase, Cassandra, MongoDB, Amazon S3, RDBMS, files, sockets, Twitter.
Hadoop: RDBMS (using Sqoop), streaming (using Flume).

Applications
Spark: runs multiple jobs in sequence or in parallel; application processes, called executors, run on the cluster's workers.
Hadoop: each job is a separate unit; processes data with MapReduce and writes the data to storage.

Executors
Spark: an executor can run multiple tasks in a single process.
Hadoop: each MapReduce task runs in its own process.

Shuffle
Spark: sorts its partitions during shuffle only when the number of partitions is above the configured threshold (200 by default).
Hadoop: always sorts its partitions during shuffle.

Shared variables
Spark: broadcast variables are read-only (lookup) variables shipped only once to each worker; accumulators let workers add values while the driver reads the data, and are fault tolerant.
Hadoop: counters, which also carry additional (system) metrics.

Persisting/caching RDDs
Spark: cached RDDs can be used and reused across operations, increasing processing speed.
Hadoop: none.

Lazy evaluation
Spark: transformation functions and the execution plan are bundled together and execute only when an RDD action function is called.
Hadoop: none.

Memory management and compression
Spark: memory is conserved because of the compact format; speed is improved by custom code generation.
Hadoop: custom compression can be achieved using Avro or Kryo; no memory management.

Optimizer and query planning
Spark: the optimizer is a rule executor for logical plans that applies a collection of logical-plan optimizations; encoders are generated via runtime code generation, and the generated code can operate directly on the Tungsten compact format; a query is optimized as a logical and a physical plan (inspired by RDBMS query planning and optimization).
Hadoop: none.
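To make the lazy-evaluation and caching rows above concrete, here is a minimal Scala sketch; the names and data are illustrative, not from the book.

// Lazy evaluation and caching in a nutshell.
import org.apache.spark.sql.SparkSession

object LazyCacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LazyCacheDemo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(1 to 10000)
    // Transformation plus a cache marker: nothing has executed yet.
    val evens = nums.filter(_ % 2 == 0).cache()

    // The first action runs the bundled plan and populates the cache...
    println(evens.count())        // 5000
    // ...and later actions reuse the cached partitions instead of recomputing.
    println(evens.reduce(_ + _))  // 25005000

    spark.stop()
  }
}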
2. What are the differences between functional and imperative languages, and why is functional programming important?
The following features of Scala make it uniquely suitable for Spark:
Immutability - Immutable means that you can't change your variables; you mark them as final in Java, or use the val keyword in Scala.
Higher-order functions - These are functions that take other functions as parameters, or whose result is a function. Here is a function apply which takes another function f and a value v and applies f to v: def apply(f: Int => String, v: Int) = f(v)
Lazy evaluation - A lazy val is executed when it is accessed for the first time; until then it is not executed at all.
Pattern matching - Scala has a built-in general pattern-matching mechanism. It allows matching on any sort of data with a first-match policy.
Currying - Turning a function of several parameters into a chain of one-parameter functions. If we turn such a function into a function object that we can assign or pass around, its signature looks like this: val sizeConstraintFn: IntPairPred => Int => Email => Boolean = sizeConstraint _
Partial application - When applying a function, you do not pass in arguments for all of the parameters it defines, but only for some of them, leaving the remaining ones blank. What you get back is a new function whose parameter list contains only those parameters from the original function that were left blank.
Monads - Most Scala collections are monadic, and operating on them using map and flatMap operations, or using for-comprehensions, is referred to as monadic style.
Programming approach differences:

Programmer focus
Imperative: how to perform tasks (algorithms) and how to track changes in state.
Functional: what information is desired and what transformations are required.

State changes
Imperative: important.
Functional: non-existent.

Order of execution
Imperative: important.
Functional: low importance.

Primary flow control
Imperative: loops, conditionals, and function (method) calls.
Functional: function calls, including recursion.

Primary manipulation unit
Imperative: instances of structures or classes.
Functional: functions as first-class objects and data collections.
3. What is a resilient distributed dataset (RDD)? Explain with diagrams.
A resilient distributed dataset (RDD) is a read-only, fault-tolerant collection of objects partitioned across a cluster of computers that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system such as a shared filesystem, HDFS, HBase, S3, Cassandra, or an RDBMS.
RDDs are the basic abstraction in Apache Spark, representing the data coming into the system in object format. They are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records, which are:
Immutable - RDDs cannot be altered.
Resilient - If a node holding a partition fails, another node takes over the data.
Lazily evaluated
Cacheable
Type inferred
Ref
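A minimal sketch of the two creation paths described above; the storage path is hypothetical.

// Two ways to create an RDD: from a driver-side collection, or from external storage.
import org.apache.spark.sql.SparkSession

object RddCreation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RddCreation").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // 1) Parallelize an existing collection in the driver program.
    val fromCollection = sc.parallelize(Seq("a", "b", "c"))

    // 2) Reference a dataset in an external storage system (HDFS, S3, shared FS, ...).
    val fromStorage = sc.textFile("hdfs:///data/input.txt") // hypothetical path

    println(fromCollection.count()) // 3
    spark.stop()
  }
}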
4. Explain transformations and actions (in the context of RDDs)
Transformations are functions executed on demand to produce a new RDD. They are evaluated lazily, and their results materialize only when an action follows them. Some examples of transformations include map, filter, and reduceByKey.
reduceByKey merges the values for each key using an associative and commutative reduce function. It also performs the merging locally on each mapper before sending results to a reducer, similar to a "combiner" in MapReduce.
Actions are the results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local (driver) machine. Some examples of actions include reduce, collect, first, and take.