SlideShare a Scribd company logo
MSDS7330-403-FINAL PAPER 1
Teaching Apache Spark: Demonstrations on the
Databricks Cloud Platform
Yao Yao and Mooyoung Lee
ABSTRACT—Apache Spark is a fast and general engine for
big data analytics processing with libraries for SQL,
streaming, and advanced analytics. In this project, we will
present an overview of Spark with the goal of giving
students enough guidance to enable them to quickly begin
using Spark 2.1 via the online Databricks Community
Edition for future projects.
INDEX TERMS—Spark, Databricks, MapReduce, Python,
SQL
I. INTRODUCTION
ur goal is to present a concise overview of Apache Spark
to our cohorts about its capabilities, uses, and
applications via the Databricks cloud platform. The
presentation will include original content, but will leverage
existing resources where appropriate. This project consists of
three deliverables: a 90-minute in-class presentation, recorded
videos suitable for use on the 2DS asynchronous platform, and
this paper documenting the key information. At the end of the
presentation, students should feel confident knowing the
basics of Spark to implement for future projects.
II. PROBLEM STATEMENT
There are three main challenges that Spark 2.0 intends to
solve: CPU speed, end-to-end applications using one engine,
and decision implementation based on reliable real-time
data[1]
. We plan to not only teach our fellow cohorts about the
fundamentals of how to use Spark and the capabilities of the
Spark ecosystem, but also provide compelling demonstrations
of Spark code that would inspire future data science projects
to be implemented on the online Databricks cloud platform.
III. RESEARCH METHODOLIGY
We learned how to use Spark on the online Databricks cloud
platform via the Databricks documentation, Lynda.com, IBM's
cognitiveclass, and O'Reilly's Definitive Guide. The videos are
broken up into four parts to address the basic introduction and
three demonstrations that use case studies to highlight the
functionality of Spark.
Yao Yao is a graduate student in the Masters of Science in Data Science
program at Southern Methodist University (e-mail: yaoyao@smu.edu). He
resides in Chicago, Illinois.
Mooyoung Lee is a graduate student in the Masters of Science in Data
Science program at Southern Methodist University (e-mail:
mooyoungl@smu.edu). He resides in Plano, Texas.
The first video introduces Spark with its development history,
ecosystem, and integrations. The first demo covers how to
sign up for the community edition of Databricks.com, create a
cluster, load in datasets and libraries, find datasets using the
shell command, run various code using different programming
commands and languages, check revision history, comment
code, create a dashboard, and publish a notebook.
The second video addresses the CPU speed challenges,
parallel computing, provides Spark's alternative solution to
MapReduce, and Project Tungsten for Spark 2.0[1]
. The second
demo shows how sc.parallelize is used for parallel computing
in a word count example, linear regression speed, and
visualizations with GraphX.
The third video addresses the challenge to unify all analytics
under one engine and how Spark allows for end-to-end
applications to be built by being able to switch between
languages, implement specific libraries, call on the same
dataset in the same notebook, and call high level APIs to allow
for vertical integration[1]
. The third demo shows different
programming languages calling on the same data set, Spark
visualizations, Spark SQL, and the ability to read in and write
the schema for JSON data.
The fourth video addresses the challenges with the reliability
of data streaming and the solutions that Spark provides for
node crashes, asequential data, and data consistency[2]
. The
fourth demo shows a streaming example and a custom
application with imported library packages.
IV. HISTORY / ECOSYSTEM
Apache Spark is an unified analytics platform developed in
2009 at UC Berkley's AMPLab. It became open source in
2010, donated to the Apache Software Foundation in 2013,
and is a top-contributed Apache project since 2014. Spark 1.0
was released in 2014, Project Tungsten was released in 2015,
and Spark 2.0 was released in 2016 (Display 1).
Display 1. Timeline of Apache Spark and its developments[3]
.
O
MSDS7330-403-FINAL PAPER 2
Display 2: Spark's Ecosystem with Core API, platform, and
high level APIs[4]
.
Spark uses common programming languages such as R, SQL,
Python, Scala, and Java in the Core API for the end-user to be
more productive towards various tasks (Display 2). Users can
build their predictive models based on various library
capabilities, while coding effectively in the language of their
choice, being able to switch languages, and call on the same
dataset in the same notebook[4]
.
High level APIs such as Spark SQL and DataFrames, data
streaming, machine learning library, and GraphX allows users
to build end-to-end apps[1]
. The Spark open source ecosystem
also allows the integration to external environments, data
sources, and applications. Databricks cluster manager is used
for the online platform. Spark can also run with Hadoop’s
YARN, Mesos, or as a standalone[5]
. 
V. SPARK 2.0'S SOLUTIONS
1. CPU Usage
2010 2017
Storage 100 MB/s (HDD) 1000 MB/s (SSD)
Network 1 GB/s 10 GB/s
CPU ~3 GHz ~3 GHz
Display 3. Hardware trends of current computing systems[1]
As storage capacity and network increased by 10-fold, CPU
speed remained relatively the same. Hardware may have more
cores, but the speed is not 10 times faster (Display 3)[1]
. We
are close to the end of frequency scaling for CPU, where the
speed cannot run more cycles per second without using more
power and generating excessive heat.
To accommodate for the CPU speed, hardware manufacturers
have created multiple core parallel computing processes,
which requires a form of MapReduce to compensate for
distributed computation[1,6]
.
Display 4: MapReduce jobs execute in sequence due to the
synchronization step[6]
The original MapReduce executed jobs in a simple but rigid
structure, where  "map" is the transformation step for local
computation for each record, "shuffle" is the synchronization
step, and "reduce" is the communication step to combine the
results from all the nodes in a cluster (Display 4)[7]
.
MapReduce jobs execute in sequence and each of those jobs
are high-latency, where no subsequent job could start until the
previous job had finished completely.
Spark instead uses column ranges that keep their schema
model and indexing, which could be read inside MapReduce
records. Spark also uses an alternative multi-step directed
acyclic graphs (DAGs), which mitigates slow nodes by
executing nodes all at once and not step by step[7]
. This
eliminates the synchronization step required by MapReduce
and lowers latency (Display 5). In addition, Spark supports in-
memory data sharing across DAGs, so that different jobs can
work with the same data at very high speeds[7]
.
Display 5: Spark is able to outperform MapReduce[8]
.
To further optimize Spark's CPU usage, Project Tungsten for
Spark 2.0 mitigates runtime code generation, removes
expensive iterator calls, and fuse multiple operators, where
less excessive data is created in the memory[1]
. User code is
converted into binary, which complies to pointer arithmetic for
the runtime to be more efficient (Display 6)[1]
. This allows the
user to code efficiently while performance is increased.
MSDS7330-403-FINAL PAPER 3
Dataframe code df.where(df("year") > 2015)
Logical expression GreaterThan(year#345, literal(2015))
Java Bytecode bool filter(Object baseObject) {
int offset = baseOffset +
bitSetWidthInBytes + 3*8L;
int value = Platform.getInt(baseObject,
offset);
return value > 2015;}
Display 6: The conversion of DataFrame code by the user into
Java Bytecode for faster performance[1]
The Databricks online platform also benefits from cloud
computing, where one can increase memory by purchasing
multiple clusters for running parallel jobs[9]
. Cloud computing
also mitigates load time and frees up your local machine for
other purposes.
2. End-to-End Applications
With multiple specialized engines intended for different
applications, there are more systems to install, configure,
connect, manage, and debug. Performance dwindles because it
is hard to move big data across nodes and dynamically
allocate resources for different computations. Writing temp
data to file for another engine to run analysis slows down
processes between systems[1]
.
With the same engine, data sets could be cached in RAM
while implementing different computations and dynamically
allocating resources[10]
. Built-in visualizations, high level
APIs, and the ability to run multiple languages in the same
notebook allow for vertical integration[4]
. Performance gains
are cumulative due to the aggregation of marginal gains while
creating diverse products using multiple components (Display
7).
Display 7: Diversity in application solutions using multiple
components in Spark[11]
3. Decisions Based on Data Streaming
Streamed data may not be reliable due to node crashes, where
artificial fluctuations in the data are caused by latency of the
system, asequential data, where data is not in order, and data
inconsistency, where multiple data storage locations may have
different results (Display 8)[12]
. Since Spark nodes have
columns indexing, each node is tagged with time stamps and
special referencing, where the fault tolerance fixes node
crashes, special filtering fixes asequential data, and overall
indexing fixes data consistency.
Display 8: Artificial fluctuations in the streaming data are
caused by latency of the system and node crashes[12]
The benefits of using Spark is the speed needed to quickly use
streaming data to make real time decisions and the
sophistication needed to run best algorithms on the data. Being
able to implement ad-hoc queries on top of streaming data,
static data, and batch processes would allow more flexibility
in the real-time decision making process than a pure streaming
system (Display 9)[12]
.
Display 9: Pure streaming system verses continuous
application streaming with queries and batch jobs[12]
Continuous applications built on structured streaming allow
for personalized results from web searches, automated stock
trading, trends in news events, credit card fraud prevention,
and video streaming, which benefit from implementing quick
decisions per device based on various Spark applications from
the cloud (Display 10)[13]
.
MSDS7330-403-FINAL PAPER 4
Display 10: Spark streaming solutions where each device can
make decisions based on real-time data[13]
VI. CONCLUSION
Spark continues to dominate the analytical landscape with its
efficient solutions to CPU usage, end-to-end applications, and
data streaming[1]
. The diversity of organizations, job fields,
and applications that use Spark will continue to grow as more
people find use in its various implementations and advanced
analytics (Display 11)[11]
.
Display 11: Diversity in jobs that use Apache Spark[11]
VII. DELIVERABLES
A. Combined Video
Getting Started with Spark, Spark's Approach to CPU Usage,
End-to-End Applications, and Data Streaming Applications:
B. In-Class Presentation
Conducted on 8/2/17 at 8:30 PM CST to Dr. Sohail Rafiqi's
MSDS7330 section 403 class
C. Other Resources
This paper, demonstration notebooks, and PowerPoint slides
are located here: https://github.com/yaowser/learn-spark
VIII. PREVIOUS WORK
[1] M. Zaharia. What to Expect for Big Data and Apache
Spark in 2017. Available: https://youtu.be/vtxwXSGl9V8 and
https://www.slideshare.net/databricks/spark-summit-east-
2017-matei-zaharia-keynote-trends-for-big-data-and-apache-
spark-in-2017
[2] Apache Spark Official Website. Available:
http://spark.apache.org/
[3] Jump Start with Apache Spark 2.0 on Databricks.
Available: https://www.slideshare.net/databricks/jump-start-
with-apache-spark-20-on-databricks
[4] R. Rivera. The future of Apache Spark. Available:
https://www.linkedin.com/pulse/future-apache-spark-rodrigo-
rivera
[5] D. Lynn. Apache Spark Cluster Managers: YARN, Mesos,
or Standalone? Available: http://www.agildata.com/apache-
spark-cluster-managers-yarn-mesos-or-standalone/
[6] M. Zaharia. Three Ways Spark Went Against
Conventional Wisdom. Available:
https://youtu.be/y7KQcwK2w9I and
https://www.slideshare.net/databricks/unified-big-data-
processing-with-apache-spark-qcon-2014
[7] M. Olsen. MapReduce and Spark. Available:
http://vision.cloudera.com/mapreduce-spark/
[8] Performance Optimization Case Study: Shattering
Hadoop's Sort Record with Spark and Scala. Available:
https://www.slideshare.net/databricks/2015-0317-scala-days
[9] Databricks Cloud Community Edition. Available:
http://community.cloud.databricks.com/
[10] Introduction to Apache Spark on Databricks. Available:
https://community.cloud.databricks.com/?o=71876330457650
22#notebook/418623867444693/command/418623867444694
[11] A. Choudhary. 2016 Spark Survey. Available:
https://www.slideshare.net/abhishekcreate/2016-spark-survey
[12] M. Zaharia, et al. Structured Streaming In Apache Spark.
Available: https://databricks.com/blog/2016/07/28/structured-
streaming-in-apache-spark.html
[13] M. Zaharia. How Companies are Using Spark, and Where
the Edge in Big Data Will Be. Available:
https://youtu.be/KspReT2JjeE
MSDS7330-403-FINAL PAPER 5
IX. RESOURCES
Databricks Spark Documentation. Available:
https://docs.databricks.com/spark/latest/
M. Mayo. Top Spark Ecosystem Projects. Available:
http://www.kdnuggets.com/2016/03/top-spark-ecosystem-
projects.html
B. Yavuz. Using 3rd Party Libraries in Databricks: Apache
Spark Packages and Maven Libraries. Available:
https://databricks.com/blog/2015/07/28/using-3rd-party-
libraries-in-databricks-apache-spark-packages-and-maven-
libraries.html
Apache Spark YouTube Channel. Available:
http://www.youtube.com/channel/UCRzsq7k4-kT-
h3TDUBQ82-w
B. Sullins. Apache Spark Essential Training. Available:
https://www.lynda.com/Apache-Spark-tutorials/Apache-
Spark-Essential-Training/550568-2.html
L. Langit. Extending Hadoop Data Sceince Streaming Spark
Storm Kafka. Available: https://www.lynda.com/Hadoop-
tutorials/Extending-Hadoop-Data-Science-Streaming-Spark-
Storm-Kafka/516574-2.html
Spark Fundamentals I. Available:
https://courses.cognitiveclass.ai/courses/course-
v1:BigDataUniversity+BD0211EN+2016/info
Spark Fundamentals II. Available:
https://courses.cognitiveclass.ai/courses/course-
v1:BigDataUniversity+BD0212EN+2016/info
Apache Spark Makers Build. Available:
https://courses.cognitiveclass.ai/courses/course-
v1:BigDataUniversity+TMP0105EN+2016/info
Exploring Spark's GraphX. Available:
https://courses.cognitiveclass.ai/courses/course-
v1:BigDataUniversity+BD0223EN+2016/info
Analyzing Big Data in R using Apache Spark. Available:
https://courses.cognitiveclass.ai/courses/course-
v1:BigDataUniversity+RP0105EN+2016/info
Definitive Guide Apache Spark Excerpts. Available:
http://go.databricks.com/definitive-guide-apache-spark
Definitive Guide Apache Spark Raw Chapters. Available:
http://shop.oreilly.com/product/0636920034957.do
Spark The Definitive Guide Community Edition. Available:
https://github.com/databricks/Spark-The-Definitive-Guide

More Related Content

What's hot

Learning spark ch11 - Machine Learning with MLlib
Learning spark ch11 - Machine Learning with MLlibLearning spark ch11 - Machine Learning with MLlib
Learning spark ch11 - Machine Learning with MLlib
phanleson
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
Mario Cartia
 
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Databricks
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Deanna Kosaraju
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
Dona Mary Philip
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
Jyotasana Bharti
 
SPARK ARCHITECTURE
SPARK ARCHITECTURESPARK ARCHITECTURE
SPARK ARCHITECTURE
GauravBiswas9
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
OpenStack
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 
Spark vs Hadoop
Spark vs HadoopSpark vs Hadoop
Spark vs Hadoop
Olesya Eidam
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
Abdullah Çetin ÇAVDAR
 
Interface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkInterface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration Framework
Liang Men
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
Vigen Sahakyan
 
Apache spark
Apache sparkApache spark
Apache spark
Dona Mary Philip
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
Kumari Surabhi
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Databricks
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBench
t_ivanov
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 

What's hot (20)

Learning spark ch11 - Machine Learning with MLlib
Learning spark ch11 - Machine Learning with MLlibLearning spark ch11 - Machine Learning with MLlib
Learning spark ch11 - Machine Learning with MLlib
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
SPARK ARCHITECTURE
SPARK ARCHITECTURESPARK ARCHITECTURE
SPARK ARCHITECTURE
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Spark vs Hadoop
Spark vs HadoopSpark vs Hadoop
Spark vs Hadoop
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Interface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkInterface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration Framework
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Apache spark
Apache sparkApache spark
Apache spark
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBench
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 

Similar to Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform

A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
DataSpace Academy
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao
 
Big Data Processing Using Spark.pptx
Big  Data  Processing  Using  Spark.pptxBig  Data  Processing  Using  Spark.pptx
Big Data Processing Using Spark.pptx
DeekshaM35
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
Frank Schroeter
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
Ike Ellis
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
Happiest Minds Technologies
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
DataWorks Summit
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
Edureka!
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
Edureka!
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Spark Summit
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Dharmjit Singh
 
Review on Apache Spark Technology
Review on Apache Spark TechnologyReview on Apache Spark Technology
Review on Apache Spark Technology
IRJET Journal
 
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
Darshan Gorasiya
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
DataWorks Summit
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 

Similar to Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform (20)

A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Big Data Processing Using Spark.pptx
Big  Data  Processing  Using  Spark.pptxBig  Data  Processing  Using  Spark.pptx
Big Data Processing Using Spark.pptx
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Review on Apache Spark Technology
Review on Apache Spark TechnologyReview on Apache Spark Technology
Review on Apache Spark Technology
 
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 

More from Yao Yao

Lessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 year
Yao Yao
 
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao
 
Yelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm Paper
Yao Yao
 
Yelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm Poster
Yao Yao
 
Yelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm Powerpoint
Yao Yao
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Yao Yao
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Yao Yao
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
Yao Yao
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
Yao Yao
 
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Yao Yao
 
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Yao Yao
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
Yao Yao
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Yao Yao
 
Prediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic Regression
Yao Yao
 
Data Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity Data
Yao Yao
 
Predicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear Regression
Yao Yao
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
Yao Yao
 
API Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random moves
Yao Yao
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
Yao Yao
 

More from Yao Yao (19)

Lessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 year
 
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
 
Yelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm Paper
 
Yelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm Poster
 
Yelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm Powerpoint
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
 
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Prediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic Regression
 
Data Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity Data
 
Predicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear Regression
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 
API Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random moves
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 

Recently uploaded

2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
Rakesh Kumar R
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
Philip Schwarz
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
kalichargn70th171
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
Gerardo Pardo-Castellote
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
TheSMSPoint
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 

Recently uploaded (20)

2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 

Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform

  • 1. MSDS7330-403-FINAL PAPER 1 Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform Yao Yao and Mooyoung Lee ABSTRACT—Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics. In this project, we will present an overview of Spark with the goal of giving students enough guidance to enable them to quickly begin using Spark 2.1 via the online Databricks Community Edition for future projects. INDEX TERMS—Spark, Databricks, MapReduce, Python, SQL I. INTRODUCTION ur goal is to present a concise overview of Apache Spark to our cohorts about its capabilities, uses, and applications via the Databricks cloud platform. The presentation will include original content, but will leverage existing resources where appropriate. This project consists of three deliverables: a 90-minute in-class presentation, recorded videos suitable for use on the 2DS asynchronous platform, and this paper documenting the key information. At the end of the presentation, students should feel confident knowing the basics of Spark to implement for future projects. II. PROBLEM STATEMENT There are three main challenges that Spark 2.0 intends to solve: CPU speed, end-to-end applications using one engine, and decision implementation based on reliable real-time data[1] . We plan to not only teach our fellow cohorts about the fundamentals of how to use Spark and the capabilities of the Spark ecosystem, but also provide compelling demonstrations of Spark code that would inspire future data science projects to be implemented on the online Databricks cloud platform. III. RESEARCH METHODOLIGY We learned how to use Spark on the online Databricks cloud platform via the Databricks documentation, Lynda.com, IBM's cognitiveclass, and O'Reilly's Definitive Guide. The videos are broken up into four parts to address the basic introduction and three demonstrations that use case studies to highlight the functionality of Spark. Yao Yao is a graduate student in the Masters of Science in Data Science program at Southern Methodist University (e-mail: yaoyao@smu.edu). He resides in Chicago, Illinois. Mooyoung Lee is a graduate student in the Masters of Science in Data Science program at Southern Methodist University (e-mail: mooyoungl@smu.edu). He resides in Plano, Texas. The first video introduces Spark with its development history, ecosystem, and integrations. The first demo covers how to sign up for the community edition of Databricks.com, create a cluster, load in datasets and libraries, find datasets using the shell command, run various code using different programming commands and languages, check revision history, comment code, create a dashboard, and publish a notebook. The second video addresses the CPU speed challenges, parallel computing, provides Spark's alternative solution to MapReduce, and Project Tungsten for Spark 2.0[1] . The second demo shows how sc.parallelize is used for parallel computing in a word count example, linear regression speed, and visualizations with GraphX. The third video addresses the challenge to unify all analytics under one engine and how Spark allows for end-to-end applications to be built by being able to switch between languages, implement specific libraries, call on the same dataset in the same notebook, and call high level APIs to allow for vertical integration[1] . The third demo shows different programming languages calling on the same data set, Spark visualizations, Spark SQL, and the ability to read in and write the schema for JSON data. The fourth video addresses the challenges with the reliability of data streaming and the solutions that Spark provides for node crashes, asequential data, and data consistency[2] . The fourth demo shows a streaming example and a custom application with imported library packages. IV. HISTORY / ECOSYSTEM Apache Spark is an unified analytics platform developed in 2009 at UC Berkley's AMPLab. It became open source in 2010, donated to the Apache Software Foundation in 2013, and is a top-contributed Apache project since 2014. Spark 1.0 was released in 2014, Project Tungsten was released in 2015, and Spark 2.0 was released in 2016 (Display 1). Display 1. Timeline of Apache Spark and its developments[3] . O
  • 2. MSDS7330-403-FINAL PAPER 2 Display 2: Spark's Ecosystem with Core API, platform, and high level APIs[4] . Spark uses common programming languages such as R, SQL, Python, Scala, and Java in the Core API for the end-user to be more productive towards various tasks (Display 2). Users can build their predictive models based on various library capabilities, while coding effectively in the language of their choice, being able to switch languages, and call on the same dataset in the same notebook[4] . High level APIs such as Spark SQL and DataFrames, data streaming, machine learning library, and GraphX allows users to build end-to-end apps[1] . The Spark open source ecosystem also allows the integration to external environments, data sources, and applications. Databricks cluster manager is used for the online platform. Spark can also run with Hadoop’s YARN, Mesos, or as a standalone[5] .  V. SPARK 2.0'S SOLUTIONS 1. CPU Usage 2010 2017 Storage 100 MB/s (HDD) 1000 MB/s (SSD) Network 1 GB/s 10 GB/s CPU ~3 GHz ~3 GHz Display 3. Hardware trends of current computing systems[1] As storage capacity and network increased by 10-fold, CPU speed remained relatively the same. Hardware may have more cores, but the speed is not 10 times faster (Display 3)[1] . We are close to the end of frequency scaling for CPU, where the speed cannot run more cycles per second without using more power and generating excessive heat. To accommodate for the CPU speed, hardware manufacturers have created multiple core parallel computing processes, which requires a form of MapReduce to compensate for distributed computation[1,6] . Display 4: MapReduce jobs execute in sequence due to the synchronization step[6] The original MapReduce executed jobs in a simple but rigid structure, where  "map" is the transformation step for local computation for each record, "shuffle" is the synchronization step, and "reduce" is the communication step to combine the results from all the nodes in a cluster (Display 4)[7] . MapReduce jobs execute in sequence and each of those jobs are high-latency, where no subsequent job could start until the previous job had finished completely. Spark instead uses column ranges that keep their schema model and indexing, which could be read inside MapReduce records. Spark also uses an alternative multi-step directed acyclic graphs (DAGs), which mitigates slow nodes by executing nodes all at once and not step by step[7] . This eliminates the synchronization step required by MapReduce and lowers latency (Display 5). In addition, Spark supports in- memory data sharing across DAGs, so that different jobs can work with the same data at very high speeds[7] . Display 5: Spark is able to outperform MapReduce[8] . To further optimize Spark's CPU usage, Project Tungsten for Spark 2.0 mitigates runtime code generation, removes expensive iterator calls, and fuse multiple operators, where less excessive data is created in the memory[1] . User code is converted into binary, which complies to pointer arithmetic for the runtime to be more efficient (Display 6)[1] . This allows the user to code efficiently while performance is increased.
  • 3. MSDS7330-403-FINAL PAPER 3 Dataframe code df.where(df("year") > 2015) Logical expression GreaterThan(year#345, literal(2015)) Java Bytecode bool filter(Object baseObject) { int offset = baseOffset + bitSetWidthInBytes + 3*8L; int value = Platform.getInt(baseObject, offset); return value > 2015;} Display 6: The conversion of DataFrame code by the user into Java Bytecode for faster performance[1] The Databricks online platform also benefits from cloud computing, where one can increase memory by purchasing multiple clusters for running parallel jobs[9] . Cloud computing also mitigates load time and frees up your local machine for other purposes. 2. End-to-End Applications With multiple specialized engines intended for different applications, there are more systems to install, configure, connect, manage, and debug. Performance dwindles because it is hard to move big data across nodes and dynamically allocate resources for different computations. Writing temp data to file for another engine to run analysis slows down processes between systems[1] . With the same engine, data sets could be cached in RAM while implementing different computations and dynamically allocating resources[10] . Built-in visualizations, high level APIs, and the ability to run multiple languages in the same notebook allow for vertical integration[4] . Performance gains are cumulative due to the aggregation of marginal gains while creating diverse products using multiple components (Display 7). Display 7: Diversity in application solutions using multiple components in Spark[11] 3. Decisions Based on Data Streaming Streamed data may not be reliable due to node crashes, where artificial fluctuations in the data are caused by latency of the system, asequential data, where data is not in order, and data inconsistency, where multiple data storage locations may have different results (Display 8)[12] . Since Spark nodes have columns indexing, each node is tagged with time stamps and special referencing, where the fault tolerance fixes node crashes, special filtering fixes asequential data, and overall indexing fixes data consistency. Display 8: Artificial fluctuations in the streaming data are caused by latency of the system and node crashes[12] The benefits of using Spark is the speed needed to quickly use streaming data to make real time decisions and the sophistication needed to run best algorithms on the data. Being able to implement ad-hoc queries on top of streaming data, static data, and batch processes would allow more flexibility in the real-time decision making process than a pure streaming system (Display 9)[12] . Display 9: Pure streaming system verses continuous application streaming with queries and batch jobs[12] Continuous applications built on structured streaming allow for personalized results from web searches, automated stock trading, trends in news events, credit card fraud prevention, and video streaming, which benefit from implementing quick decisions per device based on various Spark applications from the cloud (Display 10)[13] .
  • 4. MSDS7330-403-FINAL PAPER 4 Display 10: Spark streaming solutions where each device can make decisions based on real-time data[13] VI. CONCLUSION Spark continues to dominate the analytical landscape with its efficient solutions to CPU usage, end-to-end applications, and data streaming[1] . The diversity of organizations, job fields, and applications that use Spark will continue to grow as more people find use in its various implementations and advanced analytics (Display 11)[11] . Display 11: Diversity in jobs that use Apache Spark[11] VII. DELIVERABLES A. Combined Video Getting Started with Spark, Spark's Approach to CPU Usage, End-to-End Applications, and Data Streaming Applications: B. In-Class Presentation Conducted on 8/2/17 at 8:30 PM CST to Dr. Sohail Rafiqi's MSDS7330 section 403 class C. Other Resources This paper, demonstration notebooks, and PowerPoint slides are located here: https://github.com/yaowser/learn-spark VIII. PREVIOUS WORK [1] M. Zaharia. What to Expect for Big Data and Apache Spark in 2017. Available: https://youtu.be/vtxwXSGl9V8 and https://www.slideshare.net/databricks/spark-summit-east- 2017-matei-zaharia-keynote-trends-for-big-data-and-apache- spark-in-2017 [2] Apache Spark Official Website. Available: http://spark.apache.org/ [3] Jump Start with Apache Spark 2.0 on Databricks. Available: https://www.slideshare.net/databricks/jump-start- with-apache-spark-20-on-databricks [4] R. Rivera. The future of Apache Spark. Available: https://www.linkedin.com/pulse/future-apache-spark-rodrigo- rivera [5] D. Lynn. Apache Spark Cluster Managers: YARN, Mesos, or Standalone? Available: http://www.agildata.com/apache- spark-cluster-managers-yarn-mesos-or-standalone/ [6] M. Zaharia. Three Ways Spark Went Against Conventional Wisdom. Available: https://youtu.be/y7KQcwK2w9I and https://www.slideshare.net/databricks/unified-big-data- processing-with-apache-spark-qcon-2014 [7] M. Olsen. MapReduce and Spark. Available: http://vision.cloudera.com/mapreduce-spark/ [8] Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala. Available: https://www.slideshare.net/databricks/2015-0317-scala-days [9] Databricks Cloud Community Edition. Available: http://community.cloud.databricks.com/ [10] Introduction to Apache Spark on Databricks. Available: https://community.cloud.databricks.com/?o=71876330457650 22#notebook/418623867444693/command/418623867444694 [11] A. Choudhary. 2016 Spark Survey. Available: https://www.slideshare.net/abhishekcreate/2016-spark-survey [12] M. Zaharia, et al. Structured Streaming In Apache Spark. Available: https://databricks.com/blog/2016/07/28/structured- streaming-in-apache-spark.html [13] M. Zaharia. How Companies are Using Spark, and Where the Edge in Big Data Will Be. Available: https://youtu.be/KspReT2JjeE
  • 5. MSDS7330-403-FINAL PAPER 5 IX. RESOURCES Databricks Spark Documentation. Available: https://docs.databricks.com/spark/latest/ M. Mayo. Top Spark Ecosystem Projects. Available: http://www.kdnuggets.com/2016/03/top-spark-ecosystem- projects.html B. Yavuz. Using 3rd Party Libraries in Databricks: Apache Spark Packages and Maven Libraries. Available: https://databricks.com/blog/2015/07/28/using-3rd-party- libraries-in-databricks-apache-spark-packages-and-maven- libraries.html Apache Spark YouTube Channel. Available: http://www.youtube.com/channel/UCRzsq7k4-kT- h3TDUBQ82-w B. Sullins. Apache Spark Essential Training. Available: https://www.lynda.com/Apache-Spark-tutorials/Apache- Spark-Essential-Training/550568-2.html L. Langit. Extending Hadoop Data Sceince Streaming Spark Storm Kafka. Available: https://www.lynda.com/Hadoop- tutorials/Extending-Hadoop-Data-Science-Streaming-Spark- Storm-Kafka/516574-2.html Spark Fundamentals I. Available: https://courses.cognitiveclass.ai/courses/course- v1:BigDataUniversity+BD0211EN+2016/info Spark Fundamentals II. Available: https://courses.cognitiveclass.ai/courses/course- v1:BigDataUniversity+BD0212EN+2016/info Apache Spark Makers Build. Available: https://courses.cognitiveclass.ai/courses/course- v1:BigDataUniversity+TMP0105EN+2016/info Exploring Spark's GraphX. Available: https://courses.cognitiveclass.ai/courses/course- v1:BigDataUniversity+BD0223EN+2016/info Analyzing Big Data in R using Apache Spark. Available: https://courses.cognitiveclass.ai/courses/course- v1:BigDataUniversity+RP0105EN+2016/info Definitive Guide Apache Spark Excerpts. Available: http://go.databricks.com/definitive-guide-apache-spark Definitive Guide Apache Spark Raw Chapters. Available: http://shop.oreilly.com/product/0636920034957.do Spark The Definitive Guide Community Edition. Available: https://github.com/databricks/Spark-The-Definitive-Guide