Learn how to use PySpark for processing massive amounts of data. Combined with the GitHub repo - https://github.com/rdempsey/pyspark-for-data-processing - this presentation will help you gain familiarity with processing data using Python and Spark.
If you're thinking about machine learning and not sure if it can help improve your business, but want to find out, set up a free 20-minute consultation with us: https://calendly.com/robertwdempsey/free-consultation
Using PySpark to Process
Boat Loads of Data
Robert Dempsey, CEO
Atlantic Dominion Solutions
We’ve mastered three jobs so you can
focus on one - growing your business.
The Three Jobs
At Atlantic Dominion Solutions we perform three functions for our clients:
Consulting: we assess and advise in the areas of technology,
team and process to determine how machine learning can have
the biggest impact on your business.
Implementation: after a strategy session to determine the work
you need, we get to work using our proven methodology and
begin delivering smarter applications.
Training: continuous improvement requires continuous learning.
We provide both on-premises and online training.
Co-authoring the book Building
Machine Learning Pipelines.
Written for software developers and
data scientists, Building Machine
Learning Pipelines teaches the skills
required to create and use the
infrastructure needed to run
modern intelligent systems.
Writing the Book
Robert Dempsey, CEO
Books and online courses
Lotus Guides, District Data Labs
Atlantic Dominion Solutions, LLC
1. When acquiring knowledge, start by going wide instead of deep.
2. Always focus on what's important to people rather than
just the technology.
3. Be able to clearly communicate what you know with others.
Got Clean Air?
• Clean air is important.
• Toxic pollutants are known or suspected of causing cancer,
reproductive effects, birth defects, and adverse environmental effects.
Questions to Answer
1. Which state has the highest level of pollutants?
2. Which county has the highest level of pollutants?
3. What are the top 5 pollutants by unit of measure?
4. What are the trends of pollutants by state over time?
What Is PySpark?
• An API that exposes the Spark programming model to Python
• Built on top of Spark's Java API
• Data is processed with Python and cached/shuffled in the JVM
• Driver programs
• Launch parallel operations on a cluster
• Contain application functions
• Define distributed datasets
• Access Spark through a SparkContext
• Uses Py4J to launch a JVM and create a JavaSparkContext
When to Use It
• When you need to…
• Process boat loads of data (TB)
• Perform operations that require all the data to be in
memory (machine learning)
• Efficiently process streaming data
• Create an overly complicated use case to present at a meetup
• Create a virtual environment (Anaconda)
• Install dependencies
• Run docker-compose to create the Spark containers
• Run a script (or all of them!) per the README
• Data explorer notebook
• Q1 - Most polluted state
• Q2 - Most polluted county
• Q3 - Top pollutants by unit of measure
• Q4 - Pollutants over time