Using PySpark to Process Boat Loads of Data

Robert Dempsey, CEO
Atlantic Dominion Solutions

We’ve mastered three jobs so you can
focus on one - growing your business.

The Three Jobs
At Atlantic Dominion Solutions we perform three functions for
our customers:
Consulting: we assess and advise in the areas of technology,
team and process to determine how machine learning can have
the biggest impact on your business.
Implementation: after a strategy session to determine the work
you need, we apply our proven methodology and begin
delivering smarter applications.
Training: continuous improvement requires continuous learning.
We provide both on-premises and online training.

Writing the Book
Co-authoring the book Building Machine Learning Pipelines.
Written for software developers and data scientists, Building
Machine Learning Pipelines teaches the skills required to create
and use the infrastructure needed to run modern intelligent
systems.
machinelearningpipelines.com

Robert Dempsey, CEO
Professional: Software Engineer
Author: Books and online courses
Instructor: Lotus Guides, District Data Labs
Owner: Atlantic Dominion Solutions, LLC

MTAC Framework™
Mindset
Toolbox
Application
Communication

Core Principles
1. When acquiring knowledge, start by going wide instead of
deep.
2. Always focus on what's important to people rather than
just the technology.
3. Be able to clearly communicate what you know to
others.

MTAC Framework™ Applied
Mindset: use-case centric example
Toolbox: Python, PySpark, Docker
Application: Code & Analysis
Communication: Q&A

Keep It Simple
Image: Jesse van Dijk : http://jessevandijkart.com/the-labyrinth-of-tsan-kamal/

Solve the Problem
Image: Paulo : https://paullus23.deviantart.com/art/Bliss-soccer-field-326563199

Got Clean Air?
• Clean air is important.
• Toxic pollutants are known or suspected to cause cancer,
reproductive effects, birth defects, and adverse
environmental effects.

Questions to Answer
1. Which state has the highest level of pollutants?
2. Which county has the highest level of pollutants?
3. What are the top 5 pollutants by unit of measure?
4. What are the trends of pollutants by state over time?

The Core of Spark
• Computational engine that schedules, distributes and
monitors computational tasks running on a cluster

Higher Level Tools
• Spark SQL: SQL and structured data
• MLlib: machine learning
• GraphX: graph processing
• Spark Streaming: process streaming data

Storage
• Local file system
• Amazon S3
• Cassandra
• Hive
• HBase
• File formats
• Text files
• Sequence files
• Avro
• Parquet
• Hadoop Input Format
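
As a rough illustration of the storage list above, the sketch below picks a Spark reader based on a file extension. The names FORMATS, format_for and load are hypothetical helpers, not part of the talk's repository. It assumes an existing SparkSession is passed in, that the external spark-avro package is on the classpath if Avro files are read, and that the S3 connector is configured if s3a:// paths are used.

```python
import os

# Hypothetical mapping: file extension -> Spark DataFrameReader format name.
FORMATS = {
    ".csv": "csv",
    ".txt": "text",
    ".parquet": "parquet",
    ".avro": "avro",  # assumes the external spark-avro package is available
}


def format_for(path):
    """Return the reader format for a path, based only on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in FORMATS:
        raise ValueError(f"no reader configured for {path!r}")
    return FORMATS[ext]


def load(spark, path):
    """Load a local, HDFS, or S3 path into a DataFrame.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.read.format(format_for(path)).load(path)
```

The same call would cover a local path such as /data/air.csv and a remote one such as s3a://bucket/air.parquet, since Spark resolves the storage layer from the path scheme.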

Hadoop?
• Not necessary, but…
• If you have multiple nodes you need a resource manager
like YARN or Mesos
• You'll need access to distributed storage like HDFS,
Amazon S3 or Cassandra
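
To make the resource manager choice concrete, here is a minimal sketch of how a job might select its Spark master URL. The helpers master_url and build_conf are hypothetical; the URL forms ("local[*]", "yarn", "mesos://host:port") are the standard Spark master URL formats, and 5050 is assumed as the Mesos master port.

```python
def master_url(manager, host=None, port=5050):
    """Map a resource manager name to a Spark master URL (hypothetical helper)."""
    if manager == "local":
        return "local[*]"  # single machine, one worker thread per core
    if manager == "yarn":
        return "yarn"  # cluster location is read from the Hadoop config directory
    if manager == "mesos":
        if host is None:
            raise ValueError("a Mesos master host is required")
        return f"mesos://{host}:{port}"
    raise ValueError(f"unknown resource manager: {manager!r}")


def build_conf(app_name, manager, host=None):
    """Build a SparkConf for the chosen resource manager."""
    # Imported here so master_url can be tried without Spark installed.
    from pyspark import SparkConf

    return SparkConf().setAppName(app_name).setMaster(master_url(manager, host))
```

On a single machine, master_url("local") is enough and no Hadoop components are involved.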

What Is PySpark?
• An API that exposes the Spark programming model to
Python
• Built on top of Spark's Java API
• Data is processed with Python and cached/shuffled in the
JVM
• Driver programs

Driver Programs
• Launch parallel operations on a cluster
• Contain application functions
• Define distributed datasets
• Access Spark through a SparkContext
• Uses Py4J to launch a JVM and create a
JavaSparkContext
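
The points above can be sketched as a small driver program. This is an illustration under stated assumptions, not the code from the talk: the input is assumed to be a headerless CSV with the hypothetical columns state, county, pollutant, value, and the names parse_record, add and run are made up for this example.

```python
def parse_record(line):
    """Application function: turn 'state,county,pollutant,value' into (state, value)."""
    state, _county, _pollutant, value = line.split(",")
    return state.strip(), float(value)


def add(a, b):
    """Application function: combine two partial sums for the same key."""
    return a + b


def run(path, master="local[*]"):
    """Driver entry point: sum the measured values per state."""
    # Imported here so the helpers above can be tried without Spark installed.
    from pyspark import SparkContext

    # Creating the SparkContext is the step that launches the JVM through Py4J.
    sc = SparkContext(master=master, appName="pollution-by-state")
    try:
        lines = sc.textFile(path)  # define a distributed dataset (RDD)
        totals = lines.map(parse_record).reduceByKey(add)  # parallel operations
        return totals.collect()  # bring the per-state totals back to the driver
    finally:
        sc.stop()
```

The two helper functions are plain Python, so they can be checked locally before the job is submitted to a cluster.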

When to Use It
• When you need to…
• Process boat loads of data (TB)
• Perform operations that require all the data to be in
memory (machine learning)
• Efficiently process streaming data
• Create an overly complicated use case to present at a
meetup

Docker
• Software container platform
• Containers package the application and its dependencies and
share the host kernel instead of bundling a full guest OS
• Deploy anywhere with the same CPU architecture (x86-64,
ARM)
• Available for *nix, Mac, Windows

Architecture #1
[Diagram: data flow in three stages (1, 2, 3). Components shown: Agent, File System, Apache Spark, File System, Agent, ES]

Architecture #2
[Diagram: data flow in three stages (1, 2, 3). Components shown: three Agents, S3, Apache Spark, S3, Athena]

Architecture #3
[Diagram: data flow in three stages (1, 2, 3). Components shown: three Agents, Apache Kafka, Apache Spark, ES, S3, HDFS, HBase]

What We’ll Build (Simple)
[Diagram: data flow in three stages (1, 2, 3). Components shown: Agent, File System, Apache Spark, File System]

Python
• Analysis
• Visualization
• Code in our Spark jobs

PySpark
• Process all the data!
• Perform aggregations

Docker
• Run Spark in a Docker container.
• So you don’t have to install Spark itself.

README
• https://github.com/rdempsey/pyspark-for-data-processing
• Create a virtual environment (Anaconda)
• Install dependencies
• Run docker-compose to create the Spark containers
• Run a script (or all of them!) per the README

Dive In
• Data explorer notebook
• Q1 - Most polluted state
• Q2 - Most polluted county
• Q3 - Top pollutants by unit of measure
• Q4 - Pollutants over time
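
As a sketch of the kind of aggregation behind Q1 and Q2, the function below averages a measurement column per group with the DataFrame API. The column names (state, county, value), the GROUP_COLUMNS mapping and the average_by helper are assumptions for illustration; the real dataset's schema and the scripts in the repository may differ, and using the mean as the "level" of pollution is itself a choice.

```python
# Hypothetical column names; adjust to the real dataset's schema.
GROUP_COLUMNS = {
    "q1": ["state"],            # most polluted state
    "q2": ["state", "county"],  # most polluted county
}


def average_by(df, question, value_col="value"):
    """Return the mean of value_col per group, highest average first.

    `df` is assumed to be a Spark DataFrame holding one measurement per row.
    """
    # Imported here so GROUP_COLUMNS can be inspected without Spark installed.
    from pyspark.sql import functions as F

    cols = GROUP_COLUMNS[question]
    return (
        df.groupBy(*cols)
        .agg(F.avg(value_col).alias("avg_value"))
        .orderBy(F.desc("avg_value"))
    )
```

Taking the first row of average_by(df, "q1") would give one answer to Q1. Q3 and Q4 need a different shape (a ranking within each unit of measure, and an ordering by year), so they are left out of this sketch.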

Intro to Data Science for
Software Engineers
Goes live October 23, 2017
Normally: $97
Pre-Launch: $47
http://lotusguides.com

Where to Find Me
Website: robertwdempsey.com
Lotus Guides: lotusguides.com
LinkedIn: robertwdempsey
Twitter: rdempsey
GitHub: rdempsey
