A One Hour Introduction to Analytics with PySpark
Introduction to PySpark
http://www.slideshare.net/rjurney/introduction-to-pyspark
or
http://bit.ly/intro_to_pyspark
Agile Data Science 2.0
Russell Jurney
2
Skills: Data Engineer 85%, Data Scientist 85%, Visualization Software Engineer 85%, Writer 85%, Teacher 50%
Russell Jurney is a veteran data
scientist and thought leader. He
coined the term Agile Data Science in
the book of that name from O’Reilly
in 2012, which outlines the first agile
development methodology for data
science. Russell has constructed
numerous full-stack analytics
products over the past ten years and
now works with clients helping them
extract value from their data assets.
Russell Jurney, Principal Consultant at Data Syndrome
Russell Jurney
Data Syndrome, LLC
Email : russell.jurney@gmail.com
Web : datasyndrome.com
Principal Consultant
Building Full-Stack Data Analytics Applications with Spark
http://bit.ly/agile_data_science
Agile Data Science 2.0
Agile Data Science 2.0 4
Realtime Predictive
Analytics
Rapidly learn to build entire predictive systems driven by
Kafka, PySpark, Spark Streaming and Spark MLlib, with a web
front-end using Python/Flask and jQuery.
Available for purchase at http://datasyndrome.com/video
Product Consulting
We build analytics products and systems
consisting of big data viz, predictions,
recommendations, reports and search.
Corporate Training
We offer training courses for data
scientists, data engineers and data
science teams.
Video Training
We offer video training courses that rapidly
acclimate you to a technology and its
techniques.
Agile Data Science 2.0 6
What is Spark? What makes it go?
Concepts
Data Syndrome: Agile Data Science 2.0
Hadoop / HDFS
HDFS splits large data among many machines
7
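To make this concrete, here is a minimal sketch (not from the book) of reading an HDFS-resident file from PySpark; the namenode host, port and path are hypothetical. Each HDFS block of the file backs one or more partitions of the resulting RDD.
# Hypothetical HDFS URL; each HDFS block becomes one or more RDD partitions
lines = sc.textFile("hdfs://namenode:8020/data/example.csv")
print(lines.getNumPartitions())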
Data Syndrome: Agile Data Science 2.0
Hadoop / MapReduce
In the beginning there was MapReduce
8
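As a rough sketch of the paradigm, here is a word count written in PySpark but shaped like a MapReduce job, assuming the course's data/example.csv file: a map phase that emits (word, 1) pairs and a reduce phase that sums them per key.
words = sc.textFile("data/example.csv") \
  .flatMap(lambda line: line.split(","))        # "map" phase: emit tokens
pairs = words.map(lambda word: (word, 1))       # emit (key, value) pairs
counts = pairs.reduceByKey(lambda a, b: a + b)  # "reduce" phase: sum per key
counts.collect()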
Data Syndrome: Agile Data Science 2.0
Spark / RDD
Spark RDDs are iterable MapReduce relations
9
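A minimal sketch of the RDD API: transformations such as map and filter are lazy and chainable, and an action such as collect triggers the computation.
numbers = sc.parallelize([1, 2, 3, 4, 5])
squares = numbers.map(lambda x: x * x)        # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)  # another transformation
evens.collect()                               # action: returns [4, 16]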
Data Syndrome: Agile Data Science 2.0
Spark / DataFrame
Fast SQLish RDD thingies
10
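A small sketch of the DataFrame API, using a couple of rows borrowed from the example data later in the deck: named columns, a schema, and SQL-like select/filter operations.
from pyspark.sql import Row

df = spark.createDataFrame([
  Row(name="Russell Jurney", company="Data Syndrome", title="Principal Consultant"),
  Row(name="Don Brown", company="Rocana", title="CIO")
])
df.printSchema()
df.filter(df.title == "CIO").select("name", "company").show()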
Data Syndrome: Agile Data Science 2.0
Spark Streaming
Spark on realtime streams in mini-batches
11
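A minimal Spark Streaming sketch using the DStream API: word counts over one-second mini-batches read from a TCP socket. The localhost:9999 source is hypothetical.
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=1)
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
  .map(lambda word: (word, 1)) \
  .reduceByKey(lambda a, b: a + b)
counts.pprint()  # print each mini-batch's counts
ssc.start()
ssc.awaitTermination()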
Data Syndrome: Agile Data Science 2.0
Spark Ecosystem
Lots of cool stuff working together…
12
Agile Data Science 2.0 13
Setting up our class environment
Setup
Data Syndrome: Agile Data Science 2.0 14
Python 3 > 2.7
While the break in compatibility between Python 2.X
and 3.X was unfortunate and unnecessary, Python 3
has increasingly become the platform of choice for
analytics work. With a few alterations, all code in this
course will execute in a Python 2.7 environment.
Data Syndrome: Agile Data Science 2.0 15
Virtualbox
Virtualbox is a Free and Open Source Software (FOSS)
virtualization product for AMD64/Intel64 processors. It
supports many operating systems, and is under active
development.
https://www.virtualbox.org/wiki/Downloads
Data Syndrome: Agile Data Science 2.0 16
Vagrant
Vagrant sits on top of Virtualbox and provides easy-to-use,
reproducible development environments.
https://www.vagrantup.com/downloads.html
Data Syndrome: Agile Data Science 2.0
Vagrant Setup
17
Initializing our Vagrant Environment
# Get the project
git clone https://github.com/rjurney/Agile_Data_Code_2/
# Setup and connect to our virtual machine
vagrant up; vagrant ssh
# Now, from within Vagrant
cd Agile_Data_Code_2
./intro_download.sh
# See Appendix A and install.sh for manual install
Data Syndrome: Agile Data Science 2.0 18
Amazon EC2
Alternatively, Amazon Web Services provide a simple
way to launch a prepared image for use in this exercise.
Data Syndrome: Agile Data Science 2.0
EC2 Setup for Ubuntu Linux
19
Initializing our EC2 Environment
# See ec2.sh, which uses aws/ec2_bootstrap.sh
# To use, add: --user-data file://aws/ec2_bootstrap.sh
# Get the project
git clone git@github.com:rjurney/Agile_Data_Code_2.git
# Setup AWS CLI tools
pip install awscli
# Edit and run r3.xlarge instance with your key
./ec2.sh
# ssh to the machine
Data Syndrome: Agile Data Science 2.0
EC2 Setup for Ubuntu Linux
Initializing our EC2 Environment
20
# Contents of ec2.sh
# Launch our instance, which ec2_bootstrap.sh will initialize
aws ec2 run-instances \
    --image-id ami-4ae1fb5d \
    --key-name agile_data_science \
    --user-data file://aws/ec2_bootstrap.sh \
    --instance-type r3.xlarge \
    --ebs-optimized \
    --placement "AvailabilityZone=us-east-1d" \
    --block-device-mappings '{"DeviceName":"/dev/sda1","Ebs":{"DeleteOnTermination":false,"VolumeSize":1024}}' \
    --count 1
Data Syndrome: Agile Data Science 2.0
EC2 Setup for Ubuntu Linux
Initializing our EC2 Environment
21
# Download the data
cd Agile_Data_Code_2
./intro_download.sh
Data Syndrome: Agile Data Science 2.0
Documentation Setup
Opening the right web pages to answer your questions
23
http://spark.apache.org/docs/latest/api/python/pyspark.html
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html
Agile Data Science 2.0 24
Learning the basics of PySpark
Basic PySpark
Data Syndrome: Agile Data Science 2.0
Hello, World!
How to load data and perform an operation on it in Spark
25
# See ch02/spark.py
# Load the text file using the SparkContext

csv_lines = sc.textFile("data/example.csv")



# Map the data to split the lines into a list

data = csv_lines.map(lambda line: line.split(","))



# Collect the dataset into local RAM

data.collect()
Data Syndrome: Agile Data Science 2.0
Creating Objects from CSV using a function
How to create objects from CSV using a function instead of a lambda
26
# See ch02/groupby.py
csv_lines = sc.textFile("data/example.csv")



# Turn the CSV lines into objects
def csv_to_record(line):
  parts = line.split(",")
  record = {
    "name": parts[0],
    "company": parts[1],
    "title": parts[2]
  }
  return record

# Apply the function to every record
records = csv_lines.map(csv_to_record)

# Inspect the first item in the dataset
records.first()
Data Syndrome: Agile Data Science 2.0
Using a GroupBy to Count Jobs
Count things using the groupBy API
27
# Group the records by the name of the person
grouped_records = records.groupBy(lambda x: x["name"])

# Show the first group
grouped_records.first()

# Count the groups
job_counts = grouped_records.map(
  lambda x: {
    "name": x[0],
    "job_count": len(x[1])
  }
)

job_counts.first()

job_counts.collect()
Data Syndrome: Agile Data Science 2.0
Map vs FlatMap
Understanding the difference between these two operators
28
# See ch02/flatmap.py
csv_lines = sc.textFile("data/example.csv")



# Compute a relation of words by line
words_by_line = csv_lines \
  .map(lambda line: line.split(","))

words_by_line.collect()

# Compute a relation of words
flattened_words = csv_lines \
  .map(lambda line: line.split(",")) \
  .flatMap(lambda x: x)

flattened_words.collect()
Data Syndrome: Agile Data Science 2.0
Map vs FlatMap
Understanding the difference between these two operators
29
words_by_line.collect()
[['Russell Jurney', 'Relato', 'CEO'],
['Florian Liebert', 'Mesosphere', 'CEO'],
['Don Brown', 'Rocana', 'CIO'],
['Steve Jobs', 'Apple', 'CEO'],
['Donald Trump', 'The Trump Organization', 'CEO'],
['Russell Jurney', 'Data Syndrome', 'Principal Consultant']]
flattened_words.collect()
['Russell Jurney',
'Relato',
'CEO',
'Florian Liebert',
'Mesosphere',
'CEO',
'Don Brown',
'Rocana',
'CIO',
'Steve Jobs',
'Apple',
'CEO',
'Donald Trump',
'The Trump Organization',
'CEO',
'Russell Jurney',
'Data Syndrome',
'Principal Consultant']
Data Syndrome: Agile Data Science 2.0
Using DataFrames and Spark SQL to Count Jobs
Converting an RDD to a DataFrame to use Spark SQL
30
# See ch02/sql.py
csv_lines = sc.textFile("data/example.csv")



from pyspark.sql import Row

# Convert the CSV into a pyspark.sql.Row
def csv_to_row(line):
  parts = line.split(",")
  row = Row(
    name=parts[0],
    company=parts[1],
    title=parts[2]
  )
  return row

# Apply the function to get rows in an RDD
rows = csv_lines.map(csv_to_row)
Data Syndrome: Agile Data Science 2.0
Using DataFrames and Spark SQL to Count Jobs
Converting an RDD to a DataFrame to use Spark SQL
31
# Convert to a pyspark.sql.DataFrame

rows_df = rows.toDF()



# Register the DataFrame for Spark SQL

rows_df.registerTempTable("executives")



# Generate a new DataFrame with SQL using the SparkSession

job_counts = spark.sql("""
SELECT
name,
COUNT(*) AS total
FROM executives
GROUP BY name
""")

job_counts.show()



# Go back to an RDD

job_counts.rdd.collect()
Agile Data Science 2.0 32
Working with a more complex dataset
Exploratory Data Analysis
with Airline Data
Data Syndrome: Agile Data Science 2.0
Loading a Parquet Columnar File
Using the Apache Parquet format to load columnar data
33
# See ch02/load_on_time_performance.py
# Load the parquet file containing flight delay records

on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')



# Register the data for Spark SQL

on_time_dataframe.registerTempTable("on_time_performance")



# Check out the columns

on_time_dataframe.columns



# Check out some data
on_time_dataframe \
  .select("FlightDate", "TailNum", "Origin", "Dest", "Carrier", "DepDelay", "ArrDelay") \
  .show()
Data Syndrome: Agile Data Science 2.0
Sampling a DataFrame
Sampling a DataFrame to get a better view of its data
34
# Trim the fields and keep the result
trimmed_on_time = on_time_dataframe \
  .select(
    "FlightDate",
    "TailNum",
    "Origin",
    "Dest",
    "Carrier",
    "DepDelay",
    "ArrDelay"
  )

# Sample 0.01% of the data and show
trimmed_on_time.sample(False, 0.0001).show()
Data Syndrome: Agile Data Science 2.0
Calculating a Histogram
Computing the distribution of a column in a dataset
35
# See ch02/histogram.py

# Load the parquet file containing flight delay records

on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')



# Register the data for Spark SQL

on_time_dataframe.registerTempTable("on_time_performance")



# Compute a histogram of departure delays
on_time_dataframe \
  .select("DepDelay") \
  .rdd \
  .flatMap(lambda x: x) \
  .histogram(10)
Data Syndrome: Agile Data Science 2.0
Displaying a Histogram
Using pyplot to display a histogram
36
import numpy as np

import matplotlib.mlab as mlab

import matplotlib.pyplot as plt



# Function to plot a histogram using pyplot
def create_hist(rdd_histogram_data):
  """Given an RDD.histogram, plot a pyplot histogram"""
  heights = np.array(rdd_histogram_data[1])
  full_bins = rdd_histogram_data[0]
  mid_point_bins = full_bins[:-1]
  widths = [abs(i - j) for i, j in zip(full_bins[:-1], full_bins[1:])]
  bar = plt.bar(mid_point_bins, heights, width=widths, color='b')
  return bar



# Compute a histogram of departure delays using explicit buckets
departure_delay_histogram = on_time_dataframe \
  .select("DepDelay") \
  .rdd \
  .flatMap(lambda x: x) \
  .histogram([-60, -30, -15, -10, -5, 0, 5, 10, 15, 30, 60, 90, 120, 180])

create_hist(departure_delay_histogram)
Data Syndrome: Agile Data Science 2.0
Displaying a Histogram
Using pyplot to display a histogram
37
Data Syndrome: Agile Data Science 2.0
Counting Airplanes
How many airplanes are in the US fleet in total?
38
# See ch05/assess_airplanes.py
# Load the parquet file

on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

on_time_dataframe.registerTempTable("on_time_performance")



# Dump the unneeded fields

tail_numbers = on_time_dataframe.rdd.map(lambda x: x.TailNum)

tail_numbers = tail_numbers.filter(lambda x: x != '')



# distinct() gets us unique tail numbers

unique_tail_numbers = tail_numbers.distinct()



# now we need a count() of unique tail numbers

airplane_count = unique_tail_numbers.count()

print("Total airplanes: {}".format(airplane_count))
Data Syndrome: Agile Data Science 2.0
Counting Total Flights by Month
Preparing data for a chart
39
# See ch05/total_flights.py
# Load the parquet file

on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')



# Use SQL to look at the total flights by month across 2015

on_time_dataframe.registerTempTable("on_time_dataframe")

total_flights_by_month = spark.sql(

"""SELECT Month, Year, COUNT(*) AS total_flights

FROM on_time_dataframe

GROUP BY Year, Month

ORDER BY Year, Month"""

)



# This map/asDict trick makes the rows print a little prettier. It is optional.

flights_chart_data = total_flights_by_month.rdd.map(lambda row: row.asDict())

flights_chart_data.collect()
Data Syndrome: Agile Data Science 2.0
Preparing Complex Records for Storage
Getting data ready for storage in a document or key/value store
40
# See ch05/extract_airplanes.py
# Load the parquet file

on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

on_time_dataframe.registerTempTable("on_time_performance")



# Filter down to the fields we need to identify and link to a flight
flights = on_time_dataframe.rdd.map(lambda x:
  (x.Carrier, x.FlightDate, x.FlightNum, x.Origin, x.Dest, x.TailNum)
)

# Group flights by tail number, sorted by date, then flight number, then origin/dest
flights_per_airplane = flights \
  .map(lambda nameTuple: (nameTuple[5], [nameTuple[0:5]])) \
  .reduceByKey(lambda a, b: a + b) \
  .map(lambda tuple:
    {
      'TailNum': tuple[0],
      'Flights': sorted(tuple[1], key=lambda x: (x[1], x[2], x[3], x[4]))
    }
  )

flights_per_airplane.first()
Data Syndrome: Agile Data Science 2.0
Counting Flight Delays
Analyzing and understanding why flights are late
41
# See ch07/explore_delays.py
# Load the on-time parquet file

on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

on_time_dataframe.registerTempTable("on_time_performance")



total_flights = on_time_dataframe.count()



# Flights that were late leaving...

late_departures = on_time_dataframe.filter(on_time_dataframe.DepDelayMinutes > 0)

total_late_departures = late_departures.count()



# Flights that were late arriving...

late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrDelayMinutes > 0)

total_late_arrivals = late_arrivals.count()
# Get the percentage of flights that are late, rounded to 1 decimal place

pct_late = round((total_late_arrivals / (total_flights * 1.0)) * 100, 1)
Data Syndrome: Agile Data Science 2.0
Hero Flights
How many flights made up time in the air, departing late but arriving on time?
42
# See ch07/explore_delays.py
# Flights that left late but made up time to arrive on time...

on_time_heros = on_time_dataframe.filter(

(on_time_dataframe.DepDelayMinutes > 0)

&

(on_time_dataframe.ArrDelayMinutes <= 0)

)

total_on_time_heros = on_time_heros.count()
Data Syndrome: Agile Data Science 2.0
Presenting Results
Displaying in plain text the answers we’ve just calculated
43
# See ch07/explore_delays.py
print("Total flights: {:,}".format(total_flights))

print("Late departures: {:,}".format(total_late_departures))

print("Late arrivals: {:,}".format(total_late_arrivals))

print("Recoveries: {:,}".format(total_on_time_heros))

print("Percentage Late: {}%".format(pct_late))
Data Syndrome: Agile Data Science 2.0
44
# See ch07/explore_delays.py
# Get the average minutes late departing and arriving

spark.sql("""

SELECT

ROUND(AVG(DepDelay),1) AS AvgDepDelay,

ROUND(AVG(ArrDelay),1) AS AvgArrDelay

FROM on_time_performance

"""

).show()
Average Lateness Departing and Arriving
Drilling down into flights and how late they are…
Data Syndrome: Agile Data Science 2.0
Sampling Late Flights
Getting to know our data by sampling records of interest
45
# Why are flights late? Lets look at some delayed flights and the delay causes

late_flights = spark.sql("""

SELECT

ArrDelayMinutes,

WeatherDelay,

CarrierDelay,

NASDelay,

SecurityDelay,

LateAircraftDelay

FROM

on_time_performance

WHERE

WeatherDelay IS NOT NULL

OR

CarrierDelay IS NOT NULL

OR

NASDelay IS NOT NULL

OR

SecurityDelay IS NOT NULL

OR

LateAircraftDelay IS NOT NULL

ORDER BY

FlightDate

""")

late_flights.sample(False, 0.01).show()
Data Syndrome: Agile Data Science 2.0
Why are Flights Late?
Analyzing and understanding why flights are late
46
# Calculate the percentage contribution to delay for each source

total_delays = spark.sql("""

SELECT

ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_weather_delay,

ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_carrier_delay,

ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay,

ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_security_delay,

ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_late_aircraft_delay

FROM on_time_performance

""")

total_delays.show()
Data Syndrome: Agile Data Science 2.0
How Often are Weather Delayed Flights Late?
Analyzing and understanding why flights are late
47
# Eyeball the first to define our buckets
weather_delay_histogram = on_time_dataframe \
  .select("WeatherDelay") \
  .rdd \
  .flatMap(lambda x: x) \
  .histogram([1, 5, 10, 15, 30, 60, 120, 240, 480, 720, 24*60.0])

print(weather_delay_histogram)
create_hist(weather_delay_histogram)
Data Syndrome: Agile Data Science 2.0
How Often are Weather Delayed Flights Late?
Analyzing and understanding why flights are late
48
Data Syndrome: Agile Data Science 2.0
Preparing Histogram Data for d3.js
Analyzing and understanding why flights are late
49
# Transform the data into something easily consumed by d3
def histogram_to_publishable(histogram):
  record = {'key': 1, 'data': []}
  for label, value in zip(histogram[0], histogram[1]):
    record['data'].append(
      {
        'label': label,
        'value': value
      }
    )
  return record
# Recompute the weather histogram with a filter for on-time flights
weather_delay_histogram = on_time_dataframe \
  .filter(
    (on_time_dataframe.WeatherDelay.isNotNull())
    &
    (on_time_dataframe.WeatherDelay > 0)
  ) \
  .select("WeatherDelay") \
  .rdd \
  .flatMap(lambda x: x) \
  .histogram([0, 15, 30, 60, 120, 240, 480, 720, 24*60.0])

print(weather_delay_histogram)

record = histogram_to_publishable(weather_delay_histogram)
Agile Data Science 2.0 50
Building a classifier model
Predictive Analytics
Machine Learning
Data Syndrome: Agile Data Science 2.0
Download Prepared Training Data
Saving time by using a prepared dataset
51
# Be in the root directory of the project
cd Agile_Data_Code_2
# Run the download script
ch08/download_data.sh
Data Syndrome: Agile Data Science 2.0
String Vectorization
From properties of items to vector format
52
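As a toy sketch of what this diagram shows, here is string vectorization with the same tools used later in this deck: StringIndexer maps a string column to a numeric index, and VectorAssembler packs columns into a single feature vector. The tiny DataFrame below is made up for illustration.
from pyspark.ml.feature import StringIndexer, VectorAssembler

toy = spark.createDataFrame(
  [("WN", 368.0), ("AA", 2475.0), ("WN", 590.0)],
  ["Carrier", "Distance"]
)
indexed = StringIndexer(inputCol="Carrier", outputCol="Carrier_index") \
  .fit(toy).transform(toy)
assembled = VectorAssembler(
  inputCols=["Carrier_index", "Distance"],
  outputCol="Features_vec"
).transform(indexed)
assembled.show()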
Data Syndrome: Agile Data Science 2.0 53
The scikit-learn version of this model was 166 lines. Spark MLlib is very powerful!
ch08/train_spark_mllib_model.py
190 Line Model
#!/usr/bin/env python



import sys, os, re



# Pass date and base path to main() from airflow

def main(base_path):



# Default to "."

try: base_path

except NameError: base_path = "."

if not base_path:

base_path = "."



APP_NAME = "train_spark_mllib_model.py"



# If there is no SparkSession, create the environment

try:

sc and spark

except NameError as e:

import findspark

findspark.init()

import pyspark

import pyspark.sql



sc = pyspark.SparkContext()

spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()



#

# {

# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",

# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,

# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"

# }

#

from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType

from pyspark.sql.types import StructType, StructField

from pyspark.sql.functions import udf



schema = StructType([

StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0

StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"

StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"

StructField("Carrier", StringType(), True), # "Carrier":"WN"

StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31

StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4

StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365

StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0

StructField("Dest", StringType(), True), # "Dest":"SAN"

StructField("Distance", DoubleType(), True), # "Distance":368.0

StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"

StructField("FlightNum", StringType(), True), # "FlightNum":"6109"

StructField("Origin", StringType(), True), # "Origin":"TUS"

])



input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(

base_path

)

features = spark.read.json(input_path, schema=schema)

features.first()



#

# Check for nulls in features before using Spark ML

#

null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]

cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)

print(list(cols_with_nulls))



#

# Add a Route variable to replace FlightNum

#

from pyspark.sql.functions import lit, concat

features_with_route = features.withColumn(

'Route',

concat(

features.Origin,

lit('-'),

features.Dest

)

)

features_with_route.show(6)



#

# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)

#

from pyspark.ml.feature import Bucketizer



# Setup the Bucketizer

splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]

arrival_bucketizer = Bucketizer(

splits=splits,

inputCol="ArrDelay",

outputCol="ArrDelayBucket"

)



# Save the bucketizer

arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)

arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)



# Apply the bucketizer

ml_bucketized_features = arrival_bucketizer.transform(features_with_route)

ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()



#

# Extract features tools in with pyspark.ml.feature

#

from pyspark.ml.feature import StringIndexer, VectorAssembler



# Turn category fields into indexes

for column in ["Carrier", "Origin", "Dest", "Route"]:

string_indexer = StringIndexer(

inputCol=column,

outputCol=column + "_index"

)



string_indexer_model = string_indexer.fit(ml_bucketized_features)

ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)



# Drop the original column

ml_bucketized_features = ml_bucketized_features.drop(column)



# Save the pipeline model

string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(

base_path,

column

)

string_indexer_model.write().overwrite().save(string_indexer_output_path)



# Combine continuous, numeric fields with indexes of nominal ones

# ...into one feature vector

numeric_columns = [

"DepDelay", "Distance",

"DayOfMonth", "DayOfWeek",

"DayOfYear"]

index_columns = ["Carrier_index", "Origin_index",

"Dest_index", "Route_index"]

vector_assembler = VectorAssembler(

inputCols=numeric_columns + index_columns,

outputCol="Features_vec"

)

final_vectorized_features = vector_assembler.transform(ml_bucketized_features)



# Save the numeric vector assembler

vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)

vector_assembler.write().overwrite().save(vector_assembler_path)



# Drop the index columns

for column in index_columns:

final_vectorized_features = final_vectorized_features.drop(column)



# Inspect the finalized features

final_vectorized_features.show()



# Instantiate and fit random forest classifier on all the data

from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(

featuresCol="Features_vec",

labelCol="ArrDelayBucket",

predictionCol="Prediction",

maxBins=4657,

maxMemoryInMB=1024

)

model = rfc.fit(final_vectorized_features)



# Save the new model over the old one

model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(

base_path

)

model.write().overwrite().save(model_output_path)



# Evaluate model using test data

predictions = model.transform(final_vectorized_features)



from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(

predictionCol="Prediction",

labelCol="ArrDelayBucket",

metricName="accuracy"

)

accuracy = evaluator.evaluate(predictions)

print("Accuracy = {}".format(accuracy))



# Check the distribution of predictions

predictions.groupBy("Prediction").count().show()



# Check a sample

predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)



if __name__ == "__main__":

main(sys.argv[1])

Data Syndrome: Agile Data Science 2.0
Loading Our Training Data
Loading our data as a DataFrame to use the Spark ML APIs
54
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType

from pyspark.sql.types import StructType, StructField

from pyspark.sql.functions import udf



schema = StructType([

StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0

StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"

StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"

StructField("Carrier", StringType(), True), # "Carrier":"WN"

StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31

StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4

StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365

StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0

StructField("Dest", StringType(), True), # "Dest":"SAN"

StructField("Distance", DoubleType(), True), # "Distance":368.0

StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"

StructField("FlightNum", StringType(), True), # "FlightNum":"6109"

StructField("Origin", StringType(), True), # "Origin":"TUS"

])



features = spark.read.json(

"data/simple_flight_delay_features.jsonl.bz2",

schema=schema

)

features.first()
Data Syndrome: Agile Data Science 2.0
Checking the Data for Nulls
Nulls will cause problems hereafter, so detect and address them first
55
#

# Check for nulls in features before using Spark ML

#

null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]

cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)

print(list(cols_with_nulls))
Data Syndrome: Agile Data Science 2.0
Adding a Feature - The Route
Route is defined as origin airport code + “-“ + destination airport code
56
#

# Add a Route variable to replace FlightNum

#

from pyspark.sql.functions import lit, concat



features_with_route = features.withColumn(

'Route',

concat(

features.Origin,

lit('-'),

features.Dest

)

)

features_with_route.select("Origin", "Dest", "Route").show(5)
Data Syndrome: Agile Data Science 2.0
Bucketizing ArrDelay into ArrDelayBucket
We can’t classify a continuous variable, so we must bucketize it to make it nominal/categorical
57
#

# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay

#

from pyspark.ml.feature import Bucketizer



splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]

bucketizer = Bucketizer(

splits=splits,

inputCol="ArrDelay",

outputCol="ArrDelayBucket"

)

ml_bucketized_features = bucketizer.transform(features_with_route)



# Check the buckets out

ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
Data Syndrome: Agile Data Science 2.0
Indexing String Columns into Numeric Columns
Nominal/categorical/string columns need to be made numeric before we can vectorize them
58
#

# Extract features tools in with pyspark.ml.feature

#

from pyspark.ml.feature import StringIndexer, VectorAssembler



# Turn category fields into categoric feature vectors, then drop intermediate fields
for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
               "Origin", "Dest", "Route"]:
  string_indexer = StringIndexer(
    inputCol=column,
    outputCol=column + "_index"
  )
  ml_bucketized_features = string_indexer.fit(ml_bucketized_features) \
    .transform(ml_bucketized_features)

# Check out the indexes
ml_bucketized_features.show(6)
Data Syndrome: Agile Data Science 2.0
Combining Numeric and Indexed Fields into One Vector
Our classifier needs a single field, so we combine all our numeric fields into one feature vector
59
# Handle continuous, numeric fields by combining them into one feature vector
numeric_columns = ["DepDelay", "Distance"]
index_columns = ["Carrier_index", "DayOfMonth_index",
                 "DayOfWeek_index", "DayOfYear_index",
                 "Origin_index", "Dest_index", "Route_index"]
vector_assembler = VectorAssembler(
  inputCols=numeric_columns + index_columns,
  outputCol="Features_vec"
)
final_vectorized_features = vector_assembler.transform(ml_bucketized_features)



# Drop the index columns

for column in index_columns:

final_vectorized_features = final_vectorized_features.drop(column)



# Check out the features

final_vectorized_features.show()
Data Syndrome: Agile Data Science 2.0
Splitting our Data in a Test/Train Split
We need to split our data to evaluate the performance of our classifier
60
#

# Cross validate, train and evaluate classifier

#



# Test/train split

training_data, test_data = final_vectorized_features.randomSplit([0.7, 0.3])
Data Syndrome: Agile Data Science 2.0
Training Our Model
This is the magic in machine learning, and it is only a couple of lines of code
61
# Instantiate and fit random forest classifier

from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(

featuresCol="Features_vec",

labelCol="ArrDelayBucket",

maxBins=4657,

maxMemoryInMB=1024

)

model = rfc.fit(training_data)
Data Syndrome: Agile Data Science 2.0
Evaluating Our Model
Using the test/train split to evaluate our model for accuracy
62
# Evaluate model using test data

predictions = model.transform(test_data)



from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="ArrDelayBucket", metricName="accuracy")

accuracy = evaluator.evaluate(predictions)

print("Accuracy = {}".format(accuracy))
Data Syndrome: Agile Data Science 2.0
Sampling Our Predictions
Making sure they pass the sniff check
63
# Check a sample

predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
Data Syndrome: Agile Data Science 2.0
Experiment Setup
Necessary to improve model
64
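Before the full script, here is a compressed sketch (assuming the final_vectorized_features DataFrame and the column names from the previous slides) of the experiment loop it implements: several random test/train splits, one model per split, and several metrics averaged across splits.
from collections import defaultdict
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

scores = defaultdict(list)
for i in range(3):
  training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])
  model = RandomForestClassifier(
    featuresCol="Features_vec", labelCol="ArrDelayBucket",
    predictionCol="Prediction", maxBins=4657
  ).fit(training_data)
  predictions = model.transform(test_data)
  for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    evaluator = MulticlassClassificationEvaluator(
      labelCol="ArrDelayBucket", predictionCol="Prediction", metricName=metric
    )
    scores[metric].append(evaluator.evaluate(predictions))

# Average each metric across the splits
for metric, values in scores.items():
  print(metric, sum(values) / len(values))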
Data Syndrome: Agile Data Science 2.0 65
155 additional lines to set up an experiment
and add 3 new features to improve the model
ch09/improve_spark_mllib_model.py
345 L.O.C.
#!/usr/bin/env python



import sys, os, re

import json

import datetime, iso8601

from tabulate import tabulate



# Pass date and base path to main() from airflow

def main(base_path):

APP_NAME = "train_spark_mllib_model.py"



# If there is no SparkSession, create the environment

try:

sc and spark

except NameError as e:

import findspark

findspark.init()

import pyspark

import pyspark.sql



sc = pyspark.SparkContext()

spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()



#

# {

# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",

# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,

# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"

# }

#

from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType

from pyspark.sql.types import StructType, StructField

from pyspark.sql.functions import udf



schema = StructType([

StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0

StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"

StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"

StructField("Carrier", StringType(), True), # "Carrier":"WN"

StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31

StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4

StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365

StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0

StructField("Dest", StringType(), True), # "Dest":"SAN"

StructField("Distance", DoubleType(), True), # "Distance":368.0

StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"

StructField("FlightNum", StringType(), True), # "FlightNum":"6109"

StructField("Origin", StringType(), True), # "Origin":"TUS"

])



input_path = "{}/data/simple_flight_delay_features.json".format(

base_path

)

features = spark.read.json(input_path, schema=schema)

features.first()



#

# Add a Route variable to replace FlightNum

#

from pyspark.sql.functions import lit, concat

features_with_route = features.withColumn(

'Route',

concat(

features.Origin,

lit('-'),

features.Dest

)

)

features_with_route.show(6)



#

# Add the hour of day of scheduled arrival/departure

#

from pyspark.sql.functions import hour

features_with_hour = features_with_route.withColumn(

"CRSDepHourOfDay",

hour(features.CRSDepTime)

)

features_with_hour = features_with_hour.withColumn(

"CRSArrHourOfDay",

hour(features.CRSArrTime)

)

features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show()



#

# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)

#

from pyspark.ml.feature import Bucketizer



# Setup the Bucketizer

splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]

arrival_bucketizer = Bucketizer(

splits=splits,

inputCol="ArrDelay",

outputCol="ArrDelayBucket"

)



# Save the model

arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)

arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)



# Apply the model

ml_bucketized_features = arrival_bucketizer.transform(features_with_hour)

ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()



#

# Extract features tools in with pyspark.ml.feature

#

from pyspark.ml.feature import StringIndexer, VectorAssembler



# Turn category fields into indexes

for column in ["Carrier", "Origin", "Dest", "Route"]:

string_indexer = StringIndexer(

inputCol=column,

outputCol=column + "_index"

)



string_indexer_model = string_indexer.fit(ml_bucketized_features)

ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)

# Save the pipeline model

string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format(

base_path,

column

)

string_indexer_model.write().overwrite().save(string_indexer_output_path)



# Combine continuous, numeric fields with indexes of nominal ones

# ...into one feature vector

numeric_columns = [

"DepDelay", "Distance",

"DayOfMonth", "DayOfWeek",

"DayOfYear", "CRSDepHourOfDay",

"CRSArrHourOfDay"]

index_columns = ["Carrier_index", "Origin_index",

"Dest_index", "Route_index"]

vector_assembler = VectorAssembler(

inputCols=numeric_columns + index_columns,

outputCol="Features_vec"

)

final_vectorized_features = vector_assembler.transform(ml_bucketized_features)



# Save the numeric vector assembler

vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path)

vector_assembler.write().overwrite().save(vector_assembler_path)



# Drop the index columns

for column in index_columns:

final_vectorized_features = final_vectorized_features.drop(column)



# Inspect the finalized features

final_vectorized_features.show()



#

# Cross validate, train and evaluate classifier: loop 5 times for 4 metrics

#



from collections import defaultdict

scores = defaultdict(list)

feature_importances = defaultdict(list)

metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]

split_count = 3



for i in range(1, split_count + 1):

print("nRun {} out of {} of test/train splits in cross validation...".format(

i,

split_count,

)

)



# Test/train split

training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])



# Instantiate and fit random forest classifier on all the data

from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(

featuresCol="Features_vec",

labelCol="ArrDelayBucket",

predictionCol="Prediction",

maxBins=4657,

)

model = rfc.fit(training_data)



# Save the new model over the old one

model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(

base_path

)

model.write().overwrite().save(model_output_path)



# Evaluate model using test data

predictions = model.transform(test_data)



# Evaluate this split's results for each metric

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

for metric_name in metric_names:

evaluator = MulticlassClassificationEvaluator(

labelCol="ArrDelayBucket",

predictionCol="Prediction",

metricName=metric_name

)

score = evaluator.evaluate(predictions)



scores[metric_name].append(score)

print("{} = {}".format(metric_name, score))



#

# Collect feature importances

#

feature_names = vector_assembler.getInputCols()

feature_importance_list = model.featureImportances

for feature_name, feature_importance in zip(feature_names, feature_importance_list):

feature_importances[feature_name].append(feature_importance)



#

# Evaluate average and STD of each metric and print a table

#

import numpy as np

score_averages = defaultdict(float)



# Compute the table data

average_stds = [] # ha

for metric_name in metric_names:

metric_scores = scores[metric_name]



average_accuracy = sum(metric_scores) / len(metric_scores)

score_averages[metric_name] = average_accuracy



std_accuracy = np.std(metric_scores)



average_stds.append((metric_name, average_accuracy, std_accuracy))



# Print the table

print("nExperiment Log")

print("--------------")

print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))



#

# Persist the score to a score log that exists between runs

#

import pickle
# Load the score log or initialize an empty one

try:

score_log_filename = "{}/models/score_log.pickle".format(base_path)

score_log = pickle.load(open(score_log_filename, "rb"))

if not isinstance(score_log, list):

score_log = []

except IOError:

score_log = []



# Compute the existing score log entry

score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}



# Compute and display the change in score for each metric

try:

last_log = score_log[-1]

except (IndexError, TypeError, AttributeError):

last_log = score_log_entry



experiment_report = []

for metric_name in metric_names:

run_delta = score_log_entry[metric_name] - last_log[metric_name]

experiment_report.append((metric_name, run_delta))



print("nExperiment Report")

print("-----------------")

print(tabulate(experiment_report, headers=["Metric", "Score"]))



# Append the existing average scores to the log

score_log.append(score_log_entry)



# Persist the log for next run

pickle.dump(score_log, open(score_log_filename, "wb"))



#

# Analyze and report feature importance changes

#



# Compute averages for each feature

feature_importance_entry = defaultdict(float)

for feature_name, value_list in feature_importances.items():

average_importance = sum(value_list) / len(value_list)

feature_importance_entry[feature_name] = average_importance



# Sort the feature importances in descending order and print

import operator

sorted_feature_importances = sorted(

feature_importance_entry.items(),

key=operator.itemgetter(1),

reverse=True

)



print("nFeature Importances")

print("-------------------")

print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))



#

# Compare this run's feature importances with the previous run's

#



# Load the feature importance log or initialize an empty one

try:

feature_log_filename = "{}/models/feature_log.pickle".format(base_path)

feature_log = pickle.load(open(feature_log_filename, "rb"))

if not isinstance(feature_log, list):

feature_log = []

except IOError:

feature_log = []



# Compute and display the change in score for each feature

try:

last_feature_log = feature_log[-1]

except (IndexError, TypeError, AttributeError):

last_feature_log = defaultdict(float)

for feature_name, importance in feature_importance_entry.items():

last_feature_log[feature_name] = importance



# Compute the deltas

feature_deltas = {}

for feature_name in feature_importances.keys():

run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]

feature_deltas[feature_name] = run_delta



# Sort feature deltas, biggest change first

import operator

sorted_feature_deltas = sorted(

feature_deltas.items(),

key=operator.itemgetter(1),

reverse=True

)



# Display sorted feature deltas

print("nFeature Importance Delta Report")

print("-------------------------------")

print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))



# Append the existing average deltas to the log

feature_log.append(feature_importance_entry)



# Persist the log for next run

pickle.dump(feature_log, open(feature_log_filename, "wb"))



if __name__ == "__main__":

main(sys.argv[1])

Data Syndrome Russell Jurney
Principal Consultant
Email : rjurney@datasyndrome.com
Web : datasyndrome.com
Data Syndrome, LLC

More Related Content

What's hot

Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
DataStax
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
DataWorks Summit
 

What's hot (20)

Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 

Similar to Introduction to PySpark

Similar to Introduction to PySpark (20)

Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDB
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science Meetup
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Adatao Live Demo at the First Spark Summit
Adatao Live Demo at the First Spark SummitAdatao Live Demo at the First Spark Summit
Adatao Live Demo at the First Spark Summit
 
Ben ford intro
Ben ford introBen ford intro
Ben ford intro
 
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben Ford
 
BigData_Krishna Kumar Sharma
BigData_Krishna Kumar SharmaBigData_Krishna Kumar Sharma
BigData_Krishna Kumar Sharma
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 
Getting Started with Splunk Breakout Session
Getting Started with Splunk Breakout SessionGetting Started with Splunk Breakout Session
Getting Started with Splunk Breakout Session
 
Big data oracle_introduccion
Big data oracle_introduccionBig data oracle_introduccion
Big data oracle_introduccion
 
Druid Adoption Tips and Tricks
Druid Adoption Tips and TricksDruid Adoption Tips and Tricks
Druid Adoption Tips and Tricks
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
You Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing It
You Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing ItYou Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing It
You Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing It
 
User 2013-oracle-big-data-analytics-1971985
User 2013-oracle-big-data-analytics-1971985User 2013-oracle-big-data-analytics-1971985
User 2013-oracle-big-data-analytics-1971985
 

More from Russell Jurney

More from Russell Jurney (9)

Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySpark
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domain
 
Social Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainSocial Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem Domain
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domain
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domain
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
 
Azkaban and Pig at LinkedIn
Azkaban and Pig at LinkedInAzkaban and Pig at LinkedIn
Azkaban and Pig at LinkedIn
 

Recently uploaded

%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...

Introduction to PySpark

  • 17. Data Syndrome: Agile Data Science 2.0 Vagrant Setup 17 Initializing our Vagrant Environment
 # Get the project
 git clone https://github.com/rjurney/Agile_Data_Code_2/
 # Setup and connect to our virtual machine
 vagrant up; vagrant ssh
 # Now, from within Vagrant
 cd Agile_Data_Code_2
 ./intro_download.sh
 # See Appendix A and install.sh for manual install
  • 18. Data Syndrome: Agile Data Science 2.0 18 Amazon EC2 Alternatively, Amazon Web Services provides a simple way to launch a prepared image for use in this exercise.
  • 19. Data Syndrome: Agile Data Science 2.0 EC2 Setup for Ubuntu Linux 19 Initializing our EC2 Environment
 # See ec2.sh, which uses aws/ec2_bootstrap.sh
 # To use, add: --user-data file://aws/ec2_bootstrap.sh
 # Get the project
 git clone git@github.com:rjurney/Agile_Data_Code_2.git
 # Setup AWS CLI tools
 pip install awscli
 # Edit and run r3.xlarge instance with your key
 ./ec2.sh
 # ssh to the machine
  • 20. Data Syndrome: Agile Data Science 2.0 EC2 Setup for Ubuntu Linux Initializing our EC2 Environment 20 # Contents of ec2.sh # Launch our instance, which ec2_bootstrap.sh will initialize
 aws ec2 run-instances \
 --image-id ami-4ae1fb5d \
 --key-name agile_data_science \
 --user-data file://aws/ec2_bootstrap.sh \
 --instance-type r3.xlarge \
 --ebs-optimized \
 --placement "AvailabilityZone=us-east-1d" \
 --block-device-mappings '{"DeviceName":"/dev/sda1","Ebs": {"DeleteOnTermination":false,"VolumeSize":1024}}' \
 --count 1

  • 21. Data Syndrome: Agile Data Science 2.0 EC2 Setup for Ubuntu Linux Initializing our EC2 Environment 21
 # Download the data
 cd Agile_Data_Code_2
 ./intro_download.sh
  • 22. Data Syndrome: Agile Data Science 2.0 EC2 Setup for Ubuntu Linux Initializing our EC2 Environment 22
 # Download the data
 cd Agile_Data_Code_2
 ./intro_download.sh
  • 23. Data Syndrome: Agile Data Science 2.0 Documentation Setup Opening the right web pages to answer your questions 23 http://spark.apache.org/docs/latest/api/python/pyspark.html http://spark.apache.org/docs/latest/api/python/pyspark.sql.html http://spark.apache.org/docs/latest/api/python/pyspark.ml.html
  • 24. Agile Data Science 2.0 24 Learning the basics of PySpark Basic PySpark
  • 25. Data Syndrome: Agile Data Science 2.0 Hello, World! How to load data and perform an operation on it in Spark 25 # See ch02/spark.py # Load the text file using the SparkContext
 csv_lines = sc.textFile("data/example.csv")
 
 # Map the data to split the lines into a list
 data = csv_lines.map(lambda line: line.split(","))
 
 # Collect the dataset into local RAM
 data.collect()
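These snippets assume a live SparkContext named sc (and, later, a SparkSession named spark), which the pyspark shell provides automatically. A minimal sketch for a standalone script, assuming a local master and an illustrative app name:
 # Create the SparkSession/SparkContext the rest of the examples rely on
 from pyspark.sql import SparkSession

 spark = SparkSession.builder \
     .master("local[*]") \
     .appName("intro_to_pyspark") \
     .getOrCreate()
 sc = spark.sparkContext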
  • 26. Data Syndrome: Agile Data Science 2.0 Creating Objects from CSV using a function How to create objects from CSV using a function instead of a lambda 26 # See ch02/groupby.py csv_lines = sc.textFile("data/example.csv")
 
 # Turn the CSV lines into objects
 def csv_to_record(line):
 parts = line.split(",")
 record = {
 "name": parts[0],
 "company": parts[1],
 "title": parts[2]
 }
 return record
 
 # Apply the function to every record
 records = csv_lines.map(csv_to_record)
 
 # Inspect the first item in the dataset
 records.first()
  • 27. Data Syndrome: Agile Data Science 2.0 Using a GroupBy to Count Jobs Count things using the groupBy API 27 # Group the records by the name of the person
 grouped_records = records.groupBy(lambda x: x["name"])
 
 # Show the first group
 grouped_records.first()
 
 # Count the groups
 job_counts = grouped_records.map(
 lambda x: {
 "name": x[0],
 "job_count": len(x[1])
 }
 )
 
 job_counts.first()
 
 job_counts.collect()
  • 28. Data Syndrome: Agile Data Science 2.0 Map vs FlatMap Understanding the difference between these two operators 28 # See ch02/flatmap.py csv_lines = sc.textFile("data/example.csv")
 
 # Compute a relation of words by line
 words_by_line = csv_lines \
 .map(lambda line: line.split(","))
 
 words_by_line.collect()
 
 # Compute a relation of words
 flattened_words = csv_lines \
 .map(lambda line: line.split(",")) \
 .flatMap(lambda x: x)
 
 flattened_words.collect()
  • 29. Data Syndrome: Agile Data Science 2.0 Map vs FlatMap Understanding the difference between these two operators 29 words_by_line.collect() [['Russell Jurney', 'Relato', 'CEO'], ['Florian Liebert', 'Mesosphere', 'CEO'], ['Don Brown', 'Rocana', 'CIO'], ['Steve Jobs', 'Apple', 'CEO'], ['Donald Trump', 'The Trump Organization', 'CEO'], ['Russell Jurney', 'Data Syndrome', 'Principal Consultant']] flattened_words.collect() ['Russell Jurney', 'Relato', 'CEO', 'Florian Liebert', 'Mesosphere', 'CEO', 'Don Brown', 'Rocana', 'CIO', 'Steve Jobs', 'Apple', 'CEO', 'Donald Trump', 'The Trump Organization', 'CEO', 'Russell Jurney', 'Data Syndrome', 'Principal Consultant']
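A side note: the two-step map/flatMap above is shown to contrast the operators, but the same flattened relation can come from a single flatMap. A sketch (the variable name is illustrative):
 # flatMap both splits each line and flattens the per-line lists
 flattened_in_one = csv_lines.flatMap(lambda line: line.split(","))
 flattened_in_one.count()  # total number of fields across all lines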
  • 30. Data Syndrome: Agile Data Science 2.0 Using DataFrames and Spark SQL to Count Jobs Converting an RDD to a DataFrame to use Spark SQL 30 # See ch02/sql.py csv_lines = sc.textFile("data/example.csv")
 
 from pyspark.sql import Row
 
 # Convert the CSV into a pyspark.sql.Row
 def csv_to_row(line):
 parts = line.split(",")
 row = Row(
 name=parts[0],
 company=parts[1],
 title=parts[2]
 )
 return row
 
 # Apply the function to get rows in an RDD
 rows = csv_lines.map(csv_to_row)
  • 31. Data Syndrome: Agile Data Science 2.0 Using DataFrames and Spark SQL to Count Jobs Converting an RDD to a DataFrame to use Spark SQL 31 # Convert to a pyspark.sql.DataFrame
 rows_df = rows.toDF()
 
 # Register the DataFrame for Spark SQL
 rows_df.registerTempTable("executives")
 
 # Generate a new DataFrame with SQL using the SparkSession
 job_counts = spark.sql(""" SELECT name, COUNT(*) AS total FROM executives GROUP BY name """)
 job_counts.show()
 
 # Go back to an RDD
 job_counts.rdd.collect()
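The same aggregation can be expressed with the DataFrame API instead of SQL; a sketch reusing the rows_df defined above:
 from pyspark.sql import functions as F

 # groupBy/agg is the DataFrame-API equivalent of the GROUP BY query
 job_counts_df = rows_df.groupBy("name").agg(F.count("*").alias("total"))
 job_counts_df.show()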
  • 32. Agile Data Science 2.0 32 Working with a more complex dataset Exploratory Data Analysis with Airline Data
  • 33. Data Syndrome: Agile Data Science 2.0 Loading a Parquet Columnar File Using the Apache Parquet format to load columnar data 33 # See ch02/load_on_time_performance.py # Load the parquet file containing flight delay records
 on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
 
 # Register the data for Spark SQL
 on_time_dataframe.registerTempTable("on_time_performance")
 
 # Check out the columns
 on_time_dataframe.columns
 
 # Check out some data
 on_time_dataframe \
 .select("FlightDate", "TailNum", "Origin", "Dest", "Carrier", "DepDelay", "ArrDelay") \
 .show()
  • 34. Data Syndrome: Agile Data Science 2.0 Sampling a DataFrame Sampling a DataFrame to get a better view of its data 34 # Trim the fields and keep the result
 trimmed_on_time = on_time_dataframe \
 .select(
 "FlightDate",
 "TailNum",
 "Origin",
 "Dest",
 "Carrier",
 "DepDelay",
 "ArrDelay"
 )

 # Sample 0.01% of the data and show
 trimmed_on_time.sample(False, 0.0001).show()
  • 35. Data Syndrome: Agile Data Science 2.0 Calculating a Histogram Computing the distribution of a column in a dataset 35 # See ch02/histogram.py
 # Load the parquet file containing flight delay records
 on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
 
 # Register the data for Spark SQL
 on_time_dataframe.registerTempTable("on_time_performance")
 
 # Compute a histogram of departure delays
 on_time_dataframe \
 .select("DepDelay") \
 .rdd \
 .flatMap(lambda x: x) \
 .histogram(10)
  • 36. Data Syndrome: Agile Data Science 2.0 Displaying a Histogram Using pyplot to display a histogram 36 import numpy as np
 import matplotlib.mlab as mlab
 import matplotlib.pyplot as plt
 
 # Function to plot a histogram using pyplot
 def create_hist(rdd_histogram_data):
 """Given an RDD.histogram, plot a pyplot histogram"""
 heights = np.array(rdd_histogram_data[1])
 full_bins = rdd_histogram_data[0]
 mid_point_bins = full_bins[:-1]
 widths = [abs(i - j) for i, j in zip(full_bins[:-1], full_bins[1:])]
 bar = plt.bar(mid_point_bins, heights, width=widths, color='b')
 return bar
 
 # Compute a histogram of departure delays
 departure_delay_histogram = on_time_dataframe \
 .select("DepDelay") \
 .rdd \
 .flatMap(lambda x: x) \
 .histogram([-60,-30,-15,-10,-5,0,5,10,15,30,60,90,120,180])
 
 create_hist(departure_delay_histogram)
  • 37. Data Syndrome: Agile Data Science 2.0 Displaying a Histogram Using pyplot to display a histogram 37
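One caveat, assuming you run these snippets as a plain Python script rather than in a notebook: pyplot will not render anything until you show or save the figure. A minimal sketch (the file name is illustrative):
 import matplotlib.pyplot as plt

 create_hist(departure_delay_histogram)
 plt.show()  # or: plt.savefig("departure_delays.png")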
  • 38. Data Syndrome: Agile Data Science 2.0 Counting Airplanes How many airplanes are in the US fleet in total? 38 # See ch05/assess_airplanes.py # Load the parquet file
 on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
 on_time_dataframe.registerTempTable("on_time_performance")
 
 # Dump the unneeded fields
 tail_numbers = on_time_dataframe.rdd.map(lambda x: x.TailNum)
 tail_numbers = tail_numbers.filter(lambda x: x != '')
 
 # distinct() gets us unique tail numbers
 unique_tail_numbers = tail_numbers.distinct()
 
 # now we need a count() of unique tail numbers
 airplane_count = unique_tail_numbers.count()
 print("Total airplanes: {}".format(airplane_count))
  • 39. Data Syndrome: Agile Data Science 2.0 Counting Total Flights by Month Preparing data for a chart 39 # See ch05/total_flights.py # Load the parquet file
 on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
 
 # Use SQL to look at the total flights by month across 2015
 on_time_dataframe.registerTempTable("on_time_dataframe")
 total_flights_by_month = spark.sql(
 """SELECT Month, Year, COUNT(*) AS total_flights
 FROM on_time_dataframe
 GROUP BY Year, Month
 ORDER BY Year, Month"""
 )
 
 # This map/asDict trick makes the rows print a little prettier. It is optional.
 flights_chart_data = total_flights_by_month.rdd.map(lambda row: row.asDict())
 flights_chart_data.collect()
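If you want a quick local chart of these counts, one option (assuming pandas and matplotlib are installed and the aggregate is small) is to pull the result into pandas; a sketch:
 # Collect the small aggregated result to the driver as a pandas DataFrame
 flights_pdf = total_flights_by_month.toPandas()
 flights_pdf.head()
 # flights_pdf.plot(x="Month", y="total_flights", kind="bar")  # optional quick chart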
  • 40. Data Syndrome: Agile Data Science 2.0 Preparing Complex Records for Storage Getting data ready for storage in a document or key/value store 40 # See ch05/extract_airplanes.py # Load the parquet file
 on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
 on_time_dataframe.registerTempTable("on_time_performance")
 
 # Filter down to the fields we need to identify and link to a flight
 flights = on_time_dataframe.rdd.map(lambda x: 
 (x.Carrier, x.FlightDate, x.FlightNum, x.Origin, x.Dest, x.TailNum)
 )
 
 # Group flights by tail number, sorted by date, then flight number, then origin/dest
 flights_per_airplane = flights \
 .map(lambda nameTuple: (nameTuple[5], [nameTuple[0:5]])) \
 .reduceByKey(lambda a, b: a + b) \
 .map(lambda tuple:
 {
 'TailNum': tuple[0],
 'Flights': sorted(tuple[1], key=lambda x: (x[1], x[2], x[3], x[4]))
 }
 )
 flights_per_airplane.first()
  • 41. Data Syndrome: Agile Data Science 2.0 Counting Flight Delays Analyzing and understanding why flights are late 41 # See ch07/explore_delays.py # Load the on-time parquet file
 on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
 on_time_dataframe.registerTempTable("on_time_performance")
 
 total_flights = on_time_dataframe.count()
 
 # Flights that were late leaving...
 late_departures = on_time_dataframe.filter(on_time_dataframe.DepDelayMinutes > 0)
 total_late_departures = late_departures.count()
 
 # Flights that were late arriving...
 late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrDelayMinutes > 0)
 total_late_arrivals = late_arrivals.count()

 # Get the percentage of flights that are late, rounded to 1 decimal place
 pct_late = round((total_late_arrivals / (total_flights * 1.0)) * 100, 1)
  • 42. Data Syndrome: Agile Data Science 2.0 Hero Flights How many flights made up for time in the air? Those that departed late and arrived on time? 42 # See ch07/explore_delays.py # Flights that left late but made up time to arrive on time...
 on_time_heros = on_time_dataframe.filter(
 (on_time_dataframe.DepDelayMinutes > 0)
 &
 (on_time_dataframe.ArrDelayMinutes <= 0)
 )
 total_on_time_heros = on_time_heros.count()
  • 43. Data Syndrome: Agile Data Science 2.0 Presenting Results Displaying the answers in plaintext we’ve just calculated 43 # See ch07/explore_delays.py print("Total flights: {:,}".format(total_flights))
 print("Late departures: {:,}".format(total_late_departures))
 print("Late arrivals: {:,}".format(total_late_arrivals))
 print("Recoveries: {:,}".format(total_on_time_heros))
 print("Percentage Late: {}%".format(pct_late))
  • 44. Data Syndrome: Agile Data Science 2.0 44 # See ch07/explore_delays.py # Get the average minutes late departing and arriving
 spark.sql("""
 SELECT
 ROUND(AVG(DepDelay),1) AS AvgDepDelay,
 ROUND(AVG(ArrDelay),1) AS AvgArrDelay
 FROM on_time_performance
 """
 ).show()

 Average Lateness Departing and Arriving: drilling down into flights and how late they are…
  • 45. Data Syndrome: Agile Data Science 2.0 Sampling Late Flights Getting to know our data by sampling records of interest 45 # Why are flights late? Let's look at some delayed flights and the delay causes
 late_flights = spark.sql("""
 SELECT
 ArrDelayMinutes,
 WeatherDelay,
 CarrierDelay,
 NASDelay,
 SecurityDelay,
 LateAircraftDelay
 FROM
 on_time_performance
 WHERE
 WeatherDelay IS NOT NULL
 OR
 CarrierDelay IS NOT NULL
 OR
 NASDelay IS NOT NULL
 OR
 SecurityDelay IS NOT NULL
 OR
 LateAircraftDelay IS NOT NULL
 ORDER BY
 FlightDate
 """)
 late_flights.sample(False, 0.01).show()
  • 46. Data Syndrome: Agile Data Science 2.0 Why are Flights Late? Analyzing and understanding why flights are late 46 # Calculate the percentage contribution to delay for each source
 total_delays = spark.sql("""
 SELECT
 ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_weather_delay,
 ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_carrier_delay,
 ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay,
 ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_security_delay,
 ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_late_aircraft_delay
 FROM on_time_performance
 """)
 total_delays.show()
  • 47. Data Syndrome: Agile Data Science 2.0 How Often are Weather Delayed Flights Late? Analyzing and understanding why flights are late 47 # Eyeball the first to define our buckets
 weather_delay_histogram = on_time_dataframe \
 .select("WeatherDelay") \
 .rdd \
 .flatMap(lambda x: x) \
 .histogram([1, 5, 10, 15, 30, 60, 120, 240, 480, 720, 24*60.0])

 print(weather_delay_histogram)
 create_hist(weather_delay_histogram)
  • 48. Data Syndrome: Agile Data Science 2.0 How Often are Weather Delayed Flights Late? Analyzing and understanding why flights are late 48
  • 49. Data Syndrome: Agile Data Science 2.0 Preparing Histogram Data for d3.js Analyzing and understanding why flights are late 49 # Transform the data into something easily consumed by d3
 def histogram_to_publishable(histogram):
 record = {'key': 1, 'data': []}
 for label, value in zip(histogram[0], histogram[1]):
 record['data'].append(
 {
 'label': label,
 'value': value
 }
 )
 return record

 # Recompute the weather histogram with a filter for on-time flights
 weather_delay_histogram = on_time_dataframe \
 .filter(
 (on_time_dataframe.WeatherDelay.isNotNull())
 &
 (on_time_dataframe.WeatherDelay > 0)
 ) \
 .select("WeatherDelay") \
 .rdd \
 .flatMap(lambda x: x) \
 .histogram([0, 15, 30, 60, 120, 240, 480, 720, 24*60.0])
 print(weather_delay_histogram)
 
 record = histogram_to_publishable(weather_delay_histogram)
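To hand the record to d3.js you would typically persist it as JSON; a minimal sketch (the output path is an illustrative assumption):
 import json

 # Write the d3-ready histogram record to disk
 with open("data/weather_delay_histogram.json", "w") as f:
     json.dump(record, f)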
  • 50. Agile Data Science 2.0 50 Building a classifier model Predictive Analytics Machine Learning
  • 51. Data Syndrome: Agile Data Science 2.0 Download Prepared Training Data Saving time by using a prepared dataset 51
 # Be in the root directory of the project
 cd Agile_Data_Code_2
 # Run the download script
 ch08/download_data.sh
  • 52. Data Syndrome: Agile Data Science 2.0 String Vectorization From properties of items to vector format 52
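As a concrete miniature of what string vectorization means (toy data, illustrative names): StringIndexer maps each distinct string to a numeric index, which VectorAssembler can later pack into a feature vector alongside numeric columns.
 from pyspark.ml.feature import StringIndexer

 # A tiny example DataFrame of carrier codes
 toy = spark.createDataFrame([("AA",), ("DL",), ("AA",), ("WN",)], ["Carrier"])

 # Fit an index of carrier strings and apply it
 indexed = StringIndexer(inputCol="Carrier", outputCol="Carrier_index") \
     .fit(toy).transform(toy)
 indexed.show()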
  • 53. Data Syndrome: Agile Data Science 2.0 53 The scikit-learn version was 166 lines; Spark MLlib is very powerful! ch08/train_spark_mllib_model.py 190 Line Model #!/usr/bin/env python
 
 import sys, os, re
 
 # Pass date and base path to main() from airflow
 def main(base_path):
 
 # Default to "."
 try: base_path
 except NameError: base_path = "."
 if not base_path:
 base_path = "."
 
 APP_NAME = "train_spark_mllib_model.py"
 
 # If there is no SparkSession, create the environment
 try:
 sc and spark
 except NameError as e:
 import findspark
 findspark.init()
 import pyspark
 import pyspark.sql
 
 sc = pyspark.SparkContext()
 spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
 
 #
 # {
 # "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
 # "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
 # "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
 # }
 #
 from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
 from pyspark.sql.types import StructType, StructField
 from pyspark.sql.functions import udf
 
 schema = StructType([
 StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
 StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
 StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
 StructField("Carrier", StringType(), True), # "Carrier":"WN"
 StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
 StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
 StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
 StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
 StructField("Dest", StringType(), True), # "Dest":"SAN"
 StructField("Distance", DoubleType(), True), # "Distance":368.0
 StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
 StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
 StructField("Origin", StringType(), True), # "Origin":"TUS"
 ])
 
 input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(
 base_path
 )
 features = spark.read.json(input_path, schema=schema)
 features.first()
 
 #
 # Check for nulls in features before using Spark ML
 #
 null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
 cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
 print(list(cols_with_nulls))
 
 #
 # Add a Route variable to replace FlightNum
 #
 from pyspark.sql.functions import lit, concat
 features_with_route = features.withColumn(
 'Route',
 concat(
 features.Origin,
 lit('-'),
 features.Dest
 )
 )
 features_with_route.show(6)
 
 #
 # Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
 #
 from pyspark.ml.feature import Bucketizer
 
 # Setup the Bucketizer
 splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
 arrival_bucketizer = Bucketizer(
 splits=splits,
 inputCol="ArrDelay",
 outputCol="ArrDelayBucket"
 )
 
 # Save the bucketizer
 arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
 arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
 
 # Apply the bucketizer
 ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
 ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
 
 #
 # Feature extraction tools in pyspark.ml.feature
 #
 from pyspark.ml.feature import StringIndexer, VectorAssembler
 
 # Turn category fields into indexes
 for column in ["Carrier", "Origin", "Dest", "Route"]:
 string_indexer = StringIndexer(
 inputCol=column,
 outputCol=column + "_index"
 )
 
 string_indexer_model = string_indexer.fit(ml_bucketized_features)
 ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
 
 # Drop the original column
 ml_bucketized_features = ml_bucketized_features.drop(column)
 
 # Save the pipeline model
 string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
 base_path,
 column
 )
 string_indexer_model.write().overwrite().save(string_indexer_output_path)
 
 # Combine continuous, numeric fields with indexes of nominal ones
 # ...into one feature vector
 numeric_columns = [
 "DepDelay", "Distance",
 "DayOfMonth", "DayOfWeek",
 "DayOfYear"]
 index_columns = ["Carrier_index", "Origin_index",
 "Dest_index", "Route_index"]
 vector_assembler = VectorAssembler(
 inputCols=numeric_columns + index_columns,
 outputCol="Features_vec"
 )
 final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
 
 # Save the numeric vector assembler
 vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
 vector_assembler.write().overwrite().save(vector_assembler_path)
 
 # Drop the index columns
 for column in index_columns:
 final_vectorized_features = final_vectorized_features.drop(column)
 
 # Inspect the finalized features
 final_vectorized_features.show()
 
 # Instantiate and fit random forest classifier on all the data
 from pyspark.ml.classification import RandomForestClassifier
 rfc = RandomForestClassifier(
 featuresCol="Features_vec",
 labelCol="ArrDelayBucket",
 predictionCol="Prediction",
 maxBins=4657,
 maxMemoryInMB=1024
 )
 model = rfc.fit(final_vectorized_features)
 
 # Save the new model over the old one
 model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
 base_path
 )
 model.write().overwrite().save(model_output_path)
 
 # Evaluate model using test data
 predictions = model.transform(final_vectorized_features)
 
 from pyspark.ml.evaluation import MulticlassClassificationEvaluator
 evaluator = MulticlassClassificationEvaluator(
 predictionCol="Prediction",
 labelCol="ArrDelayBucket",
 metricName="accuracy"
 )
 accuracy = evaluator.evaluate(predictions)
 print("Accuracy = {}".format(accuracy))
 
 # Check the distribution of predictions
 predictions.groupBy("Prediction").count().show()
 
 # Check a sample
 predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
 
 if __name__ == "__main__":
 main(sys.argv[1])

  • 54. Data Syndrome: Agile Data Science 2.0 Loading Our Training Data Loading our data as a DataFrame to use the Spark ML APIs 54 from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
 from pyspark.sql.types import StructType, StructField
 from pyspark.sql.functions import udf
 
 schema = StructType([
 StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
 StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
 StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
 StructField("Carrier", StringType(), True), # "Carrier":"WN"
 StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
 StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
 StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
 StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
 StructField("Dest", StringType(), True), # "Dest":"SAN"
 StructField("Distance", DoubleType(), True), # "Distance":368.0
 StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
 StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
 StructField("Origin", StringType(), True), # "Origin":"TUS"
 ])
 
 features = spark.read.json(
 "data/simple_flight_delay_features.jsonl.bz2",
 schema=schema
 )
 features.first()
  • 55. Data Syndrome: Agile Data Science 2.0 Checking the Data for Nulls Nulls will cause problems hereafter, so detect and address them first 55 #
 # Check for nulls in features before using Spark ML
 #
 null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
 cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
 print(list(cols_with_nulls))
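If the check does turn up nulls, one way to address them before vectorizing (a sketch, not part of the original script) is to drop the affected rows or fill them with a neutral value:
 # Drop any rows containing nulls...
 features = features.na.drop()
 # ...or instead fill numeric nulls with a neutral value
 # features = features.na.fill(0.0)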
  • 56. Data Syndrome: Agile Data Science 2.0 Adding a Feature - The Route Route is defined as origin airport code + “-“ + destination airport code 56 #
 # Add a Route variable to replace FlightNum
 #
 from pyspark.sql.functions import lit, concat
 
 features_with_route = features.withColumn(
 'Route',
 concat(
 features.Origin,
 lit('-'),
 features.Dest
 )
 )
 features_with_route.select("Origin", "Dest", "Route").show(5)
  • 57. Data Syndrome: Agile Data Science 2.0 Bucketizing ArrDelay into ArrDelayBucket We can’t classify a continuous variable, so we must bucketize it to make it nominal/categorical 57 #
 # Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay
 #
 from pyspark.ml.feature import Bucketizer
 
 splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
 bucketizer = Bucketizer(
 splits=splits,
 inputCol="ArrDelay",
 outputCol="ArrDelayBucket"
 )
 ml_bucketized_features = bucketizer.transform(features_with_route)
 
 # Check the buckets out
 ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
  • 58. Data Syndrome: Agile Data Science 2.0 Indexing String Columns into Numeric Columns Nominal/categorical/string columns need to be made numeric before we can vectorize them 58 #
 # Feature extraction tools in pyspark.ml.feature
 #
 from pyspark.ml.feature import StringIndexer, VectorAssembler
 
 # Turn category fields into categoric feature vectors, then drop intermediate fields
 for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
 "Origin", "Dest", "Route"]:
 string_indexer = StringIndexer(
 inputCol=column,
 outputCol=column + "_index"
 )
 ml_bucketized_features = string_indexer.fit(ml_bucketized_features) \
 .transform(ml_bucketized_features)
 
 # Check out the indexes
 ml_bucketized_features.show(6)
  • 59. Data Syndrome: Agile Data Science 2.0 Combining Numeric and Indexed Fields into One Vector Our classifier needs a single field, so we combine all our numeric fields into one feature vector 59 # Handle continuous, numeric fields by combining them into one feature vector
 numeric_columns = ["DepDelay", "Distance"]
 index_columns = ["Carrier_index", "DayOfMonth_index",
 "DayOfWeek_index", "DayOfYear_index", "Origin_index",
 "Origin_index", "Dest_index", "Route_index"]
 vector_assembler = VectorAssembler(
 inputCols=numeric_columns + index_columns,
 outputCol="Features_vec"
 )
 final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
 
 # Drop the index columns
 for column in index_columns:
 final_vectorized_features = final_vectorized_features.drop(column)
 
 # Check out the features
 final_vectorized_features.show()
  • 60. Data Syndrome: Agile Data Science 2.0 Splitting our Data in a Test/Train Split We need to split our data to evaluate the performance of our classifier 60 #
 # Cross validate, train and evaluate classifier
 #
 
 # Test/train split
 training_data, test_data = final_vectorized_features.randomSplit([0.7, 0.3])
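A small optional refinement: passing a seed to randomSplit makes the split, and therefore the evaluation, reproducible between runs (the seed value here is arbitrary).
 # Reproducible test/train split
 training_data, test_data = final_vectorized_features.randomSplit([0.7, 0.3], seed=1337)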
  • 61. Data Syndrome: Agile Data Science 2.0 Training Our Model This is the magic in machine learning, and it is only a couple of lines of code 61 # Instantiate and fit random forest classifier
 from pyspark.ml.classification import RandomForestClassifier
 rfc = RandomForestClassifier(
 featuresCol="Features_vec",
 labelCol="ArrDelayBucket",
 maxBins=4657,
 maxMemoryInMB=1024
 )
 model = rfc.fit(training_data)
  • 62. Data Syndrome: Agile Data Science 2.0 Evaluating Our Model Using the test/train split to evaluate our model for accuracy 62 # Evaluate model using test data
 predictions = model.transform(test_data)
 
 from pyspark.ml.evaluation import MulticlassClassificationEvaluator
 evaluator = MulticlassClassificationEvaluator(labelCol="ArrDelayBucket", metricName="accuracy")
 accuracy = evaluator.evaluate(predictions)
 print("Accuracy = {}".format(accuracy))
  • 63. Data Syndrome: Agile Data Science 2.0 Sampling Our Predictions Making sure they pass the sniff check 63 # Check a sample
 predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
  • 64. Data Syndrome: Agile Data Science 2.0 Experiment Setup Necessary to improve model 64
  • 65. Data Syndrome: Agile Data Science 2.0 65 155 additional lines to set up an experiment and add 3 new features to improve the model ch09/improve_spark_mllib_model.py 345 L.O.C. #!/usr/bin/env python
 
 import sys, os, re
 import json
 import datetime, iso8601
 from tabulate import tabulate
 
 # Pass date and base path to main() from airflow
 def main(base_path):
 APP_NAME = "train_spark_mllib_model.py"
 
 # If there is no SparkSession, create the environment
 try:
 sc and spark
 except NameError as e:
 import findspark
 findspark.init()
 import pyspark
 import pyspark.sql
 
 sc = pyspark.SparkContext()
 spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
 
 #
 # {
 # "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
 # "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
 # "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
 # }
 #
 from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
 from pyspark.sql.types import StructType, StructField
 from pyspark.sql.functions import udf
 
 schema = StructType([
 StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
 StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
 StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
 StructField("Carrier", StringType(), True), # "Carrier":"WN"
 StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
 StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
 StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
 StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
 StructField("Dest", StringType(), True), # "Dest":"SAN"
 StructField("Distance", DoubleType(), True), # "Distance":368.0
 StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
 StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
 StructField("Origin", StringType(), True), # "Origin":"TUS"
 ])
 
 input_path = "{}/data/simple_flight_delay_features.json".format(
 base_path
 )
 features = spark.read.json(input_path, schema=schema)
 features.first()
 
 #
 # Add a Route variable to replace FlightNum
 #
 from pyspark.sql.functions import lit, concat
 features_with_route = features.withColumn(
 'Route',
 concat(
 features.Origin,
 lit('-'),
 features.Dest
 )
 )
 features_with_route.show(6)
 
 #
 # Add the hour of day of scheduled arrival/departure
 #
 from pyspark.sql.functions import hour
 features_with_hour = features_with_route.withColumn(
 "CRSDepHourOfDay",
 hour(features.CRSDepTime)
 )
 features_with_hour = features_with_hour.withColumn(
 "CRSArrHourOfDay",
 hour(features.CRSArrTime)
 )
 features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show()
 
 #
 # Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
 #
 from pyspark.ml.feature import Bucketizer
 
 # Setup the Bucketizer
 splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
 arrival_bucketizer = Bucketizer(
 splits=splits,
 inputCol="ArrDelay",
 outputCol="ArrDelayBucket"
 )
 
 # Save the model
 arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
 arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
 
 # Apply the model
 ml_bucketized_features = arrival_bucketizer.transform(features_with_hour)
 ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
 
 #
 # Feature extraction tools in pyspark.ml.feature
 #
 from pyspark.ml.feature import StringIndexer, VectorAssembler
 
 # Turn category fields into indexes
 for column in ["Carrier", "Origin", "Dest", "Route"]:
 string_indexer = StringIndexer(
 inputCol=column,
 outputCol=column + "_index"
 )
 
 string_indexer_model = string_indexer.fit(ml_bucketized_features)
 ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
 # Save the pipeline model
 string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format(
 base_path,
 column
 )
 string_indexer_model.write().overwrite().save(string_indexer_output_path)
 
 # Combine continuous, numeric fields with indexes of nominal ones
 # ...into one feature vector
 numeric_columns = [
 "DepDelay", "Distance",
 "DayOfMonth", "DayOfWeek",
 "DayOfYear", "CRSDepHourOfDay",
 "CRSArrHourOfDay"]
 index_columns = ["Carrier_index", "Origin_index",
 "Dest_index", "Route_index"]
 vector_assembler = VectorAssembler(
 inputCols=numeric_columns + index_columns,
 outputCol="Features_vec"
 )
 final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
 
 # Save the numeric vector assembler
 vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path)
 vector_assembler.write().overwrite().save(vector_assembler_path)
 
 # Drop the index columns
 for column in index_columns:
 final_vectorized_features = final_vectorized_features.drop(column)
 
 # Inspect the finalized features
 final_vectorized_features.show()
 
 #
 # Cross validate, train and evaluate classifier: loop 5 times for 4 metrics
 #
 
 from collections import defaultdict
 scores = defaultdict(list)
 feature_importances = defaultdict(list)
 metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
 split_count = 3
 
 for i in range(1, split_count + 1):
 print("nRun {} out of {} of test/train splits in cross validation...".format(
 i,
 split_count,
 )
 )
 
 # Test/train split
 training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])
 
 # Instantiate and fit random forest classifier on all the data
 from pyspark.ml.classification import RandomForestClassifier
 rfc = RandomForestClassifier(
 featuresCol="Features_vec",
 labelCol="ArrDelayBucket",
 predictionCol="Prediction",
 maxBins=4657,
 )
 model = rfc.fit(training_data)
 
 # Save the new model over the old one
 model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(
 base_path
 )
 model.write().overwrite().save(model_output_path)
 
 # Evaluate model using test data
 predictions = model.transform(test_data)
 
 # Evaluate this split's results for each metric
 from pyspark.ml.evaluation import MulticlassClassificationEvaluator
 for metric_name in metric_names:
 evaluator = MulticlassClassificationEvaluator(
 labelCol="ArrDelayBucket",
 predictionCol="Prediction",
 metricName=metric_name
 )
 score = evaluator.evaluate(predictions)
 
 scores[metric_name].append(score)
 print("{} = {}".format(metric_name, score))
 
 #
 # Collect feature importances
 #
 feature_names = vector_assembler.getInputCols()
 feature_importance_list = model.featureImportances
 for feature_name, feature_importance in zip(feature_names, feature_importance_list):
 feature_importances[feature_name].append(feature_importance)
 
 #
 # Evaluate average and STD of each metric and print a table
 #
 import numpy as np
 score_averages = defaultdict(float)
 
 # Compute the table data
 average_stds = [] # ha
 for metric_name in metric_names:
 metric_scores = scores[metric_name]
 
 average_accuracy = sum(metric_scores) / len(metric_scores)
 score_averages[metric_name] = average_accuracy
 
 std_accuracy = np.std(metric_scores)
 
 average_stds.append((metric_name, average_accuracy, std_accuracy))
 
 # Print the table
 print("nExperiment Log")
 print("--------------")
 print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))
 
 #
 # Persist the score to a score log that exists between runs
 #
 import pickle # Load the score log or initialize an empty one
 try:
 score_log_filename = "{}/models/score_log.pickle".format(base_path)
 score_log = pickle.load(open(score_log_filename, "rb"))
 if not isinstance(score_log, list):
 score_log = []
 except IOError:
 score_log = []
 
 # Compute the existing score log entry
 score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}
 
 # Compute and display the change in score for each metric
 try:
 last_log = score_log[-1]
 except (IndexError, TypeError, AttributeError):
 last_log = score_log_entry
 
 experiment_report = []
 for metric_name in metric_names:
 run_delta = score_log_entry[metric_name] - last_log[metric_name]
 experiment_report.append((metric_name, run_delta))
 
 print("nExperiment Report")
 print("-----------------")
 print(tabulate(experiment_report, headers=["Metric", "Score"]))
 
 # Append the existing average scores to the log
 score_log.append(score_log_entry)
 
 # Persist the log for next run
 pickle.dump(score_log, open(score_log_filename, "wb"))
 
 #
 # Analyze and report feature importance changes
 #
 
 # Compute averages for each feature
 feature_importance_entry = defaultdict(float)
 for feature_name, value_list in feature_importances.items():
 average_importance = sum(value_list) / len(value_list)
 feature_importance_entry[feature_name] = average_importance
 
 # Sort the feature importances in descending order and print
 import operator
 sorted_feature_importances = sorted(
 feature_importance_entry.items(),
 key=operator.itemgetter(1),
 reverse=True
 )
 
 print("nFeature Importances")
 print("-------------------")
 print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))
 
 #
 # Compare this run's feature importances with the previous run's
 #
 
 # Load the feature importance log or initialize an empty one
 try:
 feature_log_filename = "{}/models/feature_log.pickle".format(base_path)
 feature_log = pickle.load(open(feature_log_filename, "rb"))
 if not isinstance(feature_log, list):
 feature_log = []
 except IOError:
 feature_log = []
 
 # Compute and display the change in score for each feature
 try:
 last_feature_log = feature_log[-1]
 except (IndexError, TypeError, AttributeError):
 last_feature_log = defaultdict(float)
 for feature_name, importance in feature_importance_entry.items():
 last_feature_log[feature_name] = importance
 
 # Compute the deltas
 feature_deltas = {}
 for feature_name in feature_importances.keys():
 run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]
 feature_deltas[feature_name] = run_delta
 
 # Sort feature deltas, biggest change first
 import operator
 sorted_feature_deltas = sorted(
 feature_deltas.items(),
 key=operator.itemgetter(1),
 reverse=True
 )
 
 # Display sorted feature deltas
 print("nFeature Importance Delta Report")
 print("-------------------------------")
 print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))
 
 # Append the existing average deltas to the log
 feature_log.append(feature_importance_entry)
 
 # Persist the log for next run
 pickle.dump(feature_log, open(feature_log_filename, "wb"))
 
 if __name__ == "__main__":
 main(sys.argv[1])

  • 66. Data Syndrome Russell Jurney Principal Consultant Email : rjurney@datasyndrome.com Web : datasyndrome.com Data Syndrome, LLC