SlideShare a Scribd company logo
1 of 119
Download to read offline
Building Full Stack Data Analytics Applications with Kafka and Spark
Agile Data Science 2.0
http://bit.ly/agile_data_slides_6
Agile Data Science 2.0
Russell Jurney
2
Data Engineer
Data Scientist
Visualization Software Engineer
85%
85%
85%
Writer
85%
Teacher
50%
Russell Jurney is a veteran data
scientist and thought leader. He
coined the term Agile Data Science in
the book of that name from O’Reilly
in 2012, which outlines the first agile
development methodology for data
science. Russell has constructed
numerous full-stack analytics
products over the past ten years and
now works with clients helping them
extract value from their data assets.
Russell Jurney
Skill
Principal Consultant at Data Syndrome
Russell Jurney
Data Syndrome, LLC
Email : russell.jurney@gmail.com
Web : datasyndrome.com
Principal Consultant
Lorem Ipsum dolor siamet suame this placeholder for text can simply
random text. It has roots in a piece of classical. variazioni deiwords which
whichhtly. ven on your zuniga merida della is not denis.
Product Consulting
We build analytics products and systems
consisting of big data viz, predictions,
recommendations, reports and search.
Corporate Training
We offer training courses for data
scientists and engineers and data
science teams,
Video Training
We offer video training courses that rapidly
acclimate you with a technology and
technique.
Agile Data Science 2.0 4
What makes data science “agile data science”?
Theory
Agile Data Science 2.0
Agile Manifesto
5
Four principles for software engineering
Agile Data Science 2.0 6
Yes. Building applications is a fundamental skill for today’s data scientist.
Data Products or Data Science?
Agile Data Science 2.0 7
Data Products
or
Data Science?
Agile Data Science 2.0 8
If someone else has to start over and rebuild it, it ain’t agile.
Big Data or Data Science?
Agile Data Science 2.0 9
Goal of
Methodology
The goal of agile data science in <140 characters: to
document and guide exploratory data analysis to
discover and follow the critical path to a compelling
product.
Agile Data Science 2.0
Agile Data Science Manifesto
10
Seven Principles for Agile Data Science
Discover and pursue the critical path to a killer product
Iterate, iterate, iterate: tables, charts, reports, predictions1.
Integrate the tyrannical opinion of data in product management4.
Get Meta. Describe the process, not just the end-state7.
Ship intermediate output. Even failed experiments have output2.
Climb up and down the data-value pyramid as we work5.
Prototype experiments over implementing tasks3.
6.
Agile Data Science 2.0
Iterate, iterate, iterate: tables, charts, reports, predictions
11
Seven Principles for Agile Data Science
Agile Data Science 2.0
Ship intermediate output. Even failed experiments have output.
12
Seven Principles for Agile Data Science
—>
Agile Data Science 2.0
Prototype experiments over implementing tasks
13
Seven Principles for Agile Data Science
Agile Data Science 2.0
Climb up and down the data-value pyramid
14
Seven Principles for Agile Data Science
People will pay more for the things towards the top, but you need the things
on the bottom to have the things above. They are foundational. See:
Maslow’s Theory of Needs.
Data Value Pyramid
Agile Data Science 2.0
Discover and pursue the critical path to a killer product
15
Seven Principles for Agile Data Science
In analytics, the end-goal moves or is complex in nature, so we model as a
network of tasks rather than as a strictly linear process.
Critical Path
Agile Data Science 2.0
Get Meta. Describe the process, not just the end state
16
Seven Principles for Agile Data Science
Agile Data Science 2.0
Doing Agile Wrong
17
iterating between sprints, not within them is a bad idea.
Agile Data Science 2.0
Data Science Teams
18
The roles in a data science team are many, necessitating generalists
Agile Data Science 2.0
Data Science Releases
19
Snowballing releases towards a broader audience
Agile Data Science 2.0 20
Decorate your environment by visualizing your dataset!
Collage
Agile Data Science 2.0 21
Things we use to build the apps
Tools
Agile Data Science 2.0
Agile Data Science 2.0 Stack
22
Apache Spark Apache Kafka MongoDB
Batch and Realtime
Realtime Queue Document Store
Flask
Simple Web App
Example of a high productivity stack for “big” data applications
ElasticSearch
Search
d3.js
Visualization
Agile Data Science 2.0
Flow of Data Processing
23
Tools and processes in collecting, refining, publishing and decorating data
{“hello”: “world”}
Agile Data Science 2.0
Architecture of Agile Data Science
24
How the system comes together
Data Syndrome: Agile Data Science 2.0
Apache Spark Ecosystem
25
HDFS, Amazon S3, Spark, Spark SQL, Spark MLlib, Spark Streaming
/
Agile Data Science 2.0 26
SQL or dataflow programming?
Programming Models
Agile Data Science 2.0 27
Describing what you want and letting the planner figure out how
SQL
SELECT associations2.object_id,
associations2.term_id, associations2.cat_ID,
associations2.term_taxonomy_id

FROM (SELECT objects_tags.object_id,
objects_tags.term_id, wp_cb_tags2cats.cat_ID,
categories.term_taxonomy_id

FROM (SELECT
wp_term_relationships.object_id,
wp_term_taxonomy.term_id,
wp_term_taxonomy.term_taxonomy_id

FROM wp_term_relationships

LEFT JOIN wp_term_taxonomy ON
wp_term_relationships.term_taxonomy_id =
wp_term_taxonomy.term_taxonomy_id

ORDER BY object_id ASC, term_id ASC) 

AS objects_tags

LEFT JOIN wp_cb_tags2cats ON
objects_tags.term_id = wp_cb_tags2cats.tag_ID

LEFT JOIN (SELECT
wp_term_relationships.object_id,
wp_term_taxonomy.term_id as cat_ID,
wp_term_taxonomy.term_taxonomy_id

FROM wp_term_relationships

LEFT JOIN wp_term_taxonomy ON
wp_term_relationships.term_taxonomy_id =
wp_term_taxonomy.term_taxonomy_id

WHERE wp_term_taxonomy.taxonomy =
'category'

GROUP BY object_id, cat_ID,
term_taxonomy_id

ORDER BY object_id, cat_ID,
term_taxonomy_id) 

AS categories on wp_cb_tags2cats.cat_ID
= categories.term_id

WHERE objects_tags.term_id =
wp_cb_tags2cats.tag_ID

GROUP BY object_id, term_id, cat_ID,
term_taxonomy_id

ORDER BY object_id ASC, term_id ASC, cat_ID
ASC) 

AS associations2

LEFT JOIN categories ON associations2.object_id
= categories.object_id

WHERE associations2.cat_ID <> categories.cat_ID

GROUP BY object_id, term_id, cat_ID,
term_taxonomy_id

ORDER BY object_id, term_id, cat_ID,
term_taxonomy_id
Agile Data Science 2.0 28
Flowing data through operations to effect change
Dataflow Programming
Agile Data Science 2.0 29
The best of both worlds!
SQL AND
Dataflow
Programming
# Flights that were late arriving...

late_arrivals =
on_time_dataframe.filter(on_time_dataframe.ArrD
elayMinutes > 0)

total_late_arrivals = late_arrivals.count()



# Flights that left late but made up time to
arrive on time...

on_time_heros = on_time_dataframe.filter(

(on_time_dataframe.DepDelayMinutes > 0)

&

(on_time_dataframe.ArrDelayMinutes <= 0)

)

total_on_time_heros = on_time_heros.count()



# Get the percentage of flights that are late,
rounded to 1 decimal place

pct_late = round((total_late_arrivals /
(total_flights * 1.0)) * 100, 1)



print("Total flights:
{:,}".format(total_flights))

print("Late departures:
{:,}".format(total_late_departures))

print("Late arrivals:
{:,}".format(total_late_arrivals))

print("Recoveries:
{:,}".format(total_on_time_heros))

print("Percentage Late: {}%".format(pct_late))



# Why are flights late? Lets look at some
delayed flights and the delay causes

late_flights = spark.sql("""

SELECT

ArrDelayMinutes,

WeatherDelay,

CarrierDelay,

NASDelay,

SecurityDelay,

LateAircraftDelay

FROM

on_time_performance

WHERE

WeatherDelay IS NOT NULL

OR

CarrierDelay IS NOT NULL

OR

NASDelay IS NOT NULL

OR

SecurityDelay IS NOT NULL

OR

LateAircraftDelay IS NOT NULL

ORDER BY

FlightDate

""")

late_flights.sample(False, 0.01).show()
# Calculate the percentage contribution to delay for each source

total_delays = spark.sql("""

SELECT

ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS
pct_weather_delay,

ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS
pct_carrier_delay,

ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay,

ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS
pct_security_delay,

ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS
pct_late_aircraft_delay

FROM on_time_performance

""")

total_delays.show()



# Generate a histogram of the weather and carrier delays

weather_delay_histogram = on_time_dataframe

.select("WeatherDelay")

.rdd

.flatMap(lambda x: x)

.histogram(10)



print("{}n{}".format(weather_delay_histogram[0],
weather_delay_histogram[1]))



# Eyeball the first to define our buckets

weather_delay_histogram = on_time_dataframe

.select("WeatherDelay")

.rdd

.flatMap(lambda x: x)

.histogram([1, 15, 30, 60, 120, 240, 480, 720, 24*60.0])

print(weather_delay_histogram)
# Transform the data into something easily consumed by d3

record = {'key': 1, 'data': []}

for label, count in zip(weather_delay_histogram[0],
weather_delay_histogram[1]):

record['data'].append(

{

'label': label,

'count': count

}

)



# Save to Mongo directly, since this is a Tuple not a dataframe or RDD

from pymongo import MongoClient

client = MongoClient()

client.relato.weather_delay_histogram.insert_one(record)
Agile Data Science 2.0 30
Batch processing still dominant for charts, reports, search indexes and
training predictive models
Apache Airflow
for
Batch Scheduling
Agile Data Science 2.0 31
FAA on-time performance data
Data
Data Syndrome: Agile Data Science 2.0
Collect and Serialize Events in JSON
I never regret using JSON
32
Data Syndrome: Agile Data Science 2.0
FAA On-Time Performance Records
95% of commercial flights
33http://www.transtats.bts.gov/Fields.asp?table_id=236
Data Syndrome: Agile Data Science 2.0
FAA On-Time Performance Records
95% of commercial flights
34
"Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum",
"OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips",
"OriginStateName","OriginWac","DestAirportID","DestAirportSeqID","DestCityMarketID","Dest","DestCityName","DestState",
"DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups",
"DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15",
"ArrivalDelayGroups","ArrTimeBlk","Cancelled","CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime",
"Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay",
"FirstDepTime","TotalAddGTime","LongestAddGTime","DivAirportLandings","DivReachedDest","DivActualElapsedTime","DivArrDelay",
"DivDistance","Div1Airport","Div1AirportID","Div1AirportSeqID","Div1WheelsOn","Div1TotalGTime","Div1LongestGTime",
"Div1WheelsOff","Div1TailNum","Div2Airport","Div2AirportID","Div2AirportSeqID","Div2WheelsOn","Div2TotalGTime",
"Div2LongestGTime","Div2WheelsOff","Div2TailNum","Div3Airport","Div3AirportID","Div3AirportSeqID","Div3WheelsOn",
"Div3TotalGTime","Div3LongestGTime","Div3WheelsOff","Div3TailNum","Div4Airport","Div4AirportID","Div4AirportSeqID",
"Div4WheelsOn","Div4TotalGTime","Div4LongestGTime","Div4WheelsOff","Div4TailNum","Div5Airport","Div5AirportID",
"Div5AirportSeqID","Div5WheelsOn","Div5TotalGTime","Div5LongestGTime","Div5WheelsOff","Div5TailNum"
Data Syndrome: Agile Data Science 2.0
openflights.org Database
Airports, Airlines, Routes
35
Data Syndrome: Agile Data Science 2.0
Scraping the FAA Registry
Airplane Data by Tail Number
36
Data Syndrome: Agile Data Science 2.0
Wikipedia Airlines Entries
Descriptions of Airlines
37
Data Syndrome: Agile Data Science 2.0
National Centers for Environmental Information
Historical Weather Observations
38
Agile Data Science 2.0 39
How to store it and how to fetch it…
Data Structure
and
Access Patterns
Data Syndrome: Agile Data Science 2.0
First Order Form
Storing groups of documents in key/value stores
40
Key/Value and some Document Stores
Partition Tolerance
No Single Master
Data Syndrome: Agile Data Science 2.0
First Order Form
Storing groups of documents in key/value stores
41
Data Syndrome: Agile Data Science 2.0
Second Order Form
Using column stores and compound keys to enable range scans
42
Range Scans for Fun and Profit
One or more Masters
Data Syndrome: Agile Data Science 2.0
Second Order Form
Using column stores and compound keys to enable range scans
43
# Achieve everything many access patterns via key composition:
Requirement: SELECT FLIGHTS WHERE FlightDate < X AND FlightDate> Y
Compose Key: TailNum + FlightDate, ex. N16954-2015-12-16
Range Scan: FROM: N16954-2015-12-01
TO: N16954-2015-01-01
Data Syndrome: Agile Data Science 2.0
Third Order Form
Using databases with B-Trees to query and analyze raw records
44
SELECT COUNT(*), SUM(*)
…
FROM …
GROUP BY …
WHERE …
Agile Data Science 2.0 45
Working our way up the data value pyramid
Climbing the Stack
Agile Data Science 2.0 46
Starting by “plumbing” the system from end to end
Plumbing
Data Syndrome: Agile Data Science 2.0
Publishing Flight Records
Plumbing our master records through to the web
47
Data Syndrome: Agile Data Science 2.0
Publishing Flight Records to MongoDB
Plumbing our master records through to the web
48
import pymongo

import pymongo_spark

# Important: activate pymongo_spark.

pymongo_spark.activate()

# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

# Convert to RDD of dicts and save to MongoDB

as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
as_dict.saveToMongoDB(‘mongodb://localhost:27017/agile_data_science.on_time_performance')
Data Syndrome: Agile Data Science 2.0
Publishing Flight Records to ElasticSearch
Plumbing our master records through to the web
49
# Load the parquet file

on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')



# Save the DataFrame to Elasticsearch

on_time_dataframe.write.format("org.elasticsearch.spark.sql")

.option("es.resource","agile_data_science/on_time_performance")

.option("es.batch.size.entries","100")

.mode("overwrite")

.save()
Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
50
from flask import Flask, render_template, request

from pymongo import MongoClient

from bson import json_util



# Set up Flask and Mongo

app = Flask(__name__)

client = MongoClient()



# Controller: Fetch an email and display it

@app.route("/on_time_performance")

def on_time_performance():



carrier = request.args.get('Carrier')

flight_date = request.args.get('FlightDate')

flight_num = request.args.get('FlightNum')



flight = client.agile_data_science.on_time_performance.find_one({

'Carrier': carrier,

'FlightDate': flight_date,

'FlightNum': int(flight_num)

})



return json_util.dumps(flight)



if __name__ == "__main__":

app.run(debug=True)
Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
51
Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
52
Agile Data Science 2.0 53
Getting to know your data
Tables and Charts
Data Syndrome: Agile Data Science 2.0
Tables in PySpark
Back end development in PySpark
54
# Load the parquet file

on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')



# Use SQL to look at the total flights by month across 2015

on_time_dataframe.registerTempTable("on_time_dataframe")

total_flights_by_month = spark.sql(

"""SELECT Month, Year, COUNT(*) AS total_flights

FROM on_time_dataframe

GROUP BY Year, Month

ORDER BY Year, Month"""

)



# This map/asDict trick makes the rows print a little prettier. It is optional.

flights_chart_data = total_flights_by_month.map(lambda row: row.asDict())

flights_chart_data.collect()



# Save chart to MongoDB

import pymongo_spark

pymongo_spark.activate()

flights_chart_data.saveToMongoDB(

'mongodb://localhost:27017/agile_data_science.flights_by_month'

)
Data Syndrome: Agile Data Science 2.0
Tables in Flask and Jinja2
Front end development in Flask: controller and template
55
# Controller: Fetch a flight table

@app.route("/total_flights")

def total_flights():

total_flights = client.agile_data_science.flights_by_month.find({}, 

sort = [

('Year', 1),

('Month', 1)

])

return render_template('total_flights.html', total_flights=total_flights)
{% extends "layout.html" %}

{% block body %}

<div>

<p class="lead">Total Flights by Month</p>

<table class="table table-condensed table-striped" style="width: 200px;">

<thead>

<th>Month</th>

<th>Total Flights</th>

</thead>

<tbody>

{% for month in total_flights %}

<tr>

<td>{{month.Month}}</td>

<td>{{month.total_flights}}</td>

</tr>

{% endfor %}

</tbody>

</table>

</div>

{% endblock %}
Data Syndrome: Agile Data Science 2.0
Tables
Visualizing data
56
Data Syndrome: Agile Data Science 2.0
Charts in Flask and d3.js
Visualizing data with JSON and d3.js
57
# Serve the chart's data via an asynchronous request (formerly known as 'AJAX')

@app.route("/total_flights.json")

def total_flights_json():

total_flights = client.agile_data_science.flights_by_month.find({}, 

sort = [

('Year', 1),

('Month', 1)

])

return json_util.dumps(total_flights, ensure_ascii=False)
var width = 960,

height = 350;



var y = d3.scale.linear()

.range([height, 0]);

// We define the domain once we get our data in d3.json, below



var chart = d3.select(".chart")

.attr("width", width)

.attr("height", height);



d3.json("/total_flights.json", function(data) {



var defaultColor = 'steelblue';

var modeColor = '#4CA9F5';



var maxY = d3.max(data, function(d) { return d.total_flights; });

y.domain([0, maxY]);



var varColor = function(d, i) {

if(d['total_flights'] == maxY) { return modeColor; }

else { return defaultColor; }

}

var barWidth = width / data.length;

var bar = chart.selectAll("g")

.data(data)

.enter()

.append("g")

.attr("transform", function(d, i) { return "translate(" + i * barWidth + ",0)"; });



bar.append("rect")

.attr("y", function(d) { return y(d.total_flights); })

.attr("height", function(d) { return height - y(d.total_flights); })

.attr("width", barWidth - 1)

.style("fill", varColor);



bar.append("text")

.attr("x", barWidth / 2)

.attr("y", function(d) { return y(d.total_flights) + 3; })

.attr("dy", ".75em")

.text(function(d) { return d.total_flights; });

});
Data Syndrome: Agile Data Science 2.0
Charts
Visualizing data
58
Data Syndrome: Agile Data Science 2.0
Charts in Flask and d3.js
Visualizing data with JSON and d3.js
59
# Serve the chart's data via an asynchronous request (formerly known as 'AJAX')

@app.route("/total_flights.json")

def total_flights_json():

total_flights = client.agile_data_science.flights_by_month.find({}, 

sort = [

('Year', 1),

('Month', 1)

])

return json_util.dumps(total_flights, ensure_ascii=False)
var width = 960,
height = 350;
var y = d3.scale.linear()
.range([height, 0]);
// We define the domain once we get our data in d3.json, below
var chart = d3.select(".chart")
.attr("width", width)
.attr("height", height);
d3.json("/total_flights.json", function(data) {
y.domain([0, d3.max(data, function(d) { return d.total_flights; })]);
var barWidth = width / data.length;
var bar = chart.selectAll("g")
.data(data)
.enter()
.append("g")
.attr("transform", function(d, i) {
return "translate(" + i * barWidth + ",0)"; });
bar.append("rect")
.attr("y", function(d) { return y(d.total_flights); })
.attr("height", function(d) { return height - y(d.total_flights); })
.attr("width", barWidth - 1);
bar.append("text")
.attr("x", barWidth / 2)
.attr("y", function(d) { return y(d.total_flights) + 3; })
.attr("dy", ".75em")
.text(function(d) { return d.total_flights; });
});
Data Syndrome: Agile Data Science 2.0
Charts
Visualizing data
60
Data Syndrome: Agile Data Science 2.0
Charts in Flask and d3.js
Visualizing data with JSON and d3.js
61
# Serve the chart's data via an asynchronous request (formerly known as 'AJAX')

@app.route("/total_flights.json")

def total_flights_json():

total_flights = client.agile_data_science.flights_by_month.find({}, 

sort = [

('Year', 1),

('Month', 1)

])

return json_util.dumps(total_flights, ensure_ascii=False)
var width = 960,

height = 350;



var y = d3.scale.linear()

.range([height, 0]);

// We define the domain once we get our data in d3.json, below



var chart = d3.select(".chart")

.attr("width", width)

.attr("height", height);



d3.json("/total_flights.json", function(data) {



var defaultColor = 'steelblue';

var modeColor = '#4CA9F5';



var maxY = d3.max(data, function(d) { return d.total_flights; });

y.domain([0, maxY]);



var varColor = function(d, i) {

if(d['total_flights'] == maxY) { return modeColor; }

else { return defaultColor; }

}

var barWidth = width / data.length;

var bar = chart.selectAll("g")

.data(data)

.enter()

.append("g")

.attr("transform", function(d, i) { return "translate(" + i * barWidth + ",0)"; });



bar.append("rect")

.attr("y", function(d) { return y(d.total_flights); })

.attr("height", function(d) { return height - y(d.total_flights); })

.attr("width", barWidth - 1)

.style("fill", varColor);



bar.append("text")

.attr("x", barWidth / 2)

.attr("y", function(d) { return y(d.total_flights) + 3; })

.attr("dy", ".75em")

.text(function(d) { return d.total_flights; });

});
Data Syndrome: Agile Data Science 2.0
Charts
Visualizing data
62
Agile Data Science 2.0 63
Exploring your data through interaction
Reports
Data Syndrome: Agile Data Science 2.0
Creating Interactive Ontologies from Semi-Structured Data
Extracting and visualizing entities
64
Data Syndrome: Agile Data Science 2.0
Home Page
Extracting and decorating entities
65
Data Syndrome: Agile Data Science 2.0
Airline Entity @ /airline/DL
Extracting and decorating entities
66
Data Syndrome: Agile Data Science 2.0
Summarizing all Airline Entities @ /airplanes
Describing entities in aggregate
67
Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 2.0
Describing entities in aggregate
68
Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 3.0
Describing entities in aggregate
69
Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 3.0 - Entity Resolution Problem
Describing entities in aggregate
70
...
AIRBUS
AIRBUS INDUSTRIE
...
EMBRAER
EMBRAER S A
...
GULFSTREAM AEROSPACE
GULFSTREAM AEROSPACE CORP
...
MCDONNELL DOUGLAS
MCDONNELL DOUGLAS AIRCRAFT CO
MCDONNELL DOUGLAS CORPORATION
...
Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 3.0 - Entity Resolution Hack
Describing entities in aggregate
71
# Detect the longest common beginning string in a pair of strings
def longest_common_beginning(s1, s2):
if s1 == s2:
return s1
min_length = min(len(s1), len(s2))
i = 0
while i < min_length:
if s1[i] == s2[i]:
i += 1
else:
break
return s1[0:i]
# Compare two manufacturers, returning a tuple describing the result
def compare_manufacturers(mfrs):
mfr1 = mfrs[0]
mfr2 = mfrs[1]
lcb = longest_common_beginning(mfr1, mfr2)
len_lcb = len(lcb)
record = {
'mfr1': mfr1,
'mfr2': mfr2,
'lcb': lcb,
'len_lcb': len_lcb,
'eq': mfr1 == mfr2
}
return record
# Pair every unique instance of Manufacturer field with every
# other for comparison
comparison_pairs = manufacturer_variety.join(manufacturer_variety)
# Do the comparisons
comparisons = comparison_pairs.rdd.map(compare_manufacturers)
# Matches have > 5 starting chars in common
matches = comparisons.filter(lambda f: f['eq'] == False and f['len_lcb'] > 5)
#
# Now we create a mapping of duplicate keys from their raw value
# to the one we're going to use
#
# 1) Group the matches by the longest common beginning ('lcb')
common_lcbs = matches.groupBy(lambda x: x['lcb'])
# 2) Emit the raw value for each side of the match along with the key, our 'lcb'
mfr1_map = common_lcbs.map(
lambda x: [(y['mfr1'], x[0]) for y in x[1]]).flatMap(lambda x: x)
mfr2_map = common_lcbs.map(
lambda x: [(y['mfr2'], x[0]) for y in x[1]]).flatMap(lambda x: x)
# 3) Combine the two sides of the comparison's records
map_with_dupes = mfr1_map.union(mfr2_map)
# 4) Remove duplicates
mfr_dedupe_mapping = map_with_dupes.distinct()
# 5) Convert mapping to dataframe to join to airplanes dataframe
mapping_dataframe = mfr_dedupe_mapping.toDF()
# 6) Give the mapping column names
mapping_dataframe.registerTempTable("mapping_dataframe")
mapping_dataframe = spark.sql(
"SELECT _1 AS Raw, _2 AS NewManufacturer FROM mapping_dataframe"
)
Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 4.0
Describing entities in aggregate
72
Data Syndrome: Agile Data Science 2.0
Searching Flights
Using ElasticSearch to search for flights
73
Agile Data Science 2.0 74
Predicting the future for fun and profit
Predictions
Data Syndrome: Agile Data Science 2.0
Back End Design
Deep Storage and Spark vs Kafka and Spark Streaming
75
/
Batch Realtime
Historical Data
Train Model Apply Model
Realtime Data
Data Syndrome: Agile Data Science 2.0 76
jQuery in the web client submits a form to create the prediction request, and
then polls another url every few seconds until the prediction is ready. The
request generates a Kafka event, which a Spark Streaming worker processes
by applying the model we trained in batch. Having done so, it inserts a record
for the prediction in MongoDB, where the Flask app sends it to the web client
the next time it polls the server
Front End Design
/flights/delays/predict/classify_realtime/
Data Syndrome: Agile Data Science 2.0
User Interface
Where the user submits prediction requests
77
Data Syndrome: Agile Data Science 2.0
String Vectorization
From properties of items to vector format
78
Data Syndrome: Agile Data Science 2.0 79
scikit-learn was 166. Spark MLlib is very powerful!
http://bit.ly/train_model_spark
190 Line Model
# !/usr/bin/env python



import sys, os, re



# Pass date and base path to main() from airflow

def main(base_path):



# Default to "."

try: base_path

except NameError: base_path = "."

if not base_path:

base_path = "."



APP_NAME = "train_spark_mllib_model.py"



# If there is no SparkSession, create the environment

try:

sc and spark

except NameError as e:

import findspark

findspark.init()

import pyspark

import pyspark.sql



sc = pyspark.SparkContext()

spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()



#

# {

# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",

# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,

# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"

# }

#

from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType

from pyspark.sql.types import StructType, StructField

from pyspark.sql.functions import udf



schema = StructType([

StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0

StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"

StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"

StructField("Carrier", StringType(), True), # "Carrier":"WN"

StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31

StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4

StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365

StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0

StructField("Dest", StringType(), True), # "Dest":"SAN"

StructField("Distance", DoubleType(), True), # "Distance":368.0

StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"

StructField("FlightNum", StringType(), True), # "FlightNum":"6109"

StructField("Origin", StringType(), True), # "Origin":"TUS"

])



input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(

base_path

)

features = spark.read.json(input_path, schema=schema)

features.first()



#

# Check for nulls in features before using Spark ML

#

null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]

cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)

print(list(cols_with_nulls))



#

# Add a Route variable to replace FlightNum

#

from pyspark.sql.functions import lit, concat

features_with_route = features.withColumn(

'Route',

concat(

features.Origin,

lit('-'),

features.Dest

)

)

features_with_route.show(6)



#

# Use pysmark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)

#

from pyspark.ml.feature import Bucketizer



# Setup the Bucketizer

splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]

arrival_bucketizer = Bucketizer(

splits=splits,

inputCol="ArrDelay",

outputCol="ArrDelayBucket"

)



# Save the bucketizer

arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)

arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)



# Apply the bucketizer

ml_bucketized_features = arrival_bucketizer.transform(features_with_route)

ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()



#

# Extract features tools in with pyspark.ml.feature

#

from pyspark.ml.feature import StringIndexer, VectorAssembler



# Turn category fields into indexes

for column in ["Carrier", "Origin", "Dest", "Route"]:

string_indexer = StringIndexer(

inputCol=column,

outputCol=column + "_index"

)



string_indexer_model = string_indexer.fit(ml_bucketized_features)

ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)



# Drop the original column

ml_bucketized_features = ml_bucketized_features.drop(column)



# Save the pipeline model

string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(

base_path,

column

)

string_indexer_model.write().overwrite().save(string_indexer_output_path)



# Combine continuous, numeric fields with indexes of nominal ones

# ...into one feature vector

numeric_columns = [

"DepDelay", "Distance",

"DayOfMonth", "DayOfWeek",

"DayOfYear"]

index_columns = ["Carrier_index", "Origin_index",

"Dest_index", "Route_index"]

vector_assembler = VectorAssembler(

inputCols=numeric_columns + index_columns,

outputCol="Features_vec"

)

final_vectorized_features = vector_assembler.transform(ml_bucketized_features)



# Save the numeric vector assembler

vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)

vector_assembler.write().overwrite().save(vector_assembler_path)



# Drop the index columns

for column in index_columns:

final_vectorized_features = final_vectorized_features.drop(column)



# Inspect the finalized features

final_vectorized_features.show()



# Instantiate and fit random forest classifier on all the data

from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(

featuresCol="Features_vec",

labelCol="ArrDelayBucket",

predictionCol="Prediction",

maxBins=4657,

maxMemoryInMB=1024

)

model = rfc.fit(final_vectorized_features)



# Save the new model over the old one

model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(

base_path

)

model.write().overwrite().save(model_output_path)



# Evaluate model using test data

predictions = model.transform(final_vectorized_features)



from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(

predictionCol="Prediction",

labelCol="ArrDelayBucket",

metricName="accuracy"

)

accuracy = evaluator.evaluate(predictions)

print("Accuracy = {}".format(accuracy))



# Check the distribution of predictions

predictions.groupBy("Prediction").count().show()



# Check a sample

predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)



if __name__ == "__main__":

main(sys.argv[1])

Data Syndrome: Agile Data Science 2.0
Initializing the Environment
Setting up the environment…
80
# !/usr/bin/env python



import sys, os, re



# Pass date and base path to main() from airflow

def main(base_path):



# Default to "."

try: base_path

except NameError: base_path = "."

if not base_path:

base_path = "."



APP_NAME = "train_spark_mllib_model.py"



# If there is no SparkSession, create the environment

try:

sc and spark

except NameError as e:

import findspark

findspark.init()

import pyspark

import pyspark.sql



sc = pyspark.SparkContext()

spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
Data Syndrome: Agile Data Science 2.0
Loading the Training Data
Using DataFrames to load structured data…
81
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType

from pyspark.sql.types import StructType, StructField

from pyspark.sql.functions import udf



schema = StructType([

StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0

StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"

StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"

StructField("Carrier", StringType(), True), # "Carrier":"WN"

StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31

StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4

StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365

StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0

StructField("Dest", StringType(), True), # "Dest":"SAN"

StructField("Distance", DoubleType(), True), # "Distance":368.0

StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"

StructField("FlightNum", StringType(), True), # "FlightNum":"6109"

StructField("Origin", StringType(), True), # "Origin":"TUS"

])



input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(

base_path

)

features = spark.read.json(input_path, schema=schema)

features.first()
Data Syndrome: Agile Data Science 2.0
Checking for Nulls
Checking the data for null values that would crash Spark MLlib
82
#

# Check for nulls in features before using Spark ML

#

null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]

cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)

print(list(cols_with_nulls))
Data Syndrome: Agile Data Science 2.0
Adding a Feature
Using DataFrame.withColumn to add a Route feature to the data…
83
#

# Add a Route variable to replace FlightNum

#

from pyspark.sql.functions import lit, concat

features_with_route = features.withColumn(

'Route',

concat(

features.Origin,

lit('-'),

features.Dest

)

)

features_with_route.show(6)
Data Syndrome: Agile Data Science 2.0
Bucketizing the Prediction Column
Using Bucketizer to convert a continuous variable to a nominal one…
84
#

# Use pysmark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)

#

from pyspark.ml.feature import Bucketizer



# Setup the Bucketizer

splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]

arrival_bucketizer = Bucketizer(

splits=splits,

inputCol="ArrDelay",

outputCol="ArrDelayBucket"

)



# Save the bucketizer

arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)

arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)



# Apply the bucketizer

ml_bucketized_features = arrival_bucketizer.transform(features_with_route)

ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
Data Syndrome: Agile Data Science 2.0
StringIndexing the String Columns
Using StringIndexer to convert nominal fields to numeric ones…
85
#

# Extract features tools in with pyspark.ml.feature

#

from pyspark.ml.feature import StringIndexer, VectorAssembler



# Turn category fields into indexes

for column in ["Carrier", "Origin", "Dest", "Route"]:

string_indexer = StringIndexer(

inputCol=column,

outputCol=column + "_index"

)



string_indexer_model = string_indexer.fit(ml_bucketized_features)

ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)



# Drop the original column

ml_bucketized_features = ml_bucketized_features.drop(column)



# Save the pipeline model

string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(

base_path,

column

)

string_indexer_model.write().overwrite().save(string_indexer_output_path)
Data Syndrome: Agile Data Science 2.0
Vectorizing the Numeric Columns
Combining the numeric fields with VectorAssembler…
86
# Combine continuous, numeric fields with indexes of nominal ones

# ...into one feature vector

numeric_columns = [

"DepDelay", "Distance",

"DayOfMonth", "DayOfWeek",

"DayOfYear"]

index_columns = ["Carrier_index", "Origin_index",

"Dest_index", "Route_index"]

vector_assembler = VectorAssembler(

inputCols=numeric_columns + index_columns,

outputCol="Features_vec"

)

final_vectorized_features = vector_assembler.transform(ml_bucketized_features)



# Save the numeric vector assembler

vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)

vector_assembler.write().overwrite().save(vector_assembler_path)



# Drop the index columns

for column in index_columns:

final_vectorized_features = final_vectorized_features.drop(column)



# Inspect the finalized features

final_vectorized_features.show()
Data Syndrome: Agile Data Science 2.0
Training the Classifier Model
Creating and training a RandomForestClassifier model
87
# Instantiate and fit random forest classifier on all the data

from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(

featuresCol="Features_vec",

labelCol="ArrDelayBucket",

predictionCol="Prediction",

maxBins=4657,

maxMemoryInMB=1024

)

model = rfc.fit(final_vectorized_features)



# Save the new model over the old one

model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(

base_path

)

model.write().overwrite().save(model_output_path)
Data Syndrome: Agile Data Science 2.0
Evaluating the Classifier Model
Using MultiClassificationEvaluator to check the accuracy of the model…
88
# Evaluate model using test data

predictions = model.transform(final_vectorized_features)



from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(

predictionCol="Prediction",

labelCol="ArrDelayBucket",

metricName="accuracy"

)

accuracy = evaluator.evaluate(predictions)

print("Accuracy = {}".format(accuracy))



# Check the distribution of predictions

predictions.groupBy("Prediction").count().show()



# Check a sample

predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
Data Syndrome: Agile Data Science 2.0
Running Main
Just what it looks like…
89
if __name__ == "__main__":

main(sys.argv[1])
Data Syndrome: Agile Data Science 2.0 90
Using the model in realtime via Spark Streaming!
Deploying the Model
Data Syndrome: Agile Data Science 2.0
Deployment of Model Training to Airflow
Using Airflow to batch schedule the training of our model
91
training_dag = DAG(
'agile_data_science_batch_prediction_model_training',
default_args=default_args
)
# We use the same two commands for all our PySpark tasks
pyspark_bash_command = """
spark-submit --master {{ params.master }} 
{{ params.base_path }}/{{ params.filename }} 
{{ params.base_path }}
"""
pyspark_date_bash_command = """
spark-submit --master {{ params.master }} 
{{ params.base_path }}/{{ params.filename }} 
{{ ds }} {{ params.base_path }}
"""
# Gather the training data for our classifier
extract_features_operator = BashOperator(
task_id = "pyspark_extract_features",
bash_command = pyspark_bash_command,
params = {
"master": "local[8]",
"filename": "ch08/extract_features.py",
"base_path": "{}/".format(PROJECT_HOME)
},
dag=training_dag
)
# Train and persist the classifier model
train_classifier_model_operator = BashOperator(
task_id = "pyspark_train_classifier_model",
bash_command = pyspark_bash_command,
params = {
"master": "local[8]",
"filename": "ch08/train_spark_mllib_model.py",
"base_path": "{}/".format(PROJECT_HOME)
},
dag=training_dag
)
# The model training depends on the feature extraction
train_classifier_model_operator.set_upstream(extract_features_operator)
Data Syndrome: Agile Data Science 2.0
Deployment of Model Training to Airflow
Using Airflow to batch schedule the training of our model
92
Data Syndrome: Agile Data Science 2.0
Loading the Models
Loading the models we trained in batch to reproduce the data pipeline
93
# ch08/make_predictions_streaming.py
# Load the arrival delay bucketizer

from pyspark.ml.feature import Bucketizer

arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)

arrival_bucketizer = Bucketizer.load(arrival_bucketizer_path)



# Load all the string field vectorizer pipelines into a dict

from pyspark.ml.feature import StringIndexerModel



string_indexer_models = {}

for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",

"Origin", "Dest", "Route"]:

string_indexer_model_path = "{}/models/string_indexer_model_{}.bin".format(

base_path,

column

)

string_indexer_model = StringIndexerModel.load(string_indexer_model_path)

string_indexer_models[column] = string_indexer_model
Data Syndrome: Agile Data Science 2.0
Loading the Models
Loading the models we trained in batch to reproduce the data pipeline
94
# ch08/make_predictions_streaming.py
# Load the numeric vector assembler

from pyspark.ml.feature import VectorAssembler

vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)

vector_assembler = VectorAssembler.load(vector_assembler_path)



# Load the classifier model

from pyspark.ml.classification import RandomForestClassifier,
RandomForestClassificationModel

random_forest_model_path = "{}/models/spark_random_forest_classifier.flight_delays.
5.0.bin".format(

base_path

)

rfc = RandomForestClassificationModel.load(

random_forest_model_path
Data Syndrome: Agile Data Science 2.0
Connecting to Kafka
Creating a direct stream to the Kafka queue containing our prediction requests
95
#

# Process Prediction Requests in Streaming

#
from pyspark.streaming.kafka import KafkaUtils



stream = KafkaUtils.createDirectStream(

ssc,

[PREDICTION_TOPIC],

{

"metadata.broker.list": BROKERS,

"group.id": "0",

}

)



object_stream = stream.map(lambda x: json.loads(x[1]))

object_stream.pprint()
Data Syndrome: Agile Data Science 2.0
Repeating the Pipeline
Running the prediction requests through the same data flow as the training data
96
row_stream = object_stream.map(

lambda x: Row(

FlightDate=iso8601.parse_date(x['FlightDate']),

Origin=x['Origin'],

Distance=x['Distance'],

DayOfMonth=x['DayOfMonth'],

DayOfYear=x['DayOfYear'],

UUID=x['UUID'],

DepDelay=x['DepDelay'],

DayOfWeek=x['DayOfWeek'],

FlightNum=x['FlightNum'],

Dest=x['Dest'],

Timestamp=iso8601.parse_date(x['Timestamp']),

Carrier=x['Carrier']

)

)

row_stream.pprint()
# Do the classification and store to Mongo

row_stream.foreachRDD(classify_prediction_requests)



ssc.start()

ssc.awaitTermination()
Data Syndrome: Agile Data Science 2.0
Repeating the Pipeline
Running the prediction requests through the same data flow as the training data
97
def classify_prediction_requests(rdd):



from pyspark.sql.types import StringType, IntegerType, DoubleType, DateType, TimestampType

from pyspark.sql.types import StructType, StructField



prediction_request_schema = StructType([

StructField("Carrier", StringType(), True),

StructField("DayOfMonth", IntegerType(), True),

StructField("DayOfWeek", IntegerType(), True),

StructField("DayOfYear", IntegerType(), True),

StructField("DepDelay", DoubleType(), True),

StructField("Dest", StringType(), True),

StructField("Distance", DoubleType(), True),

StructField("FlightDate", DateType(), True),

StructField("FlightNum", StringType(), True),

StructField("Origin", StringType(), True),

StructField("Timestamp", TimestampType(), True),

StructField("UUID", StringType(), True),

])



prediction_requests_df = spark.createDataFrame(rdd, schema=prediction_request_schema)

prediction_requests_df.show()
from pyspark.sql.functions import lit, concat

prediction_requests_with_route = prediction_requests_df.withColumn(

'Route',

concat(

prediction_requests_df.Origin,

lit('-'),

prediction_requests_df.Dest

)

)

prediction_requests_with_route.show(6)
...
Data Syndrome: Agile Data Science 2.0
Repeating the Pipeline
Running the prediction requests through the same data flow as the training data
98
for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",

"Origin", "Dest", "Route"]:

string_indexer_model = string_indexer_models[column]

prediction_requests_with_route = string_indexer_model.transform(prediction_requests_with_route)



# Vectorize numeric columns: DepDelay, Distance and index columns

final_vectorized_features = vector_assembler.transform(prediction_requests_with_route)



# Inspect the vectors

final_vectorized_features.show()



# Drop the individual index columns

index_columns = ["Carrier_index", "DayOfMonth_index", "DayOfWeek_index", "DayOfYear_index",

"Origin_index", "Dest_index", "Route_index"]

for column in index_columns:

final_vectorized_features = final_vectorized_features.drop(column)



# Inspect the finalized features

final_vectorized_features.show()



# Make the prediction

predictions = rfc.transform(final_vectorized_features)
# Drop the features vector and prediction metadata to give the original fields

predictions = predictions.drop("Features_vec")

final_predictions = predictions.drop("indices").drop("values").drop("rawPrediction").drop("probability")



# Inspect the output

final_predictions.show()
Data Syndrome: Agile Data Science 2.0
Storing to Mongo
Putting the result where our web application can access it
99
# Store to Mongo

if final_predictions.count() > 0:

final_predictions.rdd.map(lambda x: x.asDict()).saveToMongoDB(

"mongodb://localhost:27017/agile_data_science.flight_delay_classification_response"

)
Data Syndrome: Agile Data Science 2.0
Final Result of Deployed Prediction
The bang for the buck!
100
Data Syndrome: Agile Data Science 2.0 101
Experimental setup for iteratively improving the predictive model
Improving the Model
Data Syndrome: Agile Data Science 2.0
Experiment Setup
Necessary to improve model
102
Data Syndrome: Agile Data Science 2.0 103
155 additional lines to setup an experiment
and add 3 new features to improvement the model
http://bit.ly/improved_model_spark
345 L.O.C.
# !/usr/bin/env python



import sys, os, re

import json

import datetime, iso8601

from tabulate import tabulate



# Pass date and base path to main() from airflow

def main(base_path):

APP_NAME = "train_spark_mllib_model.py"



# If there is no SparkSession, create the environment

try:

sc and spark

except NameError as e:

import findspark

findspark.init()

import pyspark

import pyspark.sql



sc = pyspark.SparkContext()

spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()



#

# {

# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",

# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,

# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"

# }

#

from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType

from pyspark.sql.types import StructType, StructField

from pyspark.sql.functions import udf



schema = StructType([

StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0

StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"

StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"

StructField("Carrier", StringType(), True), # "Carrier":"WN"

StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31

StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4

StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365

StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0

StructField("Dest", StringType(), True), # "Dest":"SAN"

StructField("Distance", DoubleType(), True), # "Distance":368.0

StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"

StructField("FlightNum", StringType(), True), # "FlightNum":"6109"

StructField("Origin", StringType(), True), # "Origin":"TUS"

])



input_path = "{}/data/simple_flight_delay_features.json".format(

base_path

)

features = spark.read.json(input_path, schema=schema)

features.first()



#

# Add a Route variable to replace FlightNum

#

from pyspark.sql.functions import lit, concat

features_with_route = features.withColumn(

'Route',

concat(

features.Origin,

lit('-'),

features.Dest

)

)

features_with_route.show(6)



#

# Add the hour of day of scheduled arrival/departure

#

from pyspark.sql.functions import hour

features_with_hour = features_with_route.withColumn(

"CRSDepHourOfDay",

hour(features.CRSDepTime)

)

features_with_hour = features_with_hour.withColumn(

"CRSArrHourOfDay",

hour(features.CRSArrTime)

)

features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show()



#

# Use pysmark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)

#

from pyspark.ml.feature import Bucketizer



# Setup the Bucketizer

splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]

arrival_bucketizer = Bucketizer(

splits=splits,

inputCol="ArrDelay",

outputCol="ArrDelayBucket"

)



# Save the model

arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)

arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)



# Apply the model

ml_bucketized_features = arrival_bucketizer.transform(features_with_hour)

ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()



#

# Extract features tools in with pyspark.ml.feature

#

from pyspark.ml.feature import StringIndexer, VectorAssembler



# Turn category fields into indexes

for column in ["Carrier", "Origin", "Dest", "Route"]:

string_indexer = StringIndexer(

inputCol=column,

outputCol=column + "_index"

)



string_indexer_model = string_indexer.fit(ml_bucketized_features)

ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)

# Save the pipeline model

string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format(

base_path,

column

)

string_indexer_model.write().overwrite().save(string_indexer_output_path)



# Combine continuous, numeric fields with indexes of nominal ones

# ...into one feature vector

numeric_columns = [

"DepDelay", "Distance",

"DayOfMonth", "DayOfWeek",

"DayOfYear", "CRSDepHourOfDay",

"CRSArrHourOfDay"]

index_columns = ["Carrier_index", "Origin_index",

"Dest_index", "Route_index"]

vector_assembler = VectorAssembler(

inputCols=numeric_columns + index_columns,

outputCol="Features_vec"

)

final_vectorized_features = vector_assembler.transform(ml_bucketized_features)



# Save the numeric vector assembler

vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path)

vector_assembler.write().overwrite().save(vector_assembler_path)



# Drop the index columns

for column in index_columns:

final_vectorized_features = final_vectorized_features.drop(column)



# Inspect the finalized features

final_vectorized_features.show()



#

# Cross validate, train and evaluate classifier: loop 5 times for 4 metrics

#



from collections import defaultdict

scores = defaultdict(list)

feature_importances = defaultdict(list)

metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]

split_count = 3



for i in range(1, split_count + 1):

print("nRun {} out of {} of test/train splits in cross validation...".format(

i,

split_count,

)

)



# Test/train split

training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])



# Instantiate and fit random forest classifier on all the data

from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(

featuresCol="Features_vec",

labelCol="ArrDelayBucket",

predictionCol="Prediction",

maxBins=4657,

)

model = rfc.fit(training_data)



# Save the new model over the old one

model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(

base_path

)

model.write().overwrite().save(model_output_path)



# Evaluate model using test data

predictions = model.transform(test_data)



# Evaluate this split's results for each metric

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

for metric_name in metric_names:

evaluator = MulticlassClassificationEvaluator(

labelCol="ArrDelayBucket",

predictionCol="Prediction",

metricName=metric_name

)

score = evaluator.evaluate(predictions)



scores[metric_name].append(score)

print("{} = {}".format(metric_name, score))



#

# Collect feature importances

#

feature_names = vector_assembler.getInputCols()

feature_importance_list = model.featureImportances

for feature_name, feature_importance in zip(feature_names, feature_importance_list):

feature_importances[feature_name].append(feature_importance)



#

# Evaluate average and STD of each metric and print a table

#

import numpy as np

score_averages = defaultdict(float)



# Compute the table data

average_stds = [] # ha

for metric_name in metric_names:

metric_scores = scores[metric_name]



average_accuracy = sum(metric_scores) / len(metric_scores)

score_averages[metric_name] = average_accuracy



std_accuracy = np.std(metric_scores)



average_stds.append((metric_name, average_accuracy, std_accuracy))



# Print the table

print("nExperiment Log")

print("--------------")

print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))



#

# Persist the score to a sccore log that exists between runs

#

import pickle
# Load the score log or initialize an empty one

try:

score_log_filename = "{}/models/score_log.pickle".format(base_path)

score_log = pickle.load(open(score_log_filename, "rb"))

if not isinstance(score_log, list):

score_log = []

except IOError:

score_log = []



# Compute the existing score log entry

score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}



# Compute and display the change in score for each metric

try:

last_log = score_log[-1]

except (IndexError, TypeError, AttributeError):

last_log = score_log_entry



experiment_report = []

for metric_name in metric_names:

run_delta = score_log_entry[metric_name] - last_log[metric_name]

experiment_report.append((metric_name, run_delta))



print("nExperiment Report")

print("-----------------")

print(tabulate(experiment_report, headers=["Metric", "Score"]))



# Append the existing average scores to the log

score_log.append(score_log_entry)



# Persist the log for next run

pickle.dump(score_log, open(score_log_filename, "wb"))



#

# Analyze and report feature importance changes

#



# Compute averages for each feature

feature_importance_entry = defaultdict(float)

for feature_name, value_list in feature_importances.items():

average_importance = sum(value_list) / len(value_list)

feature_importance_entry[feature_name] = average_importance



# Sort the feature importances in descending order and print

import operator

sorted_feature_importances = sorted(

feature_importance_entry.items(),

key=operator.itemgetter(1),

reverse=True

)



print("nFeature Importances")

print("-------------------")

print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))



#

# Compare this run's feature importances with the previous run's

#



# Load the feature importance log or initialize an empty one

try:

feature_log_filename = "{}/models/feature_log.pickle".format(base_path)

feature_log = pickle.load(open(feature_log_filename, "rb"))

if not isinstance(feature_log, list):

feature_log = []

except IOError:

feature_log = []



# Compute and display the change in score for each feature

try:

last_feature_log = feature_log[-1]

except (IndexError, TypeError, AttributeError):

last_feature_log = defaultdict(float)

for feature_name, importance in feature_importance_entry.items():

last_feature_log[feature_name] = importance



# Compute the deltas

feature_deltas = {}

for feature_name in feature_importances.keys():

run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]

feature_deltas[feature_name] = run_delta



# Sort feature deltas, biggest change first

import operator

sorted_feature_deltas = sorted(

feature_deltas.items(),

key=operator.itemgetter(1),

reverse=True

)



# Display sorted feature deltas

print("nFeature Importance Delta Report")

print("-------------------------------")

print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))



# Append the existing average deltas to the log

feature_log.append(feature_importance_entry)



# Persist the log for next run

pickle.dump(feature_log, open(feature_log_filename, "wb"))



if __name__ == "__main__":

main(sys.argv[1])

Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
104
from collections import defaultdict
scores = defaultdict(list)
feature_importances = defaultdict(list)
metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
split_count = 3
for i in range(1, split_count + 1):
print("nRun {} out of {} of test/train splits in cross validation...".format(
i,
split_count,
)
)
# Test/train split
training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])
# Instantiate and fit random forest classifier on all the data
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(
featuresCol="Features_vec",
labelCol="ArrDelayBucket",
predictionCol="Prediction",
maxBins=4657,
)
model = rfc.fit(training_data)
Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
105
# Save the new model over the old one
model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(
base_path
)
model.write().overwrite().save(model_output_path)
# Evaluate model using test data
predictions = model.transform(test_data)
# Evaluate this split's results for each metric
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
for metric_name in metric_names:
evaluator = MulticlassClassificationEvaluator(
labelCol="ArrDelayBucket",
predictionCol="Prediction",
metricName=metric_name
)
score = evaluator.evaluate(predictions)
scores[metric_name].append(score)
print("{} = {}".format(metric_name, score))
Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
106
#
# Collect feature importances
#
feature_names = vector_assembler.getInputCols()
feature_importance_list = model.featureImportances
for feature_name, feature_importance in zip(feature_names, feature_importance_list):
feature_importances[feature_name].append(feature_importance)
#
# Evaluate average and STD of each metric and print a table
#
import numpy as np
score_averages = defaultdict(float)
# Compute the table data
average_stds = [] # ha
for metric_name in metric_names:
metric_scores = scores[metric_name]
average_accuracy = sum(metric_scores) / len(metric_scores)
score_averages[metric_name] = average_accuracy
std_accuracy = np.std(metric_scores)
average_stds.append((metric_name, average_accuracy, std_accuracy))
Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
107
# Print the table
print("nExperiment Log")
print("--------------")
print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))
#
# Persist the score to a sccore log that exists between runs
#
import pickle
# Load the score log or initialize an empty one
try:
score_log_filename = "{}/models/score_log.pickle".format(base_path)
score_log = pickle.load(open(score_log_filename, "rb"))
if not isinstance(score_log, list):
score_log = []
except IOError:
score_log = []
# Compute the existing score log entry
score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}
Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
108
# Compute and display the change in score for each metric
try:
last_log = score_log[-1]
except (IndexError, TypeError, AttributeError):
last_log = score_log_entry
experiment_report = []
for metric_name in metric_names:
run_delta = score_log_entry[metric_name] - last_log[metric_name]
experiment_report.append((metric_name, run_delta))
print("nExperiment Report")
print("-----------------")
print(tabulate(experiment_report, headers=["Metric", "Score"]))
# Append the existing average scores to the log
score_log.append(score_log_entry)
# Persist the log for next run
pickle.dump(score_log, open(score_log_filename, "wb"))
Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
109
#
# Analyze and report feature importance changes
#
# Compute averages for each feature
feature_importance_entry = defaultdict(float)
for feature_name, value_list in feature_importances.items():
average_importance = sum(value_list) / len(value_list)
feature_importance_entry[feature_name] = average_importance
# Sort the feature importances in descending order and print
import operator
sorted_feature_importances = sorted(
feature_importance_entry.items(),
key=operator.itemgetter(1),
reverse=True
)
print("nFeature Importances")
print("-------------------")
print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))
Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
110
#
# Compare this run's feature importances with the previous run's
#
# Load the feature importance log or initialize an empty one
try:
feature_log_filename = "{}/models/feature_log.pickle".format(base_path)
feature_log = pickle.load(open(feature_log_filename, "rb"))
if not isinstance(feature_log, list):
feature_log = []
except IOError:
feature_log = []
# Compute and display the change in score for each feature
try:
last_feature_log = feature_log[-1]
except (IndexError, TypeError, AttributeError):
last_feature_log = defaultdict(float)
for feature_name, importance in feature_importance_entry.items():
last_feature_log[feature_name] = importance
Data Syndrome: Agile Data Science 2.0
Creating an Experiment
Cross validate, train and evaluate classifier
111
# Compute the deltas
feature_deltas = {}
for feature_name in feature_importances.keys():
run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]
feature_deltas[feature_name] = run_delta
# Sort feature deltas, biggest change first
import operator
sorted_feature_deltas = sorted(
feature_deltas.items(),
key=operator.itemgetter(1),
reverse=True
)
# Display sorted feature deltas
print("nFeature Importance Delta Report")
print("-------------------------------")
print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))
# Append the existing average deltas to the log
feature_log.append(feature_importance_entry)
# Persist the log for next run
pickle.dump(feature_log, open(feature_log_filename, "wb"))
Data Syndrome: Agile Data Science 2.0
Running an Experiment
Cross validate, train and evaluate classifier
112
Experiment Log
--------------
Metric Average STD
----------------- --------- -----------
accuracy 0.594443 0.000382382
weightedPrecision 0.642419 0.00352101
weightedRecall 0.594443 0.000382382
f1 0.522397 0.000438121
Data Syndrome: Agile Data Science 2.0
Comparing Experiments
Cross validate, train and evaluate classifier
113
Experiment Report
-----------------
Metric Score
----------------- -----------
accuracy 0.00300548
weightedPrecision -0.00592227
weightedRecall 0.00300548
f1 -0.0105553
Data Syndrome: Agile Data Science 2.0
Feature Importances
Cross validate, train and evaluate classifier
114
Feature Importances
-------------------
Name Importance
--------------- ------------
DepDelay 0.882216
Route_index 0.0571401
Origin_index 0.0142741
Distance 0.0134583
Dest_index 0.00745796
DayOfYear 0.00544761
Carrier_index 0.00454088
DayOfMonth 9.31109e-05
DayOfWeek 5.2597e-05
Data Syndrome: Agile Data Science 2.0
Comparing Feature Importances
Cross validate, train and evaluate classifier
115
Feature Importance Delta Report
-------------------------------
Feature Delta
--------------- ------------
CRSArrHourOfDay 0.00982592
DepDelay 0.00671702
CRSDepHourOfDay 0.0012717
DayOfWeek -2.72792e-05
DayOfMonth -8.81635e-05
Origin_index -0.00152788
Distance -0.00188322
Route_index -0.00214609
Dest_index -0.0032327
DayOfYear -0.00377184
Carrier_index -0.00513747
Data Syndrome: Agile Data Science 2.0 116
Next steps for learning more about Agile Data Science 2.0
Next Steps
Building Full-Stack Data Analytics Applications with Spark
http://bit.ly/agile_data_science
Available Now on O’Reilly Safari: http://bit.ly/agile_data_safari
Agile Data Science 2.0
Agile Data Science 2.0 118
Realtime Predictive
Analytics
Rapidly learn to build entire predictive systems driven by
Kafka, PySpark, Speak Streaming, Spark MLlib and with a web
front-end using Python/Flask and JQuery.
Available for purchase at http://datasyndrome.com/video
Data Syndrome Russell Jurney
Principal Consultant
Email : rjurney@datasyndrome.com
Web : datasyndrome.com
Data Syndrome, LLC
Product Consulting
We build analytics products
and systems consisting of
big data viz, predictions,
recommendations, reports
and search.
Corporate Training
We offer training courses
for data scientists and
engineers and data
science teams,
Video Training
We offer video training
courses that rapidly
acclimate you with a
technology and technique.

More Related Content

What's hot

Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoopRussell Jurney
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Miguel González-Fierro
 
Social Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainSocial Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainRussell Jurney
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainRussell Jurney
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainRussell Jurney
 
Reproducible, Open Data Science in the Life Sciences
Reproducible, Open  Data Science in the  Life SciencesReproducible, Open  Data Science in the  Life Sciences
Reproducible, Open Data Science in the Life SciencesEamonn Maguire
 
Data science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionData science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionAdnan Masood
 
HEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 TalkHEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 TalkEamonn Maguire
 
Self Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System AccuracySelf Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System AccuracyDataWorks Summit
 
TUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsTUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsHong-Linh Truong
 
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordPuppet
 
NoSQL: The first New Jakarta EE Specification (DWX 2019)
NoSQL: The first New Jakarta EE Specification (DWX 2019)NoSQL: The first New Jakarta EE Specification (DWX 2019)
NoSQL: The first New Jakarta EE Specification (DWX 2019)Werner Keil
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Andy Petrella
 
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationSeeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationGreg Goltsov
 

What's hot (18)

Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Social Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainSocial Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem Domain
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domain
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domain
 
Reproducible, Open Data Science in the Life Sciences
Reproducible, Open  Data Science in the  Life SciencesReproducible, Open  Data Science in the  Life Sciences
Reproducible, Open Data Science in the Life Sciences
 
Data science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionData science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief Introduction
 
HEPData workshop talk
HEPData workshop talkHEPData workshop talk
HEPData workshop talk
 
HEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 TalkHEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 Talk
 
Self Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System AccuracySelf Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System Accuracy
 
HEPData
HEPDataHEPData
HEPData
 
TUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsTUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflows
 
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben Ford
 
NoSQL: The first New Jakarta EE Specification (DWX 2019)
NoSQL: The first New Jakarta EE Specification (DWX 2019)NoSQL: The first New Jakarta EE Specification (DWX 2019)
NoSQL: The first New Jakarta EE Specification (DWX 2019)
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
 
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationSeeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
 
My Master's Thesis
My Master's ThesisMy Master's Thesis
My Master's Thesis
 
Demo Eclipse Science
Demo Eclipse ScienceDemo Eclipse Science
Demo Eclipse Science
 

Similar to Agile Data Science 2.0

Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkRussell Jurney
 
OLAP on the Cloud with Azure Databricks and Azure Synapse
OLAP on the Cloud with Azure Databricks and Azure SynapseOLAP on the Cloud with Azure Databricks and Azure Synapse
OLAP on the Cloud with Azure Databricks and Azure SynapseAtScale
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Ben ford intro
Ben ford introBen ford intro
Ben ford introPuppet
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Open Source 101 2022 - MySQL Indexes and Histograms
Open Source 101 2022 - MySQL Indexes and HistogramsOpen Source 101 2022 - MySQL Indexes and Histograms
Open Source 101 2022 - MySQL Indexes and HistogramsFrederic Descamps
 
Intershop Commerce Management with Microsoft SQL Server
Intershop Commerce Management with Microsoft SQL ServerIntershop Commerce Management with Microsoft SQL Server
Intershop Commerce Management with Microsoft SQL ServerMauro Boffardi
 
RivieraJUG - MySQL Indexes and Histograms
RivieraJUG - MySQL Indexes and HistogramsRivieraJUG - MySQL Indexes and Histograms
RivieraJUG - MySQL Indexes and HistogramsFrederic Descamps
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureMark Kromer
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptxFedoRam1
 
Discover deep insights with Salesforce Einstein Analytics and Discovery
Discover deep insights with Salesforce Einstein Analytics and DiscoveryDiscover deep insights with Salesforce Einstein Analytics and Discovery
Discover deep insights with Salesforce Einstein Analytics and DiscoveryNew Delhi Salesforce Developer Group
 
BI 2008 Simple
BI 2008 SimpleBI 2008 Simple
BI 2008 Simplellangit
 

Similar to Agile Data Science 2.0 (20)

Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySpark
 
OLAP on the Cloud with Azure Databricks and Azure Synapse
OLAP on the Cloud with Azure Databricks and Azure SynapseOLAP on the Cloud with Azure Databricks and Azure Synapse
OLAP on the Cloud with Azure Databricks and Azure Synapse
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Ben ford intro
Ben ford introBen ford intro
Ben ford intro
 
Data herding
Data herdingData herding
Data herding
 
Data herding
Data herdingData herding
Data herding
 
User 2013-oracle-big-data-analytics-1971985
User 2013-oracle-big-data-analytics-1971985User 2013-oracle-big-data-analytics-1971985
User 2013-oracle-big-data-analytics-1971985
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
BigData_Krishna Kumar Sharma
BigData_Krishna Kumar SharmaBigData_Krishna Kumar Sharma
BigData_Krishna Kumar Sharma
 
Open Source 101 2022 - MySQL Indexes and Histograms
Open Source 101 2022 - MySQL Indexes and HistogramsOpen Source 101 2022 - MySQL Indexes and Histograms
Open Source 101 2022 - MySQL Indexes and Histograms
 
Intershop Commerce Management with Microsoft SQL Server
Intershop Commerce Management with Microsoft SQL ServerIntershop Commerce Management with Microsoft SQL Server
Intershop Commerce Management with Microsoft SQL Server
 
RivieraJUG - MySQL Indexes and Histograms
RivieraJUG - MySQL Indexes and HistogramsRivieraJUG - MySQL Indexes and Histograms
RivieraJUG - MySQL Indexes and Histograms
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
 
Discover deep insights with Salesforce Einstein Analytics and Discovery
Discover deep insights with Salesforce Einstein Analytics and DiscoveryDiscover deep insights with Salesforce Einstein Analytics and Discovery
Discover deep insights with Salesforce Einstein Analytics and Discovery
 
BI 2008 Simple
BI 2008 SimpleBI 2008 Simple
BI 2008 Simple
 

Recently uploaded

Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 

Recently uploaded (20)

Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 

Agile Data Science 2.0

  • 1. Building Full Stack Data Analytics Applications with Kafka and Spark Agile Data Science 2.0 http://bit.ly/agile_data_slides_6
  • 2. Agile Data Science 2.0 Russell Jurney 2 Data Engineer Data Scientist Visualization Software Engineer 85% 85% 85% Writer 85% Teacher 50% Russell Jurney is a veteran data scientist and thought leader. He coined the term Agile Data Science in the book of that name from O’Reilly in 2012, which outlines the first agile development methodology for data science. Russell has constructed numerous full-stack analytics products over the past ten years and now works with clients helping them extract value from their data assets. Russell Jurney Skill Principal Consultant at Data Syndrome Russell Jurney Data Syndrome, LLC Email : russell.jurney@gmail.com Web : datasyndrome.com Principal Consultant
  • 3. Lorem Ipsum dolor siamet suame this placeholder for text can simply random text. It has roots in a piece of classical. variazioni deiwords which whichhtly. ven on your zuniga merida della is not denis. Product Consulting We build analytics products and systems consisting of big data viz, predictions, recommendations, reports and search. Corporate Training We offer training courses for data scientists and engineers and data science teams, Video Training We offer video training courses that rapidly acclimate you with a technology and technique.
  • 4. Agile Data Science 2.0 4 What makes data science “agile data science”? Theory
  • 5. Agile Data Science 2.0 Agile Manifesto 5 Four principles for software engineering
  • 6. Agile Data Science 2.0 6 Yes. Building applications is a fundamental skill for today’s data scientist. Data Products or Data Science?
  • 7. Agile Data Science 2.0 7 Data Products or Data Science?
  • 8. Agile Data Science 2.0 8 If someone else has to start over and rebuild it, it ain’t agile. Big Data or Data Science?
  • 9. Agile Data Science 2.0 9 Goal of Methodology The goal of agile data science in <140 characters: to document and guide exploratory data analysis to discover and follow the critical path to a compelling product.
  • 10. Agile Data Science 2.0 Agile Data Science Manifesto 10 Seven Principles for Agile Data Science Discover and pursue the critical path to a killer product Iterate, iterate, iterate: tables, charts, reports, predictions1. Integrate the tyrannical opinion of data in product management4. Get Meta. Describe the process, not just the end-state7. Ship intermediate output. Even failed experiments have output2. Climb up and down the data-value pyramid as we work5. Prototype experiments over implementing tasks3. 6.
  • 11. Agile Data Science 2.0 Iterate, iterate, iterate: tables, charts, reports, predictions 11 Seven Principles for Agile Data Science
  • 12. Agile Data Science 2.0 Ship intermediate output. Even failed experiments have output. 12 Seven Principles for Agile Data Science —>
  • 13. Agile Data Science 2.0 Prototype experiments over implementing tasks 13 Seven Principles for Agile Data Science
  • 14. Agile Data Science 2.0 Climb up and down the data-value pyramid 14 Seven Principles for Agile Data Science People will pay more for the things towards the top, but you need the things on the bottom to have the things above. They are foundational. See: Maslow’s Theory of Needs. Data Value Pyramid
  • 15. Agile Data Science 2.0 Discover and pursue the critical path to a killer product 15 Seven Principles for Agile Data Science In analytics, the end-goal moves or is complex in nature, so we model as a network of tasks rather than as a strictly linear process. Critical Path
  • 16. Agile Data Science 2.0 Get Meta. Describe the process, not just the end state 16 Seven Principles for Agile Data Science
  • 17. Agile Data Science 2.0 Doing Agile Wrong 17 iterating between sprints, not within them is a bad idea.
  • 18. Agile Data Science 2.0 Data Science Teams 18 The roles in a data science team are many, necessitating generalists
  • 19. Agile Data Science 2.0 Data Science Releases 19 Snowballing releases towards a broader audience
  • 20. Agile Data Science 2.0 20 Decorate your environment by visualizing your dataset! Collage
  • 21. Agile Data Science 2.0 21 Things we use to build the apps Tools
  • 22. Agile Data Science 2.0 Agile Data Science 2.0 Stack 22 Apache Spark Apache Kafka MongoDB Batch and Realtime Realtime Queue Document Store Flask Simple Web App Example of a high productivity stack for “big” data applications ElasticSearch Search d3.js Visualization
  • 23. Agile Data Science 2.0 Flow of Data Processing 23 Tools and processes in collecting, refining, publishing and decorating data {“hello”: “world”}
  • 24. Agile Data Science 2.0 Architecture of Agile Data Science 24 How the system comes together
  • 25. Data Syndrome: Agile Data Science 2.0 Apache Spark Ecosystem 25 HDFS, Amazon S3, Spark, Spark SQL, Spark MLlib, Spark Streaming /
  • 26. Agile Data Science 2.0 26 SQL or dataflow programming? Programming Models
  • 27. Agile Data Science 2.0 27 Describing what you want and letting the planner figure out how SQL SELECT associations2.object_id, associations2.term_id, associations2.cat_ID, associations2.term_taxonomy_id
 FROM (SELECT objects_tags.object_id, objects_tags.term_id, wp_cb_tags2cats.cat_ID, categories.term_taxonomy_id
 FROM (SELECT wp_term_relationships.object_id, wp_term_taxonomy.term_id, wp_term_taxonomy.term_taxonomy_id
 FROM wp_term_relationships
 LEFT JOIN wp_term_taxonomy ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
 ORDER BY object_id ASC, term_id ASC) 
 AS objects_tags
 LEFT JOIN wp_cb_tags2cats ON objects_tags.term_id = wp_cb_tags2cats.tag_ID
 LEFT JOIN (SELECT wp_term_relationships.object_id, wp_term_taxonomy.term_id as cat_ID, wp_term_taxonomy.term_taxonomy_id
 FROM wp_term_relationships
 LEFT JOIN wp_term_taxonomy ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
 WHERE wp_term_taxonomy.taxonomy = 'category'
 GROUP BY object_id, cat_ID, term_taxonomy_id
 ORDER BY object_id, cat_ID, term_taxonomy_id) 
 AS categories on wp_cb_tags2cats.cat_ID = categories.term_id
 WHERE objects_tags.term_id = wp_cb_tags2cats.tag_ID
 GROUP BY object_id, term_id, cat_ID, term_taxonomy_id
 ORDER BY object_id ASC, term_id ASC, cat_ID ASC) 
 AS associations2
 LEFT JOIN categories ON associations2.object_id = categories.object_id
 WHERE associations2.cat_ID <> categories.cat_ID
 GROUP BY object_id, term_id, cat_ID, term_taxonomy_id
 ORDER BY object_id, term_id, cat_ID, term_taxonomy_id
  • 28. Agile Data Science 2.0 28 Flowing data through operations to effect change Dataflow Programming
  • 29. Agile Data Science 2.0 29 The best of both worlds! SQL AND Dataflow Programming # Flights that were late arriving...
 late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrD elayMinutes > 0)
 total_late_arrivals = late_arrivals.count()
 
 # Flights that left late but made up time to arrive on time...
 on_time_heros = on_time_dataframe.filter(
 (on_time_dataframe.DepDelayMinutes > 0)
 &
 (on_time_dataframe.ArrDelayMinutes <= 0)
 )
 total_on_time_heros = on_time_heros.count()
 
 # Get the percentage of flights that are late, rounded to 1 decimal place
 pct_late = round((total_late_arrivals / (total_flights * 1.0)) * 100, 1)
 
 print("Total flights: {:,}".format(total_flights))
 print("Late departures: {:,}".format(total_late_departures))
 print("Late arrivals: {:,}".format(total_late_arrivals))
 print("Recoveries: {:,}".format(total_on_time_heros))
 print("Percentage Late: {}%".format(pct_late))
 
 # Why are flights late? Lets look at some delayed flights and the delay causes
 late_flights = spark.sql("""
 SELECT
 ArrDelayMinutes,
 WeatherDelay,
 CarrierDelay,
 NASDelay,
 SecurityDelay,
 LateAircraftDelay
 FROM
 on_time_performance
 WHERE
 WeatherDelay IS NOT NULL
 OR
 CarrierDelay IS NOT NULL
 OR
 NASDelay IS NOT NULL
 OR
 SecurityDelay IS NOT NULL
 OR
 LateAircraftDelay IS NOT NULL
 ORDER BY
 FlightDate
 """)
 late_flights.sample(False, 0.01).show() # Calculate the percentage contribution to delay for each source
 total_delays = spark.sql("""
 SELECT
 ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_weather_delay,
 ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_carrier_delay,
 ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay,
 ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_security_delay,
 ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_late_aircraft_delay
 FROM on_time_performance
 """)
 total_delays.show()
 
 # Generate a histogram of the weather and carrier delays
 weather_delay_histogram = on_time_dataframe
 .select("WeatherDelay")
 .rdd
 .flatMap(lambda x: x)
 .histogram(10)
 
 print("{}n{}".format(weather_delay_histogram[0], weather_delay_histogram[1]))
 
 # Eyeball the first to define our buckets
 weather_delay_histogram = on_time_dataframe
 .select("WeatherDelay")
 .rdd
 .flatMap(lambda x: x)
 .histogram([1, 15, 30, 60, 120, 240, 480, 720, 24*60.0])
 print(weather_delay_histogram) # Transform the data into something easily consumed by d3
 record = {'key': 1, 'data': []}
 for label, count in zip(weather_delay_histogram[0], weather_delay_histogram[1]):
 record['data'].append(
 {
 'label': label,
 'count': count
 }
 )
 
 # Save to Mongo directly, since this is a Tuple not a dataframe or RDD
 from pymongo import MongoClient
 client = MongoClient()
 client.relato.weather_delay_histogram.insert_one(record)
  • 30. Agile Data Science 2.0 30 Batch processing still dominant for charts, reports, search indexes and training predictive models Apache Airflow for Batch Scheduling
  • 31. Agile Data Science 2.0 31 FAA on-time performance data Data
  • 32. Data Syndrome: Agile Data Science 2.0 Collect and Serialize Events in JSON I never regret using JSON 32
  • 33. Data Syndrome: Agile Data Science 2.0 FAA On-Time Performance Records 95% of commercial flights 33http://www.transtats.bts.gov/Fields.asp?table_id=236
  • 34. Data Syndrome: Agile Data Science 2.0 FAA On-Time Performance Records 95% of commercial flights 34 "Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum", "OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips", "OriginStateName","OriginWac","DestAirportID","DestAirportSeqID","DestCityMarketID","Dest","DestCityName","DestState", "DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups", "DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15", "ArrivalDelayGroups","ArrTimeBlk","Cancelled","CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime", "Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay", "FirstDepTime","TotalAddGTime","LongestAddGTime","DivAirportLandings","DivReachedDest","DivActualElapsedTime","DivArrDelay", "DivDistance","Div1Airport","Div1AirportID","Div1AirportSeqID","Div1WheelsOn","Div1TotalGTime","Div1LongestGTime", "Div1WheelsOff","Div1TailNum","Div2Airport","Div2AirportID","Div2AirportSeqID","Div2WheelsOn","Div2TotalGTime", "Div2LongestGTime","Div2WheelsOff","Div2TailNum","Div3Airport","Div3AirportID","Div3AirportSeqID","Div3WheelsOn", "Div3TotalGTime","Div3LongestGTime","Div3WheelsOff","Div3TailNum","Div4Airport","Div4AirportID","Div4AirportSeqID", "Div4WheelsOn","Div4TotalGTime","Div4LongestGTime","Div4WheelsOff","Div4TailNum","Div5Airport","Div5AirportID", "Div5AirportSeqID","Div5WheelsOn","Div5TotalGTime","Div5LongestGTime","Div5WheelsOff","Div5TailNum"
  • 35. Data Syndrome: Agile Data Science 2.0 openflights.org Database Airports, Airlines, Routes 35
  • 36. Data Syndrome: Agile Data Science 2.0 Scraping the FAA Registry Airplane Data by Tail Number 36
  • 37. Data Syndrome: Agile Data Science 2.0 Wikipedia Airlines Entries Descriptions of Airlines 37
  • 38. Data Syndrome: Agile Data Science 2.0 National Centers for Environmental Information Historical Weather Observations 38
  • 39. Agile Data Science 2.0 39 How to store it and how to fetch it… Data Structure and Access Patterns
  • 40. Data Syndrome: Agile Data Science 2.0 First Order Form Storing groups of documents in key/value stores 40 Key/Value and some Document Stores Partition Tolerance No Single Master
  • 41. Data Syndrome: Agile Data Science 2.0 First Order Form Storing groups of documents in key/value stores 41
  • 42. Data Syndrome: Agile Data Science 2.0 Second Order Form Using column stores and compound keys to enable range scans 42 Range Scans for Fun and Profit One or more Masters
  • 43. Data Syndrome: Agile Data Science 2.0 Second Order Form Using column stores and compound keys to enable range scans 43 # Achieve everything many access patterns via key composition: Requirement: SELECT FLIGHTS WHERE FlightDate < X AND FlightDate> Y Compose Key: TailNum + FlightDate, ex. N16954-2015-12-16 Range Scan: FROM: N16954-2015-12-01 TO: N16954-2015-01-01
  • 44. Data Syndrome: Agile Data Science 2.0 Third Order Form Using databases with B-Trees to query and analyze raw records 44 SELECT COUNT(*), SUM(*) … FROM … GROUP BY … WHERE …
  • 45. Agile Data Science 2.0 45 Working our way up the data value pyramid Climbing the Stack
  • 46. Agile Data Science 2.0 46 Starting by “plumbing” the system from end to end Plumbing
  • 47. Data Syndrome: Agile Data Science 2.0 Publishing Flight Records Plumbing our master records through to the web 47
  • 48. Data Syndrome: Agile Data Science 2.0 Publishing Flight Records to MongoDB Plumbing our master records through to the web 48 import pymongo
 import pymongo_spark
 # Important: activate pymongo_spark.
 pymongo_spark.activate()
 # Load the parquet file on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
 # Convert to RDD of dicts and save to MongoDB
 as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict()) as_dict.saveToMongoDB(‘mongodb://localhost:27017/agile_data_science.on_time_performance')
  • 49. Data Syndrome: Agile Data Science 2.0 Publishing Flight Records to ElasticSearch Plumbing our master records through to the web 49 # Load the parquet file
 on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
 
 # Save the DataFrame to Elasticsearch
 on_time_dataframe.write.format("org.elasticsearch.spark.sql")
 .option("es.resource","agile_data_science/on_time_performance")
 .option("es.batch.size.entries","100")
 .mode("overwrite")
 .save()
  • 50. Data Syndrome: Agile Data Science 2.0 Putting Records on the Web Plumbing our master records through to the web 50 from flask import Flask, render_template, request
 from pymongo import MongoClient
 from bson import json_util
 
 # Set up Flask and Mongo
 app = Flask(__name__)
 client = MongoClient()
 
 # Controller: Fetch an email and display it
 @app.route("/on_time_performance")
 def on_time_performance():
 
 carrier = request.args.get('Carrier')
 flight_date = request.args.get('FlightDate')
 flight_num = request.args.get('FlightNum')
 
 flight = client.agile_data_science.on_time_performance.find_one({
 'Carrier': carrier,
 'FlightDate': flight_date,
 'FlightNum': int(flight_num)
 })
 
 return json_util.dumps(flight)
 
 if __name__ == "__main__":
 app.run(debug=True)
  • 51. Data Syndrome: Agile Data Science 2.0 Putting Records on the Web Plumbing our master records through to the web 51
  • 52. Data Syndrome: Agile Data Science 2.0 Putting Records on the Web Plumbing our master records through to the web 52
  • 53. Agile Data Science 2.0 53 Getting to know your data Tables and Charts
  • 54. Data Syndrome: Agile Data Science 2.0 Tables in PySpark Back end development in PySpark 54 # Load the parquet file
 on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
 
 # Use SQL to look at the total flights by month across 2015
 on_time_dataframe.registerTempTable("on_time_dataframe")
 total_flights_by_month = spark.sql(
 """SELECT Month, Year, COUNT(*) AS total_flights
 FROM on_time_dataframe
 GROUP BY Year, Month
 ORDER BY Year, Month"""
 )
 
 # This map/asDict trick makes the rows print a little prettier. It is optional.
 flights_chart_data = total_flights_by_month.map(lambda row: row.asDict())
 flights_chart_data.collect()
 
 # Save chart to MongoDB
 import pymongo_spark
 pymongo_spark.activate()
 flights_chart_data.saveToMongoDB(
 'mongodb://localhost:27017/agile_data_science.flights_by_month'
 )
  • 55. Data Syndrome: Agile Data Science 2.0 Tables in Flask and Jinja2 Front end development in Flask: controller and template 55 # Controller: Fetch a flight table
 @app.route("/total_flights")
 def total_flights():
 total_flights = client.agile_data_science.flights_by_month.find({}, 
 sort = [
 ('Year', 1),
 ('Month', 1)
 ])
 return render_template('total_flights.html', total_flights=total_flights) {% extends "layout.html" %}
 {% block body %}
 <div>
 <p class="lead">Total Flights by Month</p>
 <table class="table table-condensed table-striped" style="width: 200px;">
 <thead>
 <th>Month</th>
 <th>Total Flights</th>
 </thead>
 <tbody>
 {% for month in total_flights %}
 <tr>
 <td>{{month.Month}}</td>
 <td>{{month.total_flights}}</td>
 </tr>
 {% endfor %}
 </tbody>
 </table>
 </div>
 {% endblock %}
  • 56. Data Syndrome: Agile Data Science 2.0 Tables Visualizing data 56
  • 57. Data Syndrome: Agile Data Science 2.0 Charts in Flask and d3.js Visualizing data with JSON and d3.js 57 # Serve the chart's data via an asynchronous request (formerly known as 'AJAX')
 @app.route("/total_flights.json")
 def total_flights_json():
 total_flights = client.agile_data_science.flights_by_month.find({}, 
 sort = [
 ('Year', 1),
 ('Month', 1)
 ])
 return json_util.dumps(total_flights, ensure_ascii=False) var width = 960,
 height = 350;
 
 var y = d3.scale.linear()
 .range([height, 0]);
 // We define the domain once we get our data in d3.json, below
 
 var chart = d3.select(".chart")
 .attr("width", width)
 .attr("height", height);
 
 d3.json("/total_flights.json", function(data) {
 
 var defaultColor = 'steelblue';
 var modeColor = '#4CA9F5';
 
 var maxY = d3.max(data, function(d) { return d.total_flights; });
 y.domain([0, maxY]);
 
 var varColor = function(d, i) {
 if(d['total_flights'] == maxY) { return modeColor; }
 else { return defaultColor; }
 }
 var barWidth = width / data.length;
 var bar = chart.selectAll("g")
 .data(data)
 .enter()
 .append("g")
 .attr("transform", function(d, i) { return "translate(" + i * barWidth + ",0)"; });
 
 bar.append("rect")
 .attr("y", function(d) { return y(d.total_flights); })
 .attr("height", function(d) { return height - y(d.total_flights); })
 .attr("width", barWidth - 1)
 .style("fill", varColor);
 
 bar.append("text")
 .attr("x", barWidth / 2)
 .attr("y", function(d) { return y(d.total_flights) + 3; })
 .attr("dy", ".75em")
 .text(function(d) { return d.total_flights; });
 });
  • 58. Data Syndrome: Agile Data Science 2.0 Charts Visualizing data 58
  • 59. Data Syndrome: Agile Data Science 2.0 Charts in Flask and d3.js Visualizing data with JSON and d3.js 59 # Serve the chart's data via an asynchronous request (formerly known as 'AJAX')
 @app.route("/total_flights.json")
 def total_flights_json():
 total_flights = client.agile_data_science.flights_by_month.find({}, 
 sort = [
 ('Year', 1),
 ('Month', 1)
 ])
 return json_util.dumps(total_flights, ensure_ascii=False) var width = 960, height = 350; var y = d3.scale.linear() .range([height, 0]); // We define the domain once we get our data in d3.json, below var chart = d3.select(".chart") .attr("width", width) .attr("height", height); d3.json("/total_flights.json", function(data) { y.domain([0, d3.max(data, function(d) { return d.total_flights; })]); var barWidth = width / data.length; var bar = chart.selectAll("g") .data(data) .enter() .append("g") .attr("transform", function(d, i) { return "translate(" + i * barWidth + ",0)"; }); bar.append("rect") .attr("y", function(d) { return y(d.total_flights); }) .attr("height", function(d) { return height - y(d.total_flights); }) .attr("width", barWidth - 1); bar.append("text") .attr("x", barWidth / 2) .attr("y", function(d) { return y(d.total_flights) + 3; }) .attr("dy", ".75em") .text(function(d) { return d.total_flights; }); });
  • 60. Data Syndrome: Agile Data Science 2.0 Charts Visualizing data 60
  • 61. Data Syndrome: Agile Data Science 2.0 Charts in Flask and d3.js Visualizing data with JSON and d3.js 61 # Serve the chart's data via an asynchronous request (formerly known as 'AJAX')
 @app.route("/total_flights.json")
 def total_flights_json():
 total_flights = client.agile_data_science.flights_by_month.find({}, 
 sort = [
 ('Year', 1),
 ('Month', 1)
 ])
 return json_util.dumps(total_flights, ensure_ascii=False) var width = 960,
 height = 350;
 
 var y = d3.scale.linear()
 .range([height, 0]);
 // We define the domain once we get our data in d3.json, below
 
 var chart = d3.select(".chart")
 .attr("width", width)
 .attr("height", height);
 
 d3.json("/total_flights.json", function(data) {
 
 var defaultColor = 'steelblue';
 var modeColor = '#4CA9F5';
 
 var maxY = d3.max(data, function(d) { return d.total_flights; });
 y.domain([0, maxY]);
 
 var varColor = function(d, i) {
 if(d['total_flights'] == maxY) { return modeColor; }
 else { return defaultColor; }
 }
 var barWidth = width / data.length;
 var bar = chart.selectAll("g")
 .data(data)
 .enter()
 .append("g")
 .attr("transform", function(d, i) { return "translate(" + i * barWidth + ",0)"; });
 
 bar.append("rect")
 .attr("y", function(d) { return y(d.total_flights); })
 .attr("height", function(d) { return height - y(d.total_flights); })
 .attr("width", barWidth - 1)
 .style("fill", varColor);
 
 bar.append("text")
 .attr("x", barWidth / 2)
 .attr("y", function(d) { return y(d.total_flights) + 3; })
 .attr("dy", ".75em")
 .text(function(d) { return d.total_flights; });
 });
  • 62. Data Syndrome: Agile Data Science 2.0 Charts Visualizing data 62
  • 63. Agile Data Science 2.0 63 Exploring your data through interaction Reports
  • 64. Data Syndrome: Agile Data Science 2.0 Creating Interactive Ontologies from Semi-Structured Data Extracting and visualizing entities 64
  • 65. Data Syndrome: Agile Data Science 2.0 Home Page Extracting and decorating entities 65
  • 66. Data Syndrome: Agile Data Science 2.0 Airline Entity @ /airline/DL Extracting and decorating entities 66
  • 67. Data Syndrome: Agile Data Science 2.0 Summarizing all Airline Entities @ /airplanes Describing entities in aggregate 67
  • 68. Data Syndrome: Agile Data Science 2.0 Summarizing Airlines 2.0 Describing entities in aggregate 68
  • 69. Data Syndrome: Agile Data Science 2.0 Summarizing Airlines 3.0 Describing entities in aggregate 69
  • 70. Data Syndrome: Agile Data Science 2.0 Summarizing Airlines 3.0 - Entity Resolution Problem Describing entities in aggregate 70 ... AIRBUS AIRBUS INDUSTRIE ... EMBRAER EMBRAER S A ... GULFSTREAM AEROSPACE GULFSTREAM AEROSPACE CORP ... MCDONNELL DOUGLAS MCDONNELL DOUGLAS AIRCRAFT CO MCDONNELL DOUGLAS CORPORATION ...
  • 71. Data Syndrome: Agile Data Science 2.0 Summarizing Airlines 3.0 - Entity Resolution Hack Describing entities in aggregate 71 # Detect the longest common beginning string in a pair of strings def longest_common_beginning(s1, s2): if s1 == s2: return s1 min_length = min(len(s1), len(s2)) i = 0 while i < min_length: if s1[i] == s2[i]: i += 1 else: break return s1[0:i] # Compare two manufacturers, returning a tuple describing the result def compare_manufacturers(mfrs): mfr1 = mfrs[0] mfr2 = mfrs[1] lcb = longest_common_beginning(mfr1, mfr2) len_lcb = len(lcb) record = { 'mfr1': mfr1, 'mfr2': mfr2, 'lcb': lcb, 'len_lcb': len_lcb, 'eq': mfr1 == mfr2 } return record # Pair every unique instance of Manufacturer field with every # other for comparison comparison_pairs = manufacturer_variety.join(manufacturer_variety) # Do the comparisons comparisons = comparison_pairs.rdd.map(compare_manufacturers) # Matches have > 5 starting chars in common matches = comparisons.filter(lambda f: f['eq'] == False and f['len_lcb'] > 5) # # Now we create a mapping of duplicate keys from their raw value # to the one we're going to use # # 1) Group the matches by the longest common beginning ('lcb') common_lcbs = matches.groupBy(lambda x: x['lcb']) # 2) Emit the raw value for each side of the match along with the key, our 'lcb' mfr1_map = common_lcbs.map( lambda x: [(y['mfr1'], x[0]) for y in x[1]]).flatMap(lambda x: x) mfr2_map = common_lcbs.map( lambda x: [(y['mfr2'], x[0]) for y in x[1]]).flatMap(lambda x: x) # 3) Combine the two sides of the comparison's records map_with_dupes = mfr1_map.union(mfr2_map) # 4) Remove duplicates mfr_dedupe_mapping = map_with_dupes.distinct() # 5) Convert mapping to dataframe to join to airplanes dataframe mapping_dataframe = mfr_dedupe_mapping.toDF() # 6) Give the mapping column names mapping_dataframe.registerTempTable("mapping_dataframe") mapping_dataframe = spark.sql( "SELECT _1 AS Raw, _2 AS NewManufacturer FROM mapping_dataframe" )
  • 72. Data Syndrome: Agile Data Science 2.0 Summarizing Airlines 4.0 Describing entities in aggregate 72
  • 73. Data Syndrome: Agile Data Science 2.0 Searching Flights Using ElasticSearch to search for flights 73
  • 74. Agile Data Science 2.0 74 Predicting the future for fun and profit Predictions
  • 75. Data Syndrome: Agile Data Science 2.0 Back End Design Deep Storage and Spark vs Kafka and Spark Streaming 75 / Batch Realtime Historical Data Train Model Apply Model Realtime Data
  • 76. Data Syndrome: Agile Data Science 2.0 76 jQuery in the web client submits a form to create the prediction request, and then polls another url every few seconds until the prediction is ready. The request generates a Kafka event, which a Spark Streaming worker processes by applying the model we trained in batch. Having done so, it inserts a record for the prediction in MongoDB, where the Flask app sends it to the web client the next time it polls the server Front End Design /flights/delays/predict/classify_realtime/
  • 77. Data Syndrome: Agile Data Science 2.0 User Interface Where the user submits prediction requests 77
  • 78. Data Syndrome: Agile Data Science 2.0 String Vectorization From properties of items to vector format 78
  • 79. Data Syndrome: Agile Data Science 2.0 79 scikit-learn was 166. Spark MLlib is very powerful! http://bit.ly/train_model_spark 190 Line Model # !/usr/bin/env python
 
 import sys, os, re
 
 # Pass date and base path to main() from airflow
 def main(base_path):
 
 # Default to "."
 try: base_path
 except NameError: base_path = "."
 if not base_path:
 base_path = "."
 
 APP_NAME = "train_spark_mllib_model.py"
 
 # If there is no SparkSession, create the environment
 try:
 sc and spark
 except NameError as e:
 import findspark
 findspark.init()
 import pyspark
 import pyspark.sql
 
 sc = pyspark.SparkContext()
 spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
 
 #
 # {
 # "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
 # "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
 # "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
 # }
 #
 from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
 from pyspark.sql.types import StructType, StructField
 from pyspark.sql.functions import udf
 
 schema = StructType([
 StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
 StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
 StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
 StructField("Carrier", StringType(), True), # "Carrier":"WN"
 StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
 StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
 StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
 StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
 StructField("Dest", StringType(), True), # "Dest":"SAN"
 StructField("Distance", DoubleType(), True), # "Distance":368.0
 StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
 StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
 StructField("Origin", StringType(), True), # "Origin":"TUS"
 ])
 
 input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(
 base_path
 )
 features = spark.read.json(input_path, schema=schema)
 features.first()
 
 #
 # Check for nulls in features before using Spark ML
 #
 null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
 cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
 print(list(cols_with_nulls))
 
 #
 # Add a Route variable to replace FlightNum
 #
 from pyspark.sql.functions import lit, concat
 features_with_route = features.withColumn(
 'Route',
 concat(
 features.Origin,
 lit('-'),
 features.Dest
 )
 )
 features_with_route.show(6)
 
 #
 # Use pysmark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
 #
 from pyspark.ml.feature import Bucketizer
 
 # Setup the Bucketizer
 splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
 arrival_bucketizer = Bucketizer(
 splits=splits,
 inputCol="ArrDelay",
 outputCol="ArrDelayBucket"
 )
 
 # Save the bucketizer
 arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
 arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
 
 # Apply the bucketizer
 ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
 ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
 
 #
 # Extract features tools in with pyspark.ml.feature
 #
 from pyspark.ml.feature import StringIndexer, VectorAssembler
 
 # Turn category fields into indexes
 for column in ["Carrier", "Origin", "Dest", "Route"]:
 string_indexer = StringIndexer(
 inputCol=column,
 outputCol=column + "_index"
 )
 
 string_indexer_model = string_indexer.fit(ml_bucketized_features)
 ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
 
 # Drop the original column
 ml_bucketized_features = ml_bucketized_features.drop(column)
 
 # Save the pipeline model
 string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
 base_path,
 column
 )
 string_indexer_model.write().overwrite().save(string_indexer_output_path)
 
 # Combine continuous, numeric fields with indexes of nominal ones
 # ...into one feature vector
 numeric_columns = [
 "DepDelay", "Distance",
 "DayOfMonth", "DayOfWeek",
 "DayOfYear"]
 index_columns = ["Carrier_index", "Origin_index",
 "Dest_index", "Route_index"]
 vector_assembler = VectorAssembler(
 inputCols=numeric_columns + index_columns,
 outputCol="Features_vec"
 )
 final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
 
 # Save the numeric vector assembler
 vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
 vector_assembler.write().overwrite().save(vector_assembler_path)
 
 # Drop the index columns
 for column in index_columns:
 final_vectorized_features = final_vectorized_features.drop(column)
 
 # Inspect the finalized features
 final_vectorized_features.show()
 
 # Instantiate and fit random forest classifier on all the data
 from pyspark.ml.classification import RandomForestClassifier
 rfc = RandomForestClassifier(
 featuresCol="Features_vec",
 labelCol="ArrDelayBucket",
 predictionCol="Prediction",
 maxBins=4657,
 maxMemoryInMB=1024
 )
 model = rfc.fit(final_vectorized_features)
 
 # Save the new model over the old one
 model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
 base_path
 )
 model.write().overwrite().save(model_output_path)
 
 # Evaluate model using test data
 predictions = model.transform(final_vectorized_features)
 
 from pyspark.ml.evaluation import MulticlassClassificationEvaluator
 evaluator = MulticlassClassificationEvaluator(
 predictionCol="Prediction",
 labelCol="ArrDelayBucket",
 metricName="accuracy"
 )
 accuracy = evaluator.evaluate(predictions)
 print("Accuracy = {}".format(accuracy))
 
 # Check the distribution of predictions
 predictions.groupBy("Prediction").count().show()
 
 # Check a sample
 predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
 
 if __name__ == "__main__":
 main(sys.argv[1])

  • 80. Data Syndrome: Agile Data Science 2.0 Initializing the Environment Setting up the environment… 80 # !/usr/bin/env python
 
 import sys, os, re
 
 # Pass date and base path to main() from airflow
 def main(base_path):
 
 # Default to "."
 try: base_path
 except NameError: base_path = "."
 if not base_path:
 base_path = "."
 
 APP_NAME = "train_spark_mllib_model.py"
 
 # If there is no SparkSession, create the environment
 try:
 sc and spark
 except NameError as e:
 import findspark
 findspark.init()
 import pyspark
 import pyspark.sql
 
 sc = pyspark.SparkContext()
 spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
  • 81. Data Syndrome: Agile Data Science 2.0 Loading the Training Data Using DataFrames to load structured data… 81 from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
 from pyspark.sql.types import StructType, StructField
 from pyspark.sql.functions import udf
 
 schema = StructType([
 StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
 StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
 StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
 StructField("Carrier", StringType(), True), # "Carrier":"WN"
 StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
 StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
 StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
 StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
 StructField("Dest", StringType(), True), # "Dest":"SAN"
 StructField("Distance", DoubleType(), True), # "Distance":368.0
 StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
 StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
 StructField("Origin", StringType(), True), # "Origin":"TUS"
 ])
 
 input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(
 base_path
 )
 features = spark.read.json(input_path, schema=schema)
 features.first()
  • 82. Data Syndrome: Agile Data Science 2.0 Checking for Nulls Checking the data for null values that would crash Spark MLlib 82 #
 # Check for nulls in features before using Spark ML
 #
 null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
 cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
 print(list(cols_with_nulls))
  • 83. Data Syndrome: Agile Data Science 2.0 Adding a Feature Using DataFrame.withColumn to add a Route feature to the data… 83 #
 # Add a Route variable to replace FlightNum
 #
 from pyspark.sql.functions import lit, concat
 features_with_route = features.withColumn(
 'Route',
 concat(
 features.Origin,
 lit('-'),
 features.Dest
 )
 )
 features_with_route.show(6)
  • 84. Data Syndrome: Agile Data Science 2.0 Bucketizing the Prediction Column Using Bucketizer to convert a continuous variable to a nominal one… 84 #
 # Use pysmark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
 #
 from pyspark.ml.feature import Bucketizer
 
 # Setup the Bucketizer
 splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
 arrival_bucketizer = Bucketizer(
 splits=splits,
 inputCol="ArrDelay",
 outputCol="ArrDelayBucket"
 )
 
 # Save the bucketizer
 arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
 arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
 
 # Apply the bucketizer
 ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
 ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
  • 85. Data Syndrome: Agile Data Science 2.0 StringIndexing the String Columns Using StringIndexer to convert nominal fields to numeric ones… 85 #
 # Extract features tools in with pyspark.ml.feature
 #
 from pyspark.ml.feature import StringIndexer, VectorAssembler
 
 # Turn category fields into indexes
 for column in ["Carrier", "Origin", "Dest", "Route"]:
 string_indexer = StringIndexer(
 inputCol=column,
 outputCol=column + "_index"
 )
 
 string_indexer_model = string_indexer.fit(ml_bucketized_features)
 ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
 
 # Drop the original column
 ml_bucketized_features = ml_bucketized_features.drop(column)
 
 # Save the pipeline model
 string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
 base_path,
 column
 )
 string_indexer_model.write().overwrite().save(string_indexer_output_path)
  • 86. Data Syndrome: Agile Data Science 2.0 Vectorizing the Numeric Columns Combining the numeric fields with VectorAssembler… 86 # Combine continuous, numeric fields with indexes of nominal ones
 # ...into one feature vector
 numeric_columns = [
 "DepDelay", "Distance",
 "DayOfMonth", "DayOfWeek",
 "DayOfYear"]
 index_columns = ["Carrier_index", "Origin_index",
 "Dest_index", "Route_index"]
 vector_assembler = VectorAssembler(
 inputCols=numeric_columns + index_columns,
 outputCol="Features_vec"
 )
 final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
 
 # Save the numeric vector assembler
 vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
 vector_assembler.write().overwrite().save(vector_assembler_path)
 
 # Drop the index columns
 for column in index_columns:
 final_vectorized_features = final_vectorized_features.drop(column)
 
 # Inspect the finalized features
 final_vectorized_features.show()
  • 87. Data Syndrome: Agile Data Science 2.0 Training the Classifier Model Creating and training a RandomForestClassifier model 87 # Instantiate and fit random forest classifier on all the data
 from pyspark.ml.classification import RandomForestClassifier
 rfc = RandomForestClassifier(
 featuresCol="Features_vec",
 labelCol="ArrDelayBucket",
 predictionCol="Prediction",
 maxBins=4657,
 maxMemoryInMB=1024
 )
 model = rfc.fit(final_vectorized_features)
 
 # Save the new model over the old one
 model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
 base_path
 )
 model.write().overwrite().save(model_output_path)
  • 88. Data Syndrome: Agile Data Science 2.0 Evaluating the Classifier Model Using MultiClassificationEvaluator to check the accuracy of the model… 88 # Evaluate model using test data
 predictions = model.transform(final_vectorized_features)
 
 from pyspark.ml.evaluation import MulticlassClassificationEvaluator
 evaluator = MulticlassClassificationEvaluator(
 predictionCol="Prediction",
 labelCol="ArrDelayBucket",
 metricName="accuracy"
 )
 accuracy = evaluator.evaluate(predictions)
 print("Accuracy = {}".format(accuracy))
 
 # Check the distribution of predictions
 predictions.groupBy("Prediction").count().show()
 
 # Check a sample
 predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
  • 89. Data Syndrome: Agile Data Science 2.0 Running Main Just what it looks like… 89 if __name__ == "__main__":
 main(sys.argv[1])
  • 90. Data Syndrome: Agile Data Science 2.0 90 Using the model in realtime via Spark Streaming! Deploying the Model
  • 91. Data Syndrome: Agile Data Science 2.0 Deployment of Model Training to Airflow Using Airflow to batch schedule the training of our model 91 training_dag = DAG( 'agile_data_science_batch_prediction_model_training', default_args=default_args ) # We use the same two commands for all our PySpark tasks pyspark_bash_command = """ spark-submit --master {{ params.master }} {{ params.base_path }}/{{ params.filename }} {{ params.base_path }} """ pyspark_date_bash_command = """ spark-submit --master {{ params.master }} {{ params.base_path }}/{{ params.filename }} {{ ds }} {{ params.base_path }} """ # Gather the training data for our classifier extract_features_operator = BashOperator( task_id = "pyspark_extract_features", bash_command = pyspark_bash_command, params = { "master": "local[8]", "filename": "ch08/extract_features.py", "base_path": "{}/".format(PROJECT_HOME) }, dag=training_dag ) # Train and persist the classifier model train_classifier_model_operator = BashOperator( task_id = "pyspark_train_classifier_model", bash_command = pyspark_bash_command, params = { "master": "local[8]", "filename": "ch08/train_spark_mllib_model.py", "base_path": "{}/".format(PROJECT_HOME) }, dag=training_dag ) # The model training depends on the feature extraction train_classifier_model_operator.set_upstream(extract_features_operator)
  • 92. Data Syndrome: Agile Data Science 2.0 Deployment of Model Training to Airflow Using Airflow to batch schedule the training of our model 92
  • 93. Data Syndrome: Agile Data Science 2.0 Loading the Models Loading the models we trained in batch to reproduce the data pipeline 93 # ch08/make_predictions_streaming.py # Load the arrival delay bucketizer
 from pyspark.ml.feature import Bucketizer
 arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
 arrival_bucketizer = Bucketizer.load(arrival_bucketizer_path)
 
 # Load all the string field vectorizer pipelines into a dict
 from pyspark.ml.feature import StringIndexerModel
 
 string_indexer_models = {}
 for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
 "Origin", "Dest", "Route"]:
 string_indexer_model_path = "{}/models/string_indexer_model_{}.bin".format(
 base_path,
 column
 )
 string_indexer_model = StringIndexerModel.load(string_indexer_model_path)
 string_indexer_models[column] = string_indexer_model
  • 94. Data Syndrome: Agile Data Science 2.0 Loading the Models Loading the models we trained in batch to reproduce the data pipeline 94 # ch08/make_predictions_streaming.py # Load the numeric vector assembler
 from pyspark.ml.feature import VectorAssembler
 vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
 vector_assembler = VectorAssembler.load(vector_assembler_path)
 
 # Load the classifier model
 from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
 random_forest_model_path = "{}/models/spark_random_forest_classifier.flight_delays. 5.0.bin".format(
 base_path
 )
 rfc = RandomForestClassificationModel.load(
 random_forest_model_path
  • 95. Data Syndrome: Agile Data Science 2.0 Connecting to Kafka Creating a direct stream to the Kafka queue containing our prediction requests 95 #
 # Process Prediction Requests in Streaming
 # from pyspark.streaming.kafka import KafkaUtils
 
 stream = KafkaUtils.createDirectStream(
 ssc,
 [PREDICTION_TOPIC],
 {
 "metadata.broker.list": BROKERS,
 "group.id": "0",
 }
 )
 
 object_stream = stream.map(lambda x: json.loads(x[1]))
 object_stream.pprint()
  • 96. Data Syndrome: Agile Data Science 2.0 Repeating the Pipeline Running the prediction requests through the same data flow as the training data 96 row_stream = object_stream.map(
 lambda x: Row(
 FlightDate=iso8601.parse_date(x['FlightDate']),
 Origin=x['Origin'],
 Distance=x['Distance'],
 DayOfMonth=x['DayOfMonth'],
 DayOfYear=x['DayOfYear'],
 UUID=x['UUID'],
 DepDelay=x['DepDelay'],
 DayOfWeek=x['DayOfWeek'],
 FlightNum=x['FlightNum'],
 Dest=x['Dest'],
 Timestamp=iso8601.parse_date(x['Timestamp']),
 Carrier=x['Carrier']
 )
 )
 row_stream.pprint() # Do the classification and store to Mongo
 row_stream.foreachRDD(classify_prediction_requests)
 
 ssc.start()
 ssc.awaitTermination()
  • 97. Data Syndrome: Agile Data Science 2.0 Repeating the Pipeline Running the prediction requests through the same data flow as the training data 97 def classify_prediction_requests(rdd):
 
 from pyspark.sql.types import StringType, IntegerType, DoubleType, DateType, TimestampType
 from pyspark.sql.types import StructType, StructField
 
 prediction_request_schema = StructType([
 StructField("Carrier", StringType(), True),
 StructField("DayOfMonth", IntegerType(), True),
 StructField("DayOfWeek", IntegerType(), True),
 StructField("DayOfYear", IntegerType(), True),
 StructField("DepDelay", DoubleType(), True),
 StructField("Dest", StringType(), True),
 StructField("Distance", DoubleType(), True),
 StructField("FlightDate", DateType(), True),
 StructField("FlightNum", StringType(), True),
 StructField("Origin", StringType(), True),
 StructField("Timestamp", TimestampType(), True),
 StructField("UUID", StringType(), True),
 ])
 
 prediction_requests_df = spark.createDataFrame(rdd, schema=prediction_request_schema)
 prediction_requests_df.show() from pyspark.sql.functions import lit, concat
 prediction_requests_with_route = prediction_requests_df.withColumn(
 'Route',
 concat(
 prediction_requests_df.Origin,
 lit('-'),
 prediction_requests_df.Dest
 )
 )
 prediction_requests_with_route.show(6) ...
  • 98. Data Syndrome: Agile Data Science 2.0 Repeating the Pipeline Running the prediction requests through the same data flow as the training data 98 for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
 "Origin", "Dest", "Route"]:
 string_indexer_model = string_indexer_models[column]
 prediction_requests_with_route = string_indexer_model.transform(prediction_requests_with_route)
 
 # Vectorize numeric columns: DepDelay, Distance and index columns
 final_vectorized_features = vector_assembler.transform(prediction_requests_with_route)
 
 # Inspect the vectors
 final_vectorized_features.show()
 
 # Drop the individual index columns
 index_columns = ["Carrier_index", "DayOfMonth_index", "DayOfWeek_index", "DayOfYear_index",
 "Origin_index", "Dest_index", "Route_index"]
 for column in index_columns:
 final_vectorized_features = final_vectorized_features.drop(column)
 
 # Inspect the finalized features
 final_vectorized_features.show()
 
 # Make the prediction
 predictions = rfc.transform(final_vectorized_features) # Drop the features vector and prediction metadata to give the original fields
 predictions = predictions.drop("Features_vec")
 final_predictions = predictions.drop("indices").drop("values").drop("rawPrediction").drop("probability")
 
 # Inspect the output
 final_predictions.show()
  • 99. Data Syndrome: Agile Data Science 2.0 Storing to Mongo Putting the result where our web application can access it 99 # Store to Mongo
 if final_predictions.count() > 0:
 final_predictions.rdd.map(lambda x: x.asDict()).saveToMongoDB(
 "mongodb://localhost:27017/agile_data_science.flight_delay_classification_response"
 )
  • 100. Data Syndrome: Agile Data Science 2.0 Final Result of Deployed Prediction The bang for the buck! 100
  • 101. Data Syndrome: Agile Data Science 2.0 101 Experimental setup for iteratively improving the predictive model Improving the Model
  • 102. Data Syndrome: Agile Data Science 2.0 Experiment Setup Necessary to improve model 102
  • 103. Data Syndrome: Agile Data Science 2.0 103 155 additional lines to setup an experiment and add 3 new features to improvement the model http://bit.ly/improved_model_spark 345 L.O.C. # !/usr/bin/env python
 
 import sys, os, re
 import json
 import datetime, iso8601
 from tabulate import tabulate
 
 # Pass date and base path to main() from airflow
 def main(base_path):
 APP_NAME = "train_spark_mllib_model.py"
 
 # If there is no SparkSession, create the environment
 try:
 sc and spark
 except NameError as e:
 import findspark
 findspark.init()
 import pyspark
 import pyspark.sql
 
 sc = pyspark.SparkContext()
 spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
 
 #
 # {
 # "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
 # "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
 # "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
 # }
 #
 from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
 from pyspark.sql.types import StructType, StructField
 from pyspark.sql.functions import udf
 
 schema = StructType([
 StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
 StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
 StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
 StructField("Carrier", StringType(), True), # "Carrier":"WN"
 StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
 StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
 StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
 StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
 StructField("Dest", StringType(), True), # "Dest":"SAN"
 StructField("Distance", DoubleType(), True), # "Distance":368.0
 StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
 StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
 StructField("Origin", StringType(), True), # "Origin":"TUS"
 ])
 
 input_path = "{}/data/simple_flight_delay_features.json".format(
 base_path
 )
 features = spark.read.json(input_path, schema=schema)
 features.first()
 
 #
 # Add a Route variable to replace FlightNum
 #
 from pyspark.sql.functions import lit, concat
 features_with_route = features.withColumn(
 'Route',
 concat(
 features.Origin,
 lit('-'),
 features.Dest
 )
 )
 features_with_route.show(6)
 
 #
 # Add the hour of day of scheduled arrival/departure
 #
 from pyspark.sql.functions import hour
 features_with_hour = features_with_route.withColumn(
 "CRSDepHourOfDay",
 hour(features.CRSDepTime)
 )
 features_with_hour = features_with_hour.withColumn(
 "CRSArrHourOfDay",
 hour(features.CRSArrTime)
 )
 features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show()
 
 #
 # Use pysmark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
 #
 from pyspark.ml.feature import Bucketizer
 
 # Setup the Bucketizer
 splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
 arrival_bucketizer = Bucketizer(
 splits=splits,
 inputCol="ArrDelay",
 outputCol="ArrDelayBucket"
 )
 
 # Save the model
 arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
 arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
 
 # Apply the model
 ml_bucketized_features = arrival_bucketizer.transform(features_with_hour)
 ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
 
 #
 # Extract features tools in with pyspark.ml.feature
 #
 from pyspark.ml.feature import StringIndexer, VectorAssembler
 
 # Turn category fields into indexes
 for column in ["Carrier", "Origin", "Dest", "Route"]:
 string_indexer = StringIndexer(
 inputCol=column,
 outputCol=column + "_index"
 )
 
 string_indexer_model = string_indexer.fit(ml_bucketized_features)
 ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
 # Save the pipeline model
 string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format(
 base_path,
 column
 )
 string_indexer_model.write().overwrite().save(string_indexer_output_path)
 
 # Combine continuous, numeric fields with indexes of nominal ones
 # ...into one feature vector
 numeric_columns = [
 "DepDelay", "Distance",
 "DayOfMonth", "DayOfWeek",
 "DayOfYear", "CRSDepHourOfDay",
 "CRSArrHourOfDay"]
 index_columns = ["Carrier_index", "Origin_index",
 "Dest_index", "Route_index"]
 vector_assembler = VectorAssembler(
 inputCols=numeric_columns + index_columns,
 outputCol="Features_vec"
 )
 final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
 
 # Save the numeric vector assembler
 vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path)
 vector_assembler.write().overwrite().save(vector_assembler_path)
 
 # Drop the index columns
 for column in index_columns:
 final_vectorized_features = final_vectorized_features.drop(column)
 
 # Inspect the finalized features
 final_vectorized_features.show()
 
 #
 # Cross validate, train and evaluate classifier: loop 5 times for 4 metrics
 #
 
 from collections import defaultdict
 scores = defaultdict(list)
 feature_importances = defaultdict(list)
 metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
 split_count = 3
 
 for i in range(1, split_count + 1):
 print("nRun {} out of {} of test/train splits in cross validation...".format(
 i,
 split_count,
 )
 )
 
 # Test/train split
 training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])
 
 # Instantiate and fit random forest classifier on all the data
 from pyspark.ml.classification import RandomForestClassifier
 rfc = RandomForestClassifier(
 featuresCol="Features_vec",
 labelCol="ArrDelayBucket",
 predictionCol="Prediction",
 maxBins=4657,
 )
 model = rfc.fit(training_data)
 
 # Save the new model over the old one
 model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(
 base_path
 )
 model.write().overwrite().save(model_output_path)
 
 # Evaluate model using test data
 predictions = model.transform(test_data)
 
 # Evaluate this split's results for each metric
 from pyspark.ml.evaluation import MulticlassClassificationEvaluator
 for metric_name in metric_names:
 evaluator = MulticlassClassificationEvaluator(
 labelCol="ArrDelayBucket",
 predictionCol="Prediction",
 metricName=metric_name
 )
 score = evaluator.evaluate(predictions)
 
 scores[metric_name].append(score)
 print("{} = {}".format(metric_name, score))
 
 #
 # Collect feature importances
 #
 feature_names = vector_assembler.getInputCols()
 feature_importance_list = model.featureImportances
 for feature_name, feature_importance in zip(feature_names, feature_importance_list):
 feature_importances[feature_name].append(feature_importance)
 
 #
 # Evaluate average and STD of each metric and print a table
 #
 import numpy as np
 score_averages = defaultdict(float)
 
 # Compute the table data
 average_stds = [] # ha
 for metric_name in metric_names:
 metric_scores = scores[metric_name]
 
 average_accuracy = sum(metric_scores) / len(metric_scores)
 score_averages[metric_name] = average_accuracy
 
 std_accuracy = np.std(metric_scores)
 
 average_stds.append((metric_name, average_accuracy, std_accuracy))
 
 # Print the table
 print("nExperiment Log")
 print("--------------")
 print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))
 
 #
 # Persist the score to a sccore log that exists between runs
 #
 import pickle # Load the score log or initialize an empty one
 try:
 score_log_filename = "{}/models/score_log.pickle".format(base_path)
 score_log = pickle.load(open(score_log_filename, "rb"))
 if not isinstance(score_log, list):
 score_log = []
 except IOError:
 score_log = []
 
 # Compute the existing score log entry
 score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}
 
 # Compute and display the change in score for each metric
 try:
 last_log = score_log[-1]
 except (IndexError, TypeError, AttributeError):
 last_log = score_log_entry
 
 experiment_report = []
 for metric_name in metric_names:
 run_delta = score_log_entry[metric_name] - last_log[metric_name]
 experiment_report.append((metric_name, run_delta))
 
 print("nExperiment Report")
 print("-----------------")
 print(tabulate(experiment_report, headers=["Metric", "Score"]))
 
 # Append the existing average scores to the log
 score_log.append(score_log_entry)
 
 # Persist the log for next run
 pickle.dump(score_log, open(score_log_filename, "wb"))
 
 #
 # Analyze and report feature importance changes
 #
 
 # Compute averages for each feature
 feature_importance_entry = defaultdict(float)
 for feature_name, value_list in feature_importances.items():
 average_importance = sum(value_list) / len(value_list)
 feature_importance_entry[feature_name] = average_importance
 
 # Sort the feature importances in descending order and print
 import operator
 sorted_feature_importances = sorted(
 feature_importance_entry.items(),
 key=operator.itemgetter(1),
 reverse=True
 )
 
 print("nFeature Importances")
 print("-------------------")
 print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))
 
 #
 # Compare this run's feature importances with the previous run's
 #
 
 # Load the feature importance log or initialize an empty one
 try:
 feature_log_filename = "{}/models/feature_log.pickle".format(base_path)
 feature_log = pickle.load(open(feature_log_filename, "rb"))
 if not isinstance(feature_log, list):
 feature_log = []
 except IOError:
 feature_log = []
 
 # Compute and display the change in score for each feature
 try:
 last_feature_log = feature_log[-1]
 except (IndexError, TypeError, AttributeError):
 last_feature_log = defaultdict(float)
 for feature_name, importance in feature_importance_entry.items():
 last_feature_log[feature_name] = importance
 
 # Compute the deltas
 feature_deltas = {}
 for feature_name in feature_importances.keys():
 run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]
 feature_deltas[feature_name] = run_delta
 
 # Sort feature deltas, biggest change first
 import operator
 sorted_feature_deltas = sorted(
 feature_deltas.items(),
 key=operator.itemgetter(1),
 reverse=True
 )
 
 # Display sorted feature deltas
 print("nFeature Importance Delta Report")
 print("-------------------------------")
 print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))
 
 # Append the existing average deltas to the log
 feature_log.append(feature_importance_entry)
 
 # Persist the log for next run
 pickle.dump(feature_log, open(feature_log_filename, "wb"))
 
 if __name__ == "__main__":
 main(sys.argv[1])

  • 104. Data Syndrome: Agile Data Science 2.0 Creating an Experiment Cross validate, train and evaluate classifier 104 from collections import defaultdict scores = defaultdict(list) feature_importances = defaultdict(list) metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"] split_count = 3 for i in range(1, split_count + 1): print("nRun {} out of {} of test/train splits in cross validation...".format( i, split_count, ) ) # Test/train split training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2]) # Instantiate and fit random forest classifier on all the data from pyspark.ml.classification import RandomForestClassifier rfc = RandomForestClassifier( featuresCol="Features_vec", labelCol="ArrDelayBucket", predictionCol="Prediction", maxBins=4657, ) model = rfc.fit(training_data)
  • 105. Data Syndrome: Agile Data Science 2.0 Creating an Experiment Cross validate, train and evaluate classifier 105 # Save the new model over the old one model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format( base_path ) model.write().overwrite().save(model_output_path) # Evaluate model using test data predictions = model.transform(test_data) # Evaluate this split's results for each metric from pyspark.ml.evaluation import MulticlassClassificationEvaluator for metric_name in metric_names: evaluator = MulticlassClassificationEvaluator( labelCol="ArrDelayBucket", predictionCol="Prediction", metricName=metric_name ) score = evaluator.evaluate(predictions) scores[metric_name].append(score) print("{} = {}".format(metric_name, score))
  • 106. Data Syndrome: Agile Data Science 2.0 Creating an Experiment Cross validate, train and evaluate classifier 106 # # Collect feature importances # feature_names = vector_assembler.getInputCols() feature_importance_list = model.featureImportances for feature_name, feature_importance in zip(feature_names, feature_importance_list): feature_importances[feature_name].append(feature_importance) # # Evaluate average and STD of each metric and print a table # import numpy as np score_averages = defaultdict(float) # Compute the table data average_stds = [] # ha for metric_name in metric_names: metric_scores = scores[metric_name] average_accuracy = sum(metric_scores) / len(metric_scores) score_averages[metric_name] = average_accuracy std_accuracy = np.std(metric_scores) average_stds.append((metric_name, average_accuracy, std_accuracy))
  • 107. Data Syndrome: Agile Data Science 2.0 Creating an Experiment Cross validate, train and evaluate classifier 107 # Print the table print("nExperiment Log") print("--------------") print(tabulate(average_stds, headers=["Metric", "Average", "STD"])) # # Persist the score to a sccore log that exists between runs # import pickle # Load the score log or initialize an empty one try: score_log_filename = "{}/models/score_log.pickle".format(base_path) score_log = pickle.load(open(score_log_filename, "rb")) if not isinstance(score_log, list): score_log = [] except IOError: score_log = [] # Compute the existing score log entry score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}
  • 108. Data Syndrome: Agile Data Science 2.0 Creating an Experiment Cross validate, train and evaluate classifier 108 # Compute and display the change in score for each metric try: last_log = score_log[-1] except (IndexError, TypeError, AttributeError): last_log = score_log_entry experiment_report = [] for metric_name in metric_names: run_delta = score_log_entry[metric_name] - last_log[metric_name] experiment_report.append((metric_name, run_delta)) print("nExperiment Report") print("-----------------") print(tabulate(experiment_report, headers=["Metric", "Score"])) # Append the existing average scores to the log score_log.append(score_log_entry) # Persist the log for next run pickle.dump(score_log, open(score_log_filename, "wb"))
  • 109. Data Syndrome: Agile Data Science 2.0 Creating an Experiment Cross validate, train and evaluate classifier 109 # # Analyze and report feature importance changes # # Compute averages for each feature feature_importance_entry = defaultdict(float) for feature_name, value_list in feature_importances.items(): average_importance = sum(value_list) / len(value_list) feature_importance_entry[feature_name] = average_importance # Sort the feature importances in descending order and print import operator sorted_feature_importances = sorted( feature_importance_entry.items(), key=operator.itemgetter(1), reverse=True ) print("nFeature Importances") print("-------------------") print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))
  • 110. Data Syndrome: Agile Data Science 2.0 Creating an Experiment Cross validate, train and evaluate classifier 110 # # Compare this run's feature importances with the previous run's # # Load the feature importance log or initialize an empty one try: feature_log_filename = "{}/models/feature_log.pickle".format(base_path) feature_log = pickle.load(open(feature_log_filename, "rb")) if not isinstance(feature_log, list): feature_log = [] except IOError: feature_log = [] # Compute and display the change in score for each feature try: last_feature_log = feature_log[-1] except (IndexError, TypeError, AttributeError): last_feature_log = defaultdict(float) for feature_name, importance in feature_importance_entry.items(): last_feature_log[feature_name] = importance
  • 111. Data Syndrome: Agile Data Science 2.0 Creating an Experiment Cross validate, train and evaluate classifier 111 # Compute the deltas feature_deltas = {} for feature_name in feature_importances.keys(): run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name] feature_deltas[feature_name] = run_delta # Sort feature deltas, biggest change first import operator sorted_feature_deltas = sorted( feature_deltas.items(), key=operator.itemgetter(1), reverse=True ) # Display sorted feature deltas print("nFeature Importance Delta Report") print("-------------------------------") print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"])) # Append the existing average deltas to the log feature_log.append(feature_importance_entry) # Persist the log for next run pickle.dump(feature_log, open(feature_log_filename, "wb"))
  • 112. Data Syndrome: Agile Data Science 2.0 Running an Experiment Cross validate, train and evaluate classifier 112 Experiment Log -------------- Metric Average STD ----------------- --------- ----------- accuracy 0.594443 0.000382382 weightedPrecision 0.642419 0.00352101 weightedRecall 0.594443 0.000382382 f1 0.522397 0.000438121
  • 113. Data Syndrome: Agile Data Science 2.0 Comparing Experiments Cross validate, train and evaluate classifier 113 Experiment Report ----------------- Metric Score ----------------- ----------- accuracy 0.00300548 weightedPrecision -0.00592227 weightedRecall 0.00300548 f1 -0.0105553
  • 114. Data Syndrome: Agile Data Science 2.0 Feature Importances Cross validate, train and evaluate classifier 114 Feature Importances ------------------- Name Importance --------------- ------------ DepDelay 0.882216 Route_index 0.0571401 Origin_index 0.0142741 Distance 0.0134583 Dest_index 0.00745796 DayOfYear 0.00544761 Carrier_index 0.00454088 DayOfMonth 9.31109e-05 DayOfWeek 5.2597e-05
  • 115. Data Syndrome: Agile Data Science 2.0 Comparing Feature Importances Cross validate, train and evaluate classifier 115 Feature Importance Delta Report ------------------------------- Feature Delta --------------- ------------ CRSArrHourOfDay 0.00982592 DepDelay 0.00671702 CRSDepHourOfDay 0.0012717 DayOfWeek -2.72792e-05 DayOfMonth -8.81635e-05 Origin_index -0.00152788 Distance -0.00188322 Route_index -0.00214609 Dest_index -0.0032327 DayOfYear -0.00377184 Carrier_index -0.00513747
  • 116. Data Syndrome: Agile Data Science 2.0 116 Next steps for learning more about Agile Data Science 2.0 Next Steps
  • 117. Building Full-Stack Data Analytics Applications with Spark http://bit.ly/agile_data_science Available Now on O’Reilly Safari: http://bit.ly/agile_data_safari Agile Data Science 2.0
  • 118. Agile Data Science 2.0 118 Realtime Predictive Analytics Rapidly learn to build entire predictive systems driven by Kafka, PySpark, Speak Streaming, Spark MLlib and with a web front-end using Python/Flask and JQuery. Available for purchase at http://datasyndrome.com/video
  • 119. Data Syndrome Russell Jurney Principal Consultant Email : rjurney@datasyndrome.com Web : datasyndrome.com Data Syndrome, LLC Product Consulting We build analytics products and systems consisting of big data viz, predictions, recommendations, reports and search. Corporate Training We offer training courses for data scientists and engineers and data science teams, Video Training We offer video training courses that rapidly acclimate you with a technology and technique.