Agile Data Science 2.0 (O'Reilly, 2017) defines a methodology and a software stack with which to apply its methods. *The methodology* seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. *The stack* is one example that meets the requirements: thoroughly scalable and efficient for application developers and data engineers alike. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Airflow (incubating), MongoDB, Elasticsearch, Apache Parquet, Python/Flask, and jQuery. This talk covers the full lifecycle of big data application development and shows how to use lessons from agile software engineering to apply data science with this stack and build better analytics applications. The system starts with plumbing, moves on to data tables, charts, and search, works through interactive reports, and builds toward predictions in both batch and realtime (defining the role of each), then covers the deployment of predictive systems and how to iteratively improve predictions that prove valuable.
Agile Data Science 2.0
1. Building Full Stack Data Analytics Applications with Kafka and Spark
Agile Data Science 2.0
http://www.slideshare.net/rjurney/agile-data-science-20
or
http://bit.ly/agile_data_slides
2. Agile Data Science 2.0
Russell Jurney
2
Skills: Data Engineer 85%, Data Scientist 85%, Visualization Software Engineer 85%, Writer 85%, Teacher 50%
Russell Jurney is a veteran data scientist and thought leader. He coined the term Agile Data Science in the book of that name from O'Reilly in 2012, which outlines the first agile development methodology for data science. Russell has constructed numerous full-stack analytics products over the past ten years and now works with clients helping them extract value from their data assets.
Russell Jurney
Principal Consultant at Data Syndrome
Russell Jurney
Data Syndrome, LLC
Email : russell.jurney@gmail.com
Web : datasyndrome.com
Principal Consultant
3. Building Full-Stack Data Analytics Applications with Spark
http://bit.ly/agile_data_science
Agile Data Science 2.0
4. Agile Data Science 2.0 4
Realtime Predictive
Analytics
Rapidly learn to build entire predictive systems driven by
Kafka, PySpark, Spark Streaming, and Spark MLlib, with a web
front-end using Python/Flask and jQuery.
Available for purchase at http://datasyndrome.com/video
5. Product Consulting
We build analytics products and systems
consisting of big data viz, predictions,
recommendations, reports and search.
Corporate Training
We offer training courses for data scientists, data
engineers, and data science teams.
Video Training
We offer video training courses that rapidly
acclimate you to a technology and technique.
6. Agile Data Science 2.0 6
What makes data science "agile data science"?
Theory
7. Agile Data Science 2.0 7
Yes. Building applications is a fundamental skill for today's data scientist.
Data Products or Data Science?
9. Agile Data Science 2.0 9
If someone else has to start over and rebuild it, it ain't agile.
Big Data or Data Science?
10. Agile Data Science 2.0 10
Goal of
Methodology
The goal of agile data science in <140 characters: to
document and guide exploratory data analysis to
discover and follow the critical path to a compelling
product.
11. Agile Data Science 2.0 11
In analytics, the end-goal moves or is complex in nature, so we model as a
network of tasks rather than as a strictly linear process.
Critical Path
12. Agile Data Science 2.0
Agile Data Science Manifesto
12
Seven Principles for Agile Data Science
1. Iterate, iterate, iterate: tables, charts, reports, predictions.
2. Ship intermediate output. Even failed experiments have output.
3. Prototype experiments over implementing tasks.
4. Integrate the tyrannical opinion of data in product management.
5. Climb up and down the data-value pyramid as we work.
6. Discover and pursue the critical path to a killer product.
7. Get meta. Describe the process, not just the end-state.
13. Agile Data Science 2.0 13
People will pay more for the things towards the top, but you need the things
on the bottom to have the things above. They are foundational. See:
Maslow's Theory of Needs.
Data Value Pyramid
15. Agile Data Science 2.0
Agile Data Science 2.0 Stack
15
Apache Spark Apache Kafka MongoDB
Batch and Realtime
Realtime Queue Document Store
Flask
Simple Web App
Example of a high productivity stack for "big" data applications
ElasticSearch
Search
16. Agile Data Science 2.0
Flow of Data Processing
16
Tools and processes in collecting, refining, publishing and decorating data
{"hello": "world"}
18. Agile Data Science 2.0 18
SQL or dataflow programming?
Programming Models
19. Agile Data Science 2.0 19
Describing what you want and letting the planner figure out how
SQL
SELECT associations2.object_id, associations2.term_id,
       associations2.cat_ID, associations2.term_taxonomy_id
FROM (SELECT objects_tags.object_id, objects_tags.term_id,
             wp_cb_tags2cats.cat_ID, categories.term_taxonomy_id
  FROM (SELECT wp_term_relationships.object_id,
               wp_term_taxonomy.term_id,
               wp_term_taxonomy.term_taxonomy_id
    FROM wp_term_relationships
    LEFT JOIN wp_term_taxonomy ON
      wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
    ORDER BY object_id ASC, term_id ASC)
    AS objects_tags
  LEFT JOIN wp_cb_tags2cats ON objects_tags.term_id = wp_cb_tags2cats.tag_ID
  LEFT JOIN (SELECT wp_term_relationships.object_id,
                    wp_term_taxonomy.term_id AS cat_ID,
                    wp_term_taxonomy.term_taxonomy_id
    FROM wp_term_relationships
    LEFT JOIN wp_term_taxonomy ON
      wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
    WHERE wp_term_taxonomy.taxonomy = 'category'
    GROUP BY object_id, cat_ID, term_taxonomy_id
    ORDER BY object_id, cat_ID, term_taxonomy_id)
    AS categories ON wp_cb_tags2cats.cat_ID = categories.term_id
  WHERE objects_tags.term_id = wp_cb_tags2cats.tag_ID
  GROUP BY object_id, term_id, cat_ID, term_taxonomy_id
  ORDER BY object_id ASC, term_id ASC, cat_ID ASC)
  AS associations2
LEFT JOIN categories ON associations2.object_id = categories.object_id
WHERE associations2.cat_ID <> categories.cat_ID
GROUP BY object_id, term_id, cat_ID, term_taxonomy_id
ORDER BY object_id, term_id, cat_ID, term_taxonomy_id
20. Agile Data Science 2.0 20
Flowing data through operations to effect change
Dataflow Programming
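For contrast with the SQL above, a minimal dataflow-style sketch in PySpark (assuming the spark session and on_time_performance Parquet file used in the later examples) might look like this:

# A minimal dataflow-style sketch in PySpark (assumes `spark` and the
# on_time_performance Parquet file from the surrounding examples).
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

# Data flows through a chain of operations: filter, group, aggregate, sort.
top_late_routes = (
  on_time_dataframe
    .filter(on_time_dataframe.ArrDelayMinutes > 0)  # keep late arrivals
    .groupBy("Origin", "Dest")                      # group by route
    .count()                                        # count late flights per route
    .orderBy("count", ascending=False)              # worst routes first
)
top_late_routes.show(10)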
21. Agile Data Science 2.0 21
The best of both worlds!
SQL AND
Dataflow
Programming
# Flights that were late arriving...
late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrDelayMinutes > 0)
total_late_arrivals = late_arrivals.count()

# Flights that left late but made up time to arrive on time...
on_time_heros = on_time_dataframe.filter(
  (on_time_dataframe.DepDelayMinutes > 0)
  &
  (on_time_dataframe.ArrDelayMinutes <= 0)
)
total_on_time_heros = on_time_heros.count()

# Get the percentage of flights that are late, rounded to 1 decimal place
pct_late = round((total_late_arrivals / (total_flights * 1.0)) * 100, 1)

# total_flights and total_late_departures come from earlier cells, e.g.:
# total_flights = on_time_dataframe.count()
# total_late_departures = on_time_dataframe.filter(on_time_dataframe.DepDelayMinutes > 0).count()
print("Total flights: {:,}".format(total_flights))
print("Late departures: {:,}".format(total_late_departures))
print("Late arrivals: {:,}".format(total_late_arrivals))
print("Recoveries: {:,}".format(total_on_time_heros))
print("Percentage Late: {}%".format(pct_late))

# Why are flights late? Let's look at some delayed flights and the delay causes
late_flights = spark.sql("""
SELECT
  ArrDelayMinutes,
  WeatherDelay,
  CarrierDelay,
  NASDelay,
  SecurityDelay,
  LateAircraftDelay
FROM
  on_time_performance
WHERE
  WeatherDelay IS NOT NULL
  OR CarrierDelay IS NOT NULL
  OR NASDelay IS NOT NULL
  OR SecurityDelay IS NOT NULL
  OR LateAircraftDelay IS NOT NULL
ORDER BY
  FlightDate
""")
late_flights.sample(False, 0.01).show()
# Calculate the percentage contribution to delay for each source
total_delays = spark.sql("""
SELECT
  ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_weather_delay,
  ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_carrier_delay,
  ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay,
  ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_security_delay,
  ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_late_aircraft_delay
FROM on_time_performance
""")
total_delays.show()

# Generate a histogram of the weather and carrier delays
weather_delay_histogram = on_time_dataframe\
  .select("WeatherDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram(10)

print("{}\n{}".format(weather_delay_histogram[0], weather_delay_histogram[1]))

# Eyeball the first to define our buckets
weather_delay_histogram = on_time_dataframe\
  .select("WeatherDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram([1, 15, 30, 60, 120, 240, 480, 720, 24*60.0])
print(weather_delay_histogram)

# Transform the data into something easily consumed by d3
record = {'key': 1, 'data': []}
for label, count in zip(weather_delay_histogram[0], weather_delay_histogram[1]):
  record['data'].append(
    {
      'label': label,
      'count': count
    }
  )

# Save to Mongo directly, since this is a dict, not a DataFrame or RDD
from pymongo import MongoClient
client = MongoClient()
client.relato.weather_delay_histogram.insert_one(record)
23. Data Syndrome: Agile Data Science 2.0
Collect and Serialize Events in JSON
I never regret using JSON
23
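A minimal sketch of what collecting events as JSON can look like, assuming newline-delimited JSON ("JSON Lines") files like the feature files used later; the path and fields here are illustrative:

import json, codecs

# Illustrative only: serialize one event per line as JSON ("JSON Lines"),
# the format the later feature files (e.g. *.jsonl) assume.
events = [
  {"Carrier": "WN", "FlightDate": "2015-12-31", "FlightNum": "6109", "DepDelay": 14.0},
  {"Carrier": "AA", "FlightDate": "2015-12-31", "FlightNum": "1021", "DepDelay": -3.0},
]

with codecs.open("data/example_events.jsonl", "w", "utf-8") as f:  # hypothetical path
  for event in events:
    f.write(json.dumps(event) + "\n")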
24. Data Syndrome: Agile Data Science 2.0
FAA On-Time Performance Records
95% of commercial flights
24
http://www.transtats.bts.gov/Fields.asp?table_id=236
25. Data Syndrome: Agile Data Science 2.0
FAA On-Time Performance Records
95% of commercial flights
25
"Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum",
"OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips",
"OriginStateName","OriginWac","DestAirportID","DestAirportSeqID","DestCityMarketID","Dest","DestCityName","DestState",
"DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups",
"DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15",
"ArrivalDelayGroups","ArrTimeBlk","Cancelled","CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime",
"Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay",
"FirstDepTime","TotalAddGTime","LongestAddGTime","DivAirportLandings","DivReachedDest","DivActualElapsedTime","DivArrDelay",
"DivDistance","Div1Airport","Div1AirportID","Div1AirportSeqID","Div1WheelsOn","Div1TotalGTime","Div1LongestGTime",
"Div1WheelsOff","Div1TailNum","Div2Airport","Div2AirportID","Div2AirportSeqID","Div2WheelsOn","Div2TotalGTime",
"Div2LongestGTime","Div2WheelsOff","Div2TailNum","Div3Airport","Div3AirportID","Div3AirportSeqID","Div3WheelsOn",
"Div3TotalGTime","Div3LongestGTime","Div3WheelsOff","Div3TailNum","Div4Airport","Div4AirportID","Div4AirportSeqID",
"Div4WheelsOn","Div4TotalGTime","Div4LongestGTime","Div4WheelsOff","Div4TailNum","Div5Airport","Div5AirportID",
"Div5AirportSeqID","Div5WheelsOn","Div5TotalGTime","Div5LongestGTime","Div5WheelsOff","Div5TailNum"
26. Data Syndrome: Agile Data Science 2.0
openflights.org Database
Airports, Airlines, Routes
26
27. Data Syndrome: Agile Data Science 2.0
Scraping the FAA Registry
Airplane Data by Tail Number
27
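A rough sketch of how such a scrape might be structured with requests and BeautifulSoup; the URL, query parameter, and HTML selectors below are placeholders, not the registry's actual layout:

import requests
from bs4 import BeautifulSoup

# Rough sketch of scraping one tail number's page; URL, parameter and
# HTML structure here are placeholders, not the registry's real layout.
REGISTRY_URL = "https://registry.faa.gov/aircraftinquiry/"  # placeholder

def fetch_tail_number(tail_number):
  response = requests.get(REGISTRY_URL, params={"NNumbertxt": tail_number})  # parameter name is an assumption
  soup = BeautifulSoup(response.text, "html.parser")
  # Pull whatever label/value pairs the page exposes, e.g. manufacturer and model
  record = {"TailNum": tail_number}
  for row in soup.select("table tr"):  # selector is an assumption
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == 2:
      record[cells[0]] = cells[1]
  return record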
28. Data Syndrome: Agile Data Science 2.0
Wikipedia Airlines Entries
Descriptions of Airlines
28
29. Data Syndrome: Agile Data Science 2.0
National Centers for Environmental Information
Historical Weather Observations
29
30. Agile Data Science 2.0 30
Working our way up the data value pyramid
Climbing the Stack
31. Agile Data Science 2.0 31
Starting by âplumbingâ the system from end to end
Plumbing
32. Data Syndrome: Agile Data Science 2.0
Publishing Flight Records
Plumbing our master records through to the web
32
33. Data Syndrome: Agile Data Science 2.0
Publishing Flight Records to MongoDB
Plumbing our master records through to the web
33
import pymongo
import pymongo_spark
# Important: activate pymongo_spark.
pymongo_spark.activate()

# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

# Convert to RDD of dicts and save to MongoDB
as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
as_dict.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
34. Data Syndrome: Agile Data Science 2.0
Publishing Flight Records to ElasticSearch
Plumbing our master records through to the web
34
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

# Save the DataFrame to Elasticsearch
on_time_dataframe.write.format("org.elasticsearch.spark.sql")\
  .option("es.resource", "agile_data_science/on_time_performance")\
  .option("es.batch.size.entries", "100")\
  .mode("overwrite")\
  .save()
35. Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
35
from flask import Flask, render_template, request
from pymongo import MongoClient
from bson import json_util

# Set up Flask and Mongo
app = Flask(__name__)
client = MongoClient()

# Controller: Fetch a flight record and display it
@app.route("/on_time_performance")
def on_time_performance():

  carrier = request.args.get('Carrier')
  flight_date = request.args.get('FlightDate')
  flight_num = request.args.get('FlightNum')

  flight = client.agile_data_science.on_time_performance.find_one({
    'Carrier': carrier,
    'FlightDate': flight_date,
    'FlightNum': int(flight_num)
  })

  return json_util.dumps(flight)

if __name__ == "__main__":
  app.run(debug=True)
36. Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
36
37. Data Syndrome: Agile Data Science 2.0
Putting Records on the Web
Plumbing our master records through to the web
37
39. Data Syndrome: Agile Data Science 2.0
Tables in PySpark
Back end development in PySpark
39
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

# Use SQL to look at the total flights by month across 2015
on_time_dataframe.registerTempTable("on_time_dataframe")
total_flights_by_month = spark.sql(
  """SELECT Month, Year, COUNT(*) AS total_flights
  FROM on_time_dataframe
  GROUP BY Year, Month
  ORDER BY Year, Month"""
)

# This rdd/map/asDict trick makes the rows print a little prettier. It is optional.
flights_chart_data = total_flights_by_month.rdd.map(lambda row: row.asDict())
flights_chart_data.collect()

# Save chart to MongoDB
import pymongo_spark
pymongo_spark.activate()
flights_chart_data.saveToMongoDB(
  'mongodb://localhost:27017/agile_data_science.flights_by_month'
)
40. Data Syndrome: Agile Data Science 2.0
Tables in Flask and Jinja2
Front end development in Flask: controller and template
40
# Controller: Fetch a flight table
@app.route("/total_flights")
def total_flights():
  total_flights = client.agile_data_science.flights_by_month.find({},
    sort = [
      ('Year', 1),
      ('Month', 1)
    ])
  return render_template('total_flights.html', total_flights=total_flights)
{% extends "layout.html" %}
{% block body %}
  <div>
    <p class="lead">Total Flights by Month</p>
    <table class="table table-condensed table-striped" style="width: 200px;">
      <thead>
        <th>Month</th>
        <th>Total Flights</th>
      </thead>
      <tbody>
        {% for month in total_flights %}
        <tr>
          <td>{{month.Month}}</td>
          <td>{{month.total_flights}}</td>
        </tr>
        {% endfor %}
      </tbody>
    </table>
  </div>
{% endblock %}
44. Agile Data Science 2.0 44
Exploring your data through interaction
Reports
45. Data Syndrome: Agile Data Science 2.0
Creating Interactive Ontologies from Semi-Structured Data
Extracting and visualizing entities
45
46. Data Syndrome: Agile Data Science 2.0
Home Page
Extracting and decorating entities
46
47. Data Syndrome: Agile Data Science 2.0
Airline Entity
Extracting and decorating entities
47
48. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 1.0
Describing entities in aggregate
48
49. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 2.0
Describing entities in aggregate
49
50. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 3.0
Describing entities in aggregate
50
51. Data Syndrome: Agile Data Science 2.0
Summarizing Airlines 4.0
Describing entities in aggregate
51
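An illustrative sketch of describing entities in aggregate with Spark SQL, assuming an airplanes dataset keyed by carrier; the path and field names are assumptions:

# Illustrative sketch of "describing entities in aggregate": summarize each
# airline's fleet from an airplanes dataset; path and field names are assumptions.
airplanes = spark.read.json('data/airplanes.json')  # hypothetical path
airplanes.registerTempTable("airplanes")
fleet_summary = spark.sql("""
SELECT Carrier, Manufacturer, COUNT(*) AS total
FROM airplanes
GROUP BY Carrier, Manufacturer
ORDER BY Carrier, total DESC
""")
fleet_summary.show(10)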
52. Agile Data Science 2.0 52
Predicting the future for fun and profit
Predictions
53. Data Syndrome: Agile Data Science 2.0
Back End Design
Deep Storage and Spark vs Kafka and Spark Streaming
53
Batch: Historical Data → Train Model
Realtime: Realtime Data → Apply Model
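A rough sketch of the realtime half of this design: a Spark Streaming job that consumes prediction requests from Kafka and applies the model trained in batch. It assumes Spark 2.x with the spark-streaming-kafka package; the topic name, model path, and vectorization details are illustrative:

# Rough sketch of the realtime side, assuming Spark 2.x with the
# spark-streaming-kafka-0-8 package; topic name and model path are illustrative.
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.ml.classification import RandomForestClassificationModel
import json

ssc = StreamingContext(sc, 10)  # 10 second batches

# Load the model trained in batch
model = RandomForestClassificationModel.load(
  "models/spark_random_forest_classifier.flight_delays.5.0.bin"
)

# Consume prediction requests from Kafka
stream = KafkaUtils.createDirectStream(
  ssc,
  ["flight_delay_classification_request"],  # topic name is an assumption
  {"metadata.broker.list": "localhost:9092"}
)
requests = stream.map(lambda kv: json.loads(kv[1]))

def classify(rdd):
  if rdd.isEmpty():
    return
  df = spark.read.json(rdd)          # vectorization steps omitted for brevity
  predictions = model.transform(df)  # assumes a "Features_vec" column is present
  # ...write predictions back to MongoDB for the web tier to poll

requests.foreachRDD(classify)
ssc.start()
ssc.awaitTermination()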
54. Data Syndrome: Agile Data Science 2.0 54
jQuery in the web client submits a form to create the prediction request, then
polls another URL every few seconds until the prediction is ready. The
request generates a Kafka event, which a Spark Streaming worker processes
by applying the model we trained in batch. Having done so, it inserts a record
for the prediction in MongoDB, where the Flask app finds it and sends it to the
web client the next time it polls the server.
Front End Design
/flights/delays/predict/classify_realtime/
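A hedged sketch of the two Flask endpoints this flow implies, one that submits the request by emitting a Kafka event and one that the client polls; the Kafka topic, Mongo collection, and field names are illustrative rather than the book's exact code:

# Rough sketch of the two endpoints the flow above implies; topic, collection
# and field names are illustrative.
from flask import Flask, request, jsonify
from kafka import KafkaProducer
from pymongo import MongoClient
from bson import json_util
import json, uuid

app = Flask(__name__)
client = MongoClient()
producer = KafkaProducer(
  bootstrap_servers='localhost:9092',
  value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

@app.route("/flights/delays/predict/classify_realtime/", methods=['POST'])
def classify_realtime():
  # Emit a prediction request event to Kafka and hand back an id to poll on
  prediction_request = dict(request.form.items())
  prediction_request['UUID'] = str(uuid.uuid4())
  producer.send('flight_delay_classification_request', prediction_request)
  return jsonify({"status": "OK", "id": prediction_request['UUID']})

@app.route("/flights/delays/predict/classify_realtime/response/<unique_id>")
def classify_realtime_response(unique_id):
  # The client polls this until the Spark Streaming worker has written a result
  prediction = client.agile_data_science.flight_delay_classification_response\
    .find_one({"UUID": unique_id})
  if prediction is None:
    return json_util.dumps({"status": "WAIT", "id": unique_id})
  return json_util.dumps({"status": "OK", "prediction": prediction})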
55. Data Syndrome: Agile Data Science 2.0
String Vectorization
From properties of items to vector format
55
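The full training script below shows this in context; as a focused sketch, string vectorization in Spark ML is a StringIndexer (string to index) followed by a VectorAssembler (columns to a single vector), assuming the features_with_route DataFrame built in that script:

# Focused sketch of string vectorization with Spark ML: StringIndexer maps a
# string column to a numeric index, VectorAssembler packs columns into one vector.
from pyspark.ml.feature import StringIndexer, VectorAssembler

indexer = StringIndexer(inputCol="Carrier", outputCol="Carrier_index")
indexed = indexer.fit(features_with_route).transform(features_with_route)

assembler = VectorAssembler(
  inputCols=["DepDelay", "Distance", "Carrier_index"],
  outputCol="Features_vec"
)
vectorized = assembler.transform(indexed)
vectorized.select("Carrier", "Carrier_index", "Features_vec").show(5)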
56. Data Syndrome: Agile Data Science 2.0 56
The scikit-learn version was 166 lines. Spark MLlib is very powerful!
190 Line Model
#!/usr/bin/env python

import sys, os, re

# Pass date and base path to main() from airflow
def main(base_path):

  # Default to "."
  try: base_path
  except NameError: base_path = "."
  if not base_path:
    base_path = "."

  APP_NAME = "train_spark_mllib_model.py"

  # If there is no SparkSession, create the environment
  try:
    sc and spark
  except NameError as e:
    import findspark
    findspark.init()
    import pyspark
    import pyspark.sql

    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()

  #
  # {
  #   "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
  #   "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
  #   "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
  # }
  #
  from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
  from pyspark.sql.types import StructType, StructField
  from pyspark.sql.functions import udf

  schema = StructType([
    StructField("ArrDelay", DoubleType(), True),        # "ArrDelay":5.0
    StructField("CRSArrTime", TimestampType(), True),   # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
    StructField("CRSDepTime", TimestampType(), True),   # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
    StructField("Carrier", StringType(), True),         # "Carrier":"WN"
    StructField("DayOfMonth", IntegerType(), True),     # "DayOfMonth":31
    StructField("DayOfWeek", IntegerType(), True),      # "DayOfWeek":4
    StructField("DayOfYear", IntegerType(), True),      # "DayOfYear":365
    StructField("DepDelay", DoubleType(), True),        # "DepDelay":14.0
    StructField("Dest", StringType(), True),            # "Dest":"SAN"
    StructField("Distance", DoubleType(), True),        # "Distance":368.0
    StructField("FlightDate", DateType(), True),        # "FlightDate":"2015-12-30T16:00:00.000-08:00"
    StructField("FlightNum", StringType(), True),       # "FlightNum":"6109"
    StructField("Origin", StringType(), True),          # "Origin":"TUS"
  ])

  input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(
    base_path
  )
  features = spark.read.json(input_path, schema=schema)
  features.first()

  #
  # Check for nulls in features before using Spark ML
  #
  null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
  cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
  print(list(cols_with_nulls))

  #
  # Add a Route variable to replace FlightNum
  #
  from pyspark.sql.functions import lit, concat
  features_with_route = features.withColumn(
    'Route',
    concat(
      features.Origin,
      lit('-'),
      features.Dest
    )
  )
  features_with_route.show(6)

  #
  # Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
  #
  from pyspark.ml.feature import Bucketizer

  # Set up the Bucketizer
  splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
  arrival_bucketizer = Bucketizer(
    splits=splits,
    inputCol="ArrDelay",
    outputCol="ArrDelayBucket"
  )

  # Save the bucketizer
  arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
  arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)

  # Apply the bucketizer
  ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
  ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()

  #
  # Extract features with the tools in pyspark.ml.feature
  #
  from pyspark.ml.feature import StringIndexer, VectorAssembler

  # Turn category fields into indexes
  for column in ["Carrier", "Origin", "Dest", "Route"]:
    string_indexer = StringIndexer(
      inputCol=column,
      outputCol=column + "_index"
    )

    string_indexer_model = string_indexer.fit(ml_bucketized_features)
    ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)

    # Drop the original column
    ml_bucketized_features = ml_bucketized_features.drop(column)

    # Save the pipeline model
    string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
      base_path,
      column
    )
    string_indexer_model.write().overwrite().save(string_indexer_output_path)

  # Combine continuous, numeric fields with indexes of nominal ones
  # ...into one feature vector
  numeric_columns = [
    "DepDelay", "Distance",
    "DayOfMonth", "DayOfWeek",
    "DayOfYear"]
  index_columns = ["Carrier_index", "Origin_index",
                   "Dest_index", "Route_index"]
  vector_assembler = VectorAssembler(
    inputCols=numeric_columns + index_columns,
    outputCol="Features_vec"
  )
  final_vectorized_features = vector_assembler.transform(ml_bucketized_features)

  # Save the numeric vector assembler
  vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
  vector_assembler.write().overwrite().save(vector_assembler_path)

  # Drop the index columns
  for column in index_columns:
    final_vectorized_features = final_vectorized_features.drop(column)

  # Inspect the finalized features
  final_vectorized_features.show()

  # Instantiate and fit random forest classifier on all the data
  from pyspark.ml.classification import RandomForestClassifier
  rfc = RandomForestClassifier(
    featuresCol="Features_vec",
    labelCol="ArrDelayBucket",
    predictionCol="Prediction",
    maxBins=4657,
    maxMemoryInMB=1024
  )
  model = rfc.fit(final_vectorized_features)

  # Save the new model over the old one
  model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
    base_path
  )
  model.write().overwrite().save(model_output_path)

  # Evaluate model using test data
  predictions = model.transform(final_vectorized_features)

  from pyspark.ml.evaluation import MulticlassClassificationEvaluator
  evaluator = MulticlassClassificationEvaluator(
    predictionCol="Prediction",
    labelCol="ArrDelayBucket",
    metricName="accuracy"
  )
  accuracy = evaluator.evaluate(predictions)
  print("Accuracy = {}".format(accuracy))

  # Check the distribution of predictions
  predictions.groupBy("Prediction").count().show()

  # Check a sample
  predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)

if __name__ == "__main__":
  main(sys.argv[1])
57. Data Syndrome: Agile Data Science 2.0
Experiment Setup
Necessary to improve model
57
58. Data Syndrome: Agile Data Science 2.0 58
155 additional lines to set up an experiment and add improvements
345 L.O.C.
#!/usr/bin/env python

import sys, os, re
import json
import datetime, iso8601
from tabulate import tabulate

# Pass date and base path to main() from airflow
def main(base_path):
  APP_NAME = "train_spark_mllib_model.py"

  # If there is no SparkSession, create the environment
  try:
    sc and spark
  except NameError as e:
    import findspark
    findspark.init()
    import pyspark
    import pyspark.sql

    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()

  #
  # {
  #   "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
  #   "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
  #   "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
  # }
  #
  from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
  from pyspark.sql.types import StructType, StructField
  from pyspark.sql.functions import udf

  schema = StructType([
    StructField("ArrDelay", DoubleType(), True),        # "ArrDelay":5.0
    StructField("CRSArrTime", TimestampType(), True),   # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
    StructField("CRSDepTime", TimestampType(), True),   # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
    StructField("Carrier", StringType(), True),         # "Carrier":"WN"
    StructField("DayOfMonth", IntegerType(), True),     # "DayOfMonth":31
    StructField("DayOfWeek", IntegerType(), True),      # "DayOfWeek":4
    StructField("DayOfYear", IntegerType(), True),      # "DayOfYear":365
    StructField("DepDelay", DoubleType(), True),        # "DepDelay":14.0
    StructField("Dest", StringType(), True),            # "Dest":"SAN"
    StructField("Distance", DoubleType(), True),        # "Distance":368.0
    StructField("FlightDate", DateType(), True),        # "FlightDate":"2015-12-30T16:00:00.000-08:00"
    StructField("FlightNum", StringType(), True),       # "FlightNum":"6109"
    StructField("Origin", StringType(), True),          # "Origin":"TUS"
  ])

  input_path = "{}/data/simple_flight_delay_features.json".format(
    base_path
  )
  features = spark.read.json(input_path, schema=schema)
  features.first()

  #
  # Add a Route variable to replace FlightNum
  #
  from pyspark.sql.functions import lit, concat
  features_with_route = features.withColumn(
    'Route',
    concat(
      features.Origin,
      lit('-'),
      features.Dest
    )
  )
  features_with_route.show(6)

  #
  # Add the hour of day of scheduled arrival/departure
  #
  from pyspark.sql.functions import hour
  features_with_hour = features_with_route.withColumn(
    "CRSDepHourOfDay",
    hour(features.CRSDepTime)
  )
  features_with_hour = features_with_hour.withColumn(
    "CRSArrHourOfDay",
    hour(features.CRSArrTime)
  )
  features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show()

  #
  # Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
  #
  from pyspark.ml.feature import Bucketizer

  # Set up the Bucketizer
  splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
  arrival_bucketizer = Bucketizer(
    splits=splits,
    inputCol="ArrDelay",
    outputCol="ArrDelayBucket"
  )

  # Save the model
  arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
  arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)

  # Apply the model
  ml_bucketized_features = arrival_bucketizer.transform(features_with_hour)
  ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()

  #
  # Extract features with the tools in pyspark.ml.feature
  #
  from pyspark.ml.feature import StringIndexer, VectorAssembler

  # Turn category fields into indexes
  for column in ["Carrier", "Origin", "Dest", "Route"]:
    string_indexer = StringIndexer(
      inputCol=column,
      outputCol=column + "_index"
    )

    string_indexer_model = string_indexer.fit(ml_bucketized_features)
    ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)

    # Save the pipeline model
    string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format(
      base_path,
      column
    )
    string_indexer_model.write().overwrite().save(string_indexer_output_path)

  # Combine continuous, numeric fields with indexes of nominal ones
  # ...into one feature vector
  numeric_columns = [
    "DepDelay", "Distance",
    "DayOfMonth", "DayOfWeek",
    "DayOfYear", "CRSDepHourOfDay",
    "CRSArrHourOfDay"]
  index_columns = ["Carrier_index", "Origin_index",
                   "Dest_index", "Route_index"]
  vector_assembler = VectorAssembler(
    inputCols=numeric_columns + index_columns,
    outputCol="Features_vec"
  )
  final_vectorized_features = vector_assembler.transform(ml_bucketized_features)

  # Save the numeric vector assembler
  vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path)
  vector_assembler.write().overwrite().save(vector_assembler_path)

  # Drop the index columns
  for column in index_columns:
    final_vectorized_features = final_vectorized_features.drop(column)

  # Inspect the finalized features
  final_vectorized_features.show()

  #
  # Cross validate, train and evaluate classifier: loop 5 times for 4 metrics
  #
  from collections import defaultdict
  scores = defaultdict(list)
  feature_importances = defaultdict(list)
  metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
  split_count = 3

  for i in range(1, split_count + 1):
    print("\nRun {} out of {} of test/train splits in cross validation...".format(
      i,
      split_count,
    ))

    # Test/train split
    training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])

    # Instantiate and fit random forest classifier on all the data
    from pyspark.ml.classification import RandomForestClassifier
    rfc = RandomForestClassifier(
      featuresCol="Features_vec",
      labelCol="ArrDelayBucket",
      predictionCol="Prediction",
      maxBins=4657,
    )
    model = rfc.fit(training_data)

    # Save the new model over the old one
    model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(
      base_path
    )
    model.write().overwrite().save(model_output_path)

    # Evaluate model using test data
    predictions = model.transform(test_data)

    # Evaluate this split's results for each metric
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    for metric_name in metric_names:
      evaluator = MulticlassClassificationEvaluator(
        labelCol="ArrDelayBucket",
        predictionCol="Prediction",
        metricName=metric_name
      )
      score = evaluator.evaluate(predictions)

      scores[metric_name].append(score)
      print("{} = {}".format(metric_name, score))

    #
    # Collect feature importances
    #
    feature_names = vector_assembler.getInputCols()
    feature_importance_list = model.featureImportances
    for feature_name, feature_importance in zip(feature_names, feature_importance_list):
      feature_importances[feature_name].append(feature_importance)

  #
  # Evaluate average and STD of each metric and print a table
  #
  import numpy as np
  score_averages = defaultdict(float)

  # Compute the table data
  average_stds = []  # ha
  for metric_name in metric_names:
    metric_scores = scores[metric_name]

    average_accuracy = sum(metric_scores) / len(metric_scores)
    score_averages[metric_name] = average_accuracy

    std_accuracy = np.std(metric_scores)

    average_stds.append((metric_name, average_accuracy, std_accuracy))

  # Print the table
  print("\nExperiment Log")
  print("--------------")
  print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))

  #
  # Persist the score to a score log that exists between runs
  #
  import pickle

  # Load the score log or initialize an empty one
  try:
    score_log_filename = "{}/models/score_log.pickle".format(base_path)
    score_log = pickle.load(open(score_log_filename, "rb"))
    if not isinstance(score_log, list):
      score_log = []
  except IOError:
    score_log = []

  # Compute the existing score log entry
  score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}

  # Compute and display the change in score for each metric
  try:
    last_log = score_log[-1]
  except (IndexError, TypeError, AttributeError):
    last_log = score_log_entry

  experiment_report = []
  for metric_name in metric_names:
    run_delta = score_log_entry[metric_name] - last_log[metric_name]
    experiment_report.append((metric_name, run_delta))

  print("\nExperiment Report")
  print("-----------------")
  print(tabulate(experiment_report, headers=["Metric", "Score"]))

  # Append the existing average scores to the log
  score_log.append(score_log_entry)

  # Persist the log for next run
  pickle.dump(score_log, open(score_log_filename, "wb"))

  #
  # Analyze and report feature importance changes
  #

  # Compute averages for each feature
  feature_importance_entry = defaultdict(float)
  for feature_name, value_list in feature_importances.items():
    average_importance = sum(value_list) / len(value_list)
    feature_importance_entry[feature_name] = average_importance

  # Sort the feature importances in descending order and print
  import operator
  sorted_feature_importances = sorted(
    feature_importance_entry.items(),
    key=operator.itemgetter(1),
    reverse=True
  )

  print("\nFeature Importances")
  print("-------------------")
  print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))

  #
  # Compare this run's feature importances with the previous run's
  #

  # Load the feature importance log or initialize an empty one
  try:
    feature_log_filename = "{}/models/feature_log.pickle".format(base_path)
    feature_log = pickle.load(open(feature_log_filename, "rb"))
    if not isinstance(feature_log, list):
      feature_log = []
  except IOError:
    feature_log = []

  # Compute and display the change in score for each feature
  try:
    last_feature_log = feature_log[-1]
  except (IndexError, TypeError, AttributeError):
    last_feature_log = defaultdict(float)
    for feature_name, importance in feature_importance_entry.items():
      last_feature_log[feature_name] = importance

  # Compute the deltas
  feature_deltas = {}
  for feature_name in feature_importances.keys():
    run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]
    feature_deltas[feature_name] = run_delta

  # Sort feature deltas, biggest change first
  import operator
  sorted_feature_deltas = sorted(
    feature_deltas.items(),
    key=operator.itemgetter(1),
    reverse=True
  )

  # Display sorted feature deltas
  print("\nFeature Importance Delta Report")
  print("-------------------------------")
  print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))

  # Append the existing average deltas to the log
  feature_log.append(feature_importance_entry)

  # Persist the log for next run
  pickle.dump(feature_log, open(feature_log_filename, "wb"))

if __name__ == "__main__":
  main(sys.argv[1])
59. Data Syndrome Russell Jurney
Principal Consultant
Email : rjurney@datasyndrome.com
Web : datasyndrome.com
Data Syndrome, LLC