DIY ANALYTICS WITH
APACHE SPARK
ADAM ROBERTS
London, 22nd June 2017: originally presented at Geecon
Important disclaimers
Copyright © 2017 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written
permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial
publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS document is distributed "AS IS"
without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data,
business interruption, loss of profit or loss of opportunity. IBM products and services are warranted according to the terms and conditions of the agreements under which they
are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers
have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all
countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and
discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their
specific situation. It is the customer's responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification
and interpretation of any relevant laws and regulatory requirements that may affect the customer's business and any actions the customer may need to take to comply with such
laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
Information within this presentation is accurate to the best of the author's knowledge as of the 4th of June 2017
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or
other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or
the ability of any such third-party products to interoperate with IBM's products. IBM expressly disclaims all warranties, expressed
or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM
patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™,
Global Business Services®, Global Technology Services®, Information on Demand, ILOG, LinuxONE™, Maximo®,
MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™,
PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®,
Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System
z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other
product and service names might be trademarks of IBM or other companies. Oracle and Java are registered trademarks of
Oracle and/or its affiliates. Other names may be trademarks of their respective owners; a current list of IBM trademarks is
available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Apache Spark,
Apache Cassandra, Apache Hadoop, Apache Maven, Apache Kafka and any other Apache project mentioned here and the
Apache product logos including the Spark logo are trademarks of The Apache Software Foundation.
● Showing you how to get started from scratch: going from “I’ve heard about Spark” to “I can use it for...”
● Worked examples aplenty: lots of code
● Not intended to be scientifically accurate! Sharing ideas
● Useful reference material
● Slides will be hosted
Stick around for...
✔ Doing stuff yourself (within your timeframe and rules)
✔ Findings can be subject to bias: yours don’t have to be
✔ Trust the data instead
Motivation!
✔ Finding aliens with the SETI institute
✔ Genomics projects (GATK, Bluemix Genomics)
✔ IBM Watson services
Cool projects involving Spark
✔ Powerful machine(s)
✔ Apache Spark and a JDK
✔ Scala (recommended)
✔ Optional: visualisation library for Spark output e.g. Python with
  ✔ bokeh
  ✔ pandas
✔ Optional but not covered here: a notebook bundled with Spark like Zeppelin, or use Jupyter
Your DIY analytics toolkit
Toolbox from wikimedia: Tanemori derivative work: ‫י‬‫ק‬‫נ‬‫א‬'‫ג‬‫י‬‫ק‬‫יו‬
Why listen to me?
● Worked on Apache Spark since 2014
● Helping IBM customers use Spark for the first time
● Resolving problems, educating service teams
● Testing on lots of IBM platforms since Spark 1.2: x86, Power, Z systems, all Java 8 deliverables...
● Fixing bugs in Spark/Java: contributing code and helping others to do so
● Working with performance tuning pros
● Code provided here has an emphasis on readability!
● What is it (why the hype)?
● How to answer questions with Spark
● Core Spark functions (the “bread and butter” stuff), plotting, correlations, machine learning
● Built-in utility functions to make our lives easier (labels, features, handling nulls)
● Examples using data from wearables: two years of activity
What I'll be covering today
Ask me later if you're interested in...
● Spark on IBM hardware
● IBM SDK for Java specifics
● Notebooks
● Spark using GPUs/GPUs from Java
● Performance tuning
● Comparison with other projects
● War stories fixing Spark/Java bugs
● You know how to write Java or Scala
● You’ve heard about Spark but never used it
● You have something to process!
What I assume...
This talk won’t make you a superhero!
● Know more about Spark – what it can/can’t do
● Know more about machine learning in Spark
● Know that machine learning’s still hard but in different ways
But you will...
Open source project (the most active for big data) offering distributed...
● Machine learning
● Graph processing
● Core operations (map, reduce, joins)
● SQL syntax with DataFrames/Datasets
✔ Build it yourself from source (requiring Git, Maven, a JDK) or
✔ Download a community built binary or
✔ Download our free Spark development package (includes IBM's SDK for Java)
Things you can process...
● File formats you could use with Hadoop
● Anything there’s a Spark package for
● json, csv, parquet...
Things you can use with it...
● Kafka for streaming
● Hive tables
● Cassandra as a database
● Hadoop (using HDFS with Spark)
● DB2!
“What’s so good about it then?”
● Offers scalability and resiliency
● Auto-compression, fast serialisation, caching
● Python, R, Scala and Java APIs: eligible for Java based optimisations
● Distributed machine learning!
“Why isn’t everyone using it?”
● Can you get away with using spreadsheet software?
● Have you really got a large amount of data?
● Data preparation is very important! How will you properly handle negative, null, or otherwise strange values in your data?
● Will you benefit from massive concurrency?
● Is the data in a format you can work with?
● Needs transforming first (and is it worth it)?
Not every problem is a Spark one!
● Not really real-time streaming (“micro-batching”)
● Debugging in a largely distributed system with many moving parts can be tough
● Security: not really locked down out of the box (extra steps required by knowledgeable users: whole disk encryption or using other projects, SSL config to do...)
Implementation details...
Getting something up and running quickly
Run any Spark example in “local mode” first (from “spark”)
bin/run-example org.apache.spark.examples.SparkPi 100
Then run it on a cluster you can set up yourself:
Add hostnames in conf/slaves
sbin/start-all.sh
bin/run-example --master <your_master:7077> ...
Check for running Java processes: looking for workers/executors coming and going
Spark UI (default port 8080 on the master)
See: http://spark.apache.org/docs/latest/spark-standalone.html
lib is only with the IBM package
Running something simple
And you can use Spark's Java/Scala APIs with
bin/spark-shell (a REPL!)
bin/spark-submit
java/scala -cp “$SPARK_HOME/jars/*”
PySpark not covered in this presentation – but fun to
experiment with and lots of good docs online for you
Increasing the number of threads available for Spark processing in local mode (5.2 GB text file) – actually works?
--master local[1]
real 3m45.328s
--master local[4]
real 1m31.889s
time {
  echo "--master local[1]"
  $SPARK_HOME/bin/spark-submit --master local[1] --class MyClass WordCount.jar
}
time {
  echo "--master local[4]"
  $SPARK_HOME/bin/spark-submit --master local[4] --class MyClass WordCount.jar
}
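The deck never shows what is inside WordCount.jar; as a minimal sketch (the object name, jar name and input path are assumptions), a MyClass word count could look like this in Scala:

import org.apache.spark.sql.SparkSession

// Hypothetical contents of WordCount.jar – purely illustrative
object MyClass {
  def main(args: Array[String]): Unit = {
    // No master() here: it is supplied by --master on the spark-submit command line above
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val lines = spark.sparkContext.textFile("/path/to/big.txt") // placeholder path
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(20).foreach(println) // sample the results rather than collecting everything
    spark.stop()
  }
}

Package it with sbt or Maven and hand it to spark-submit exactly as in the timing example above.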
“Anything else good about Spark?”
● Resiliency by replication and lineage tracking
● Distribution of processing via (potentially many) workers that can spawn (potentially many) executors
● Caching! Keep data in memory, reuse later (see the sketch after this slide)
● Versatility and interoperability: APIs include Spark core, ML, DataFrames and Datasets, Streaming and GraphX...
● Read up on RDDs and ML material by Andrew Ng, Spark Summit videos, deep dives on Catalyst/Tungsten if you want to really get stuck in! This is a DIY talk
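A minimal sketch of the caching point above, assuming a CSV like the accidents data used later (the path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CacheDemo").master("local[*]").getOrCreate()
val df = spark.read.option("header", "true").csv("/path/to/road_accidents.csv") // placeholder path
df.cache()                                          // mark the DataFrame for in-memory reuse
println(df.count())                                 // first action materialises and caches it
println(df.filter(df("vehicle_type") === "9").count()) // later actions reuse the cached copy
df.unpersist()                                      // release the memory when finished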
Recap – we know what it is now... and want to do some analytics!
● Data I’ll process here is for educational purposes only: road_accidents.csv
● Kaggle is a good place to practice – lots of datasets available for you
● Data I'm using is licensed under the Open Government License for public sector information
"accident_index","vehicle_reference","vehicle_type","towing_and_articulation",
"vehicle_manoeuvre","vehicle_location”,restricted_lane","junction_location","skidding_and_overturning","hit_object_in_ca
rriageway","vehicle_leaving_carriageway","hit_object_off_carriageway","1st_point_of_impact","was_vehicle_left_hand_dri
ve?","journey_purpose_of_driver","sex_of_driver","age_of_driver","age_band_of_driver","engine_capacity_(cc)","propulsio
n_code","age_of_vehicle","driver_imd_decile","driver_home_area_type","vehicle_imd_decile","NUmber_of_Casualities_un
ique_to_accident_ind ex","No_of_Vehicles_involved_unique_to_accident_index","location_easting_osgr","location_north
ing_osgr","longitude","latitude","police_force","accident_severity","number_of_vehicles","number_of_casualties","date","da
y_of_week","time","local_authority_(district)","local_authority_(highway)","1st_road_class","1st_road_number","road_type",
"speed_limit","junction_detail","junction_control"," 2nd_road_class","2nd_road_number","pedestrian_crossing-
human_control","pedestrian_crossing-physical_facilities",
"light_conditions","weather_conditions","road_surface_conditions","special_conditions_at_site","carriageway_hazards","
urban_or_rural_area","did_police_officer_attend_scene_of_accident","lsoa_of_accident_location","casualty_reference","ca
sualty_class","sex_of_casualty","age_of_casualty","age_band_of_casualty","casualty_severity","pedestrian_location","pe
destrian_movement","car_passenger","bus_or_coach_passenger","pedestrian_road_maintenance_worker","casualty_type
","casualty_home_area_type","casualty_imd_decile"
Features of the data (“columns”)
"201506E098757",2,9,0,18,0,8,0,0,0,0,3,1,6,1,45,7,1794
,1,11,-1,1,-1,1,2,384980,394830,-
2.227629,53.450014,6,3,2,1,"42250",2,1899-12-30
12:56:00,102,"E08000003",5,0,6,30,3,4,6,0,0,0,1,1,1,0,
0,1,2,"E01005288",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA, NA,NA
"201506E098766",1,9,0,9,0,8,0,0,0,0,4,1,6,2,25,5,1582, 2,1,-1,-
1,-1,1,2,383870,394420,-
2.244322,53.446296,6,3,2,1,"14/03/2015",7,1899-12-30
15:55:00,102,"E08000003",3,5103,3,40,6,2,5,0,0,5,1,1,1
,0,0,1,1,"E01005178",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA, NA,NA,NA
Values (“rows”)
Spark way to figure this out?
groupBy* vehicle_type
sort** the results on count
vehicle_type maps to a code
First place: car
Distant second: pedal bike
Close third: van/goods HGV <= 3.5 T
Distant last: electric motorcycle
Type of vehicle involved in the most accidents?
Different column name this time, weather_conditions maps to a code again
First place: fine with no high winds
Second: raining, no high winds
Distant third: fine, with high winds
Distant last: snowing, high winds
groupBy* weather_conditions
sort** the results on count
weather_conditions maps to a code
What weather should I be avoiding?
First place: going ahead (!)
Distant second: turning right
Distant third: slowing or stopping
Last: reversing
Spark way...
groupBy* manoeuvre
sort** the results on count
manoeuvre maps to a code
Which manoeuvres should I be careful with?
“Why * and **?”
org.apache.spark functions that
can run in a distributed manner
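As a concrete sketch of that groupBy/sort pattern in Scala (assuming the allAccidents DataFrame created a couple of slides later):

import org.apache.spark.sql.functions.desc

// Count accidents per vehicle type and show the most frequent first
allAccidents.groupBy("vehicle_type")
  .count()
  .sort(desc("count"))
  .show(5)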
Spark code example – I'm using Scala
● Forced mutability consideration (val or var)
● Not mandatory to declare types (or “return ...”)
● Check out “Scala for the Intrigued” on YouTube
● JVM based
Scala main method I’ll be using
object AccidentsExample {
def main(args: Array[String]) : Unit = {
}
}
Which age group gets in the most accidents?
Spark entrypoint
val session = SparkSession.builder().appName("Accidents").master("local[*]")
Creating a DataFrame: the API we’ll use to interact with data as though it’s in an SQL table
val sqlContext = session.getOrCreate().sqlContext
val allAccidents = sqlContext.read.format("com.databricks.spark.csv"). option("header", "true").
load(myHome + "/datasets/road_accidents.csv")
allAccidents.show would give us a table like...
accident_index vehicle_reference vehicle_type towing_and_articulation
201506E098757 2 9 0
201506E098766 1 9 0
Group our data and save the result
...
val myAgeDF = groupCountSortAndShow(allAccidents, "age_of_casualty", true)
myAgeDF.coalesce(1).write.option("header", "true").format("csv").save("victims")
Runtime.getRuntime().exec("python plot_me.py" )
def groupCountSortAndShow(df: DataFrame, columnName: String, toShow: Boolean): DataFrame = {
  val ourSortedData = df.groupBy(columnName).count().sort("count")
  if (toShow)
    ourSortedData.show()
  ourSortedData
}
“Hold on...
what’s that getRuntime().exec
stuff?!”
It’s calling my Python code to plot the CSV file
import glob, os, pandas
from bokeh.plotting import figure, output_file, show
path = r'victims'
all_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pandas.read_csv(f) for f in all_files)
df = pandas.concat(df_from_each_file, ignore_index=True)
plot = figure(plot_width=640,plot_height=640,title='Accident victims by age',
x_axis_label='Age of victim', y_axis_label='How many')
plot.title.text_font_size = '16pt'
plot.xaxis.axis_label_text_font_size = '16pt'
plot.yaxis.axis_label_text_font_size = '16pt'
plot.scatter(x=df.age_of_casualty, y=df['count'])
output_file('victims.html')
show(plot)
Bokeh gives us a graph like this
“What else can I do?”
You’ve got some JSON files...
{"id":"2","name":"Louder Bill","average_rating":"4.1","genre":"ambient"}
{"id":"3","name":"Prey Fury","average_rating":"2","genre":"pop"}
{"id":"4","name":"Unbranded Newsroom","average_rating":"4","genre":"rap"}
{"id":"5","name":"Bugle Infantry","average_rating":"5", "genre": "doom_metal"}
{"id":"1","name":"Into Latch","average_rating":"4.9","genre":"doom_metal"}
“Best doom metal band please”
import org.apache.spark.sql.functions._
val bandsDF = sqlContext.read.json(myHome + "/datasets/bands.json")
bandsDF.createOrReplaceTempView("bands")
sqlContext.sql("SELECT name, average_rating from bands WHERE " +
  "genre == 'doom_metal'").sort(desc("average_rating")).show(1)
+--------------------+--------------+
|                name|average_rating|
+--------------------+--------------+
|      Bugle Infantry|             5|
+--------------------+--------------+
only showing top 1 row
Randomly generated band names as of May the 18th 2017, zero affiliation on my behalf or IBM’s for any of these names...entirely coincidental if they do exist
“Great, but you mentioned some data collected with wearables and machine learning!”
Anonymised data gathered from Automatic, Apple Health, Withings, Jawbone Up
● Car journeys
● Sleeping activity (start and end time)
● Daytime activity (calories consumed, steps taken)
● Weight and heart rate
● Several CSV files
● Anonymised by the subject gatherer before uploading anywhere! Nothing identifiable
Exploring the datasets: driving activity
val autoData = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "true").
load(myHome + "/datasets/geecon/automatic.csv").
withColumnRenamed("End Location Name", "Location").
withColumnRenamed("End Time", "Time")
Checking our data is sensible...
val colsWeCareAbout = Array(
  "Distance (mi)",
  "Duration (min)",
  "Fuel Cost (USD)")
for (col <- colsWeCareAbout) {
  summarise(autoData, col)
}
def summarise(df: DataFrame, columnName: String) {
  averageByCol(df, columnName)
  minByCol(df, columnName)
  maxByCol(df, columnName)
}
def averageByCol(df: DataFrame, columnName: String) {
  println("Printing the average " + columnName)
  df.agg(avg(df.col(columnName))).show()
}
def minByCol(df: DataFrame, columnName: String) {
  println("Printing the minimum " + columnName)
  df.agg(min(df.col(columnName))).show()
}
def maxByCol(df: DataFrame, columnName: String) {
  println("Printing the maximum " + columnName)
  df.agg(max(df.col(columnName))).show()
}
Average distance (in miles): 6.88, minimum: 0.01, maximum: 187.03
Average duration (in minutes): 14.87, minimum: 0.2, maximum: 186.92
Average fuel Cost (in USD): 0.58, minimum: 0.0, maximum: 14.35
Looks OK - what’s the rate of Mr X visiting a
certain place? Got a favourite gym day?
Slacking on certain days?
● Using Spark to determine the chance of the subject being there
● Timestamps (the “Time” column) need to become days of the week instead
● The start of a common theme: data preparation!
Explore the data first
autoData.show() output (abridged) – columns include:
|Vehicle|Start Location Name|Start Time|Location|Time|Distance (mi)|Duration (min)|Fuel Cost (USD)|Average MPG|Fuel Volume (gal)|Hard Accelerations|Hard Brakes|Duration Over 70 mph (secs)|Duration Over 75 mph (secs)|Duration Over 80 mph (secs)|Start Location Accuracy (meters)|End Location Accuracy (meters)|Tags|
Example rows: two short trips in a 2005 Nissan Sentra from “PokeStop 12” to “PokeStop 12” on 4/3/2016, each a fraction of a mile and a minute or two long, costing a few cents in fuel.
autoData.createOrReplaceTempView("auto_data") // the SQL below expects a temp view named auto_data
val preparedAutoData = sqlContext.sql(
  "SELECT TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)) as Date, Location, " +
  "date_format(TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)), 'EEEE') as Day FROM auto_data")
preparedAutoData.createOrReplaceTempView("prepared_auto_data") // queried on the next slides
preparedAutoData.show()
Timestamp fun: 4/03/2016 15:06 is no good!
+----------+-----------+------+
|      Date|   Location|   Day|
+----------+-----------+------+
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|   Michaels|Sunday|
+----------+-----------+------+
...
def printChanceLocationOnDay(
sqlContext: SQLContext, day: String, location: String) {
val allDatesAndDaysLogged = sqlContext.sql(
"SELECT Date, Day " +
"FROM prepared_auto_data " +
"WHERE Day = '" + day + "'").distinct()
allDatesAndDaysLogged.show()
Scala function: give us all of the rows where
the day is what we specified
+----------+------+
| Date| Day|
+----------+------+
|2016-10-17|Monday|
|2016-10-24|Monday|
|2016-04-25|Monday|
|2017-03-27|Monday|
|2016-08-15|Monday|
...
+----------+--------+------+
|      Date|Location|   Day|
+----------+--------+------+
|2016-04-04|     Gym|Monday|
|2016-11-14|     Gym|Monday|
|2017-01-09|     Gym|Monday|
|2017-02-06|     Gym|Monday|
+----------+--------+------+
...
  val visits = sqlContext.sql(
    "SELECT * FROM prepared_auto_data " +
    "WHERE Location = '" + location + "' AND Day = '" + day + "'")
  visits.show()

  val rate = Math.floor((Double.valueOf(visits.count()) /
    Double.valueOf(allDatesAndDaysLogged.count())) * 100)

  println(rate + "% rate of being at the location '" + location + "' on " + day +
    ", activity logged for " + allDatesAndDaysLogged.count() + " " + day + "s")
}
Rows where the location and day matches our query (passed in as parameters)
● 7% rate of being at the location 'Gym' on Monday, activity logged for 51 Mondays
● 1% rate of being at the location 'Gym' on Tuesday, activity logged for 51 Tuesdays
● 2% rate of being at the location 'Gym' on Wednesday, activity logged for 49 Wednesdays
● 6% rate of being at the location 'Gym' on Thursday, activity logged for 47 Thursdays
● 7% rate of being at the location 'Gym' on Saturday, activity logged for 41 Saturdays
● 9% rate of being at the location 'Gym' on Sunday, activity logged for 41 Sundays
val days = Array("Monday", "Tuesday", "Wednesday", "Thursday",
  "Friday", "Saturday", "Sunday")
for (day <- days) {
  printChanceLocationOnDay(sqlContext, day, "Gym")
}
Which feature(s) are closely related to another -
e.g. the time spent asleep?
Dataset has these features from Jawbone
● s_duration (the sleep time as well...)
● m_active_time
● m_calories
● m_distance
● m_steps
● m_total_calories
● n_bedtime (hmm)
● n_awake_time
How about correlations?
Very strong positive correlation for n_bedtime and s_asleep_time
Correlation between goal_body_weight and s_asleep_time: -0.02
val shouldBeLow = sleepData.stat.corr("goal_body_weight", "s_duration")
println("Correlation between goal body weight and sleep duration: " + shouldBeLow)
val compareToCol = "s_duration"
for (col <- sleepData.columns) {
  if (!col.equals(compareToCol)) { // don’t compare to itself...
    val corr = sleepData.stat.corr(col, compareToCol)
    if (corr > 0.8) {
      println("Very strong positive correlation for " + col + " and " + compareToCol)
    } else if (corr >= 0.5) {
      println("Positive correlation for " + col + " and " + compareToCol)
    }
  }
}
And something we know isn’t related?
“...can Spark help me to get
a good sleep?”
Need to define a good sleep first
8 hours for this test subject
If duration is > 8 hours
good sleep = true, else false
I’m using 1 for true and 0 for false
We will label this data soon so remember this
Then we’ll determine the most influential features on the value being true or false. This can reveal the interesting stuff!
Sanity check first: any good sleeps for Mr X?
Found 538 valid recorded sleep times and 129 were 8 or more
hours in duration
// Don't care if the sleep duration wasn't even recorded or it's 0
val onlyDurations = sleepData.select("s_duration")
val onlyRecordedSleeps = onlyDurations.filter($"s_duration" > 0)
val NUM_HOURS = 8
val THRESHOLD = 60 * 60 * NUM_HOURS
val onlyGoodSleeps = onlyDurations.filter($"s_duration" >= THRESHOLD)
println("Found " + onlyRecordedSleeps.count() + " valid recorded " +
  "sleep times and " + onlyGoodSleeps.count() + " were " + NUM_HOURS +
  " or more hours in duration")
We will use machine learning: but first...
1) What do we want to find out? Main contributing factors to a good sleep
2) Pick an algorithm
3) Prepare the data
4) Separate into training and test data
5) Build a model with the training data (in parallel using Spark!)
6) Use that model on the test data
7) Evaluate the model
8) Experiment with parameters until reasonably accurate e.g. N iterations
Clustering algorithms such as K-means (unsupervised learning (no labels, cheap))
● Produce n clusters from data to determine which cluster a new item can be categorised as
● Identify anomalies: transaction fraud, erroneous data
Recommendation algorithms such as Alternating Least Squares
● Movie recommendations on Netflix?
● Recommended purchases on Amazon?
● Similar songs with Spotify?
● Recommended videos on YouTube?
Classification algorithms such as Logistic regression and Naive Bayes
● Create a model that we can use to predict where to plot the next item in a sequence (above or below our line of best fit)
● Healthcare: predict adverse drug reactions based on known interactions with similar drugs
● Spam filter (binomial classification)
Which algorithms might be of use?
What does “Naive Bayes” have to do with my sleep quality?
Using evidence provided, guess what a label will be (1 or 0) for us: easy to use with some training data
0 = the label (category 0 or 1)
e.g. 0 = low scoring athlete, 1 = high scoring
1:x = the score for a sporting event 1
2:x = the score for a sporting event 2
3:x = the score for a sporting event 3
bayes_data.txt (libSVM format)
val bayesData = sqlContext.read.format("libsvm").load("bayes_data.txt")
val Array(trainingData, testData) = bayesData.randomSplit(Array(0.7, 0.3))
val model = new NaiveBayes().fit(trainingData)
val predictions = model.transform(testData)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test set accuracy = " + accuracy)
Test set accuracy = 0.82
Read it in, split it, fit it, transform and evaluate – all on one slide with Spark!
https://spark.apache.org/docs/2.1.0/mllib-naive-bayes.html
Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features. Naive
Bayes can be trained very efficiently. Within a single pass to the training data, it computes the conditional probability distribution of each
feature given label, and then it applies Bayes’ theorem to compute the conditional probability distribution of label given an observation
and use it for prediction.
Naive Bayes correctly classifies the data (giving it the right labels)
Feed some new data in for the model...
“Can I just use Naive Bayes
on all of the sleep data?”
1) didn’t label each row in the
DataFrame yet
2) Naive Bayes can’t handle
our data in the current form
3) too many useless features
Possibilities – bear in mind that DataFrames are immutable, can't modify elements directly...
1) Spark has a .map function, how about that?
“map is a transformation that passes each dataset element through a function and returns a new RDD representing the results” - http://spark.apache.org/docs/latest/programming-guide.html
● Removes all other columns in my case... (new DataFrame with just the labels!)
2) Running a user defined function on each row?
● Maybe, but can Spark’s internal SQL optimiser “Catalyst” see and optimise it? Probably slow
Labelling each row according to our “good
sleep” criteria
Preparing the labels
Preparing the features is easier
val labelledSleepData = sleepData.
withColumn("s_duration", when(col("s_duration") > THRESHOLD, 1).
otherwise(0))
val assembler = new VectorAssembler()
.setInputCols(sleepData.columns)
.setOutputCol("features")
val preparedData = assembler.transform(labelledSleepData).
withColumnRenamed("s_duration", "good_sleep")
“If duration is > 8 hours
good sleep = true, else false
I’m using 1 for true and 0 for false”
✔
Labelled data now
1) didn’t label each row in the
DataFrame yet
2) Naive Bayes can’t handle
our data in the current form
3) too many useless features
Trying to fit a model to the DataFrame now leads to...
s_asleep_time and n_bedtime (integers)
API docs: “Time user fell asleep. Seconds to/from midnight. If negative,
subtract from midnight. If positive, add to midnight”
Solution in this example?
Change to positives only
Add the number of seconds in a day to whatever s_asleep_time's
value is. Think it through properly when you try this if you’re done
experimenting and want something reliable to use!
The problem...
New DataFrame where negative values are handled
val preparedSleepAsLabel = preparedData.withColumnRenamed("good_sleep", "label")
val secondsInDay = 24 * 60 * 60
val toModel = preparedSleepAsLabel.
  withColumn("s_asleep_time", (col("s_asleep_time")) + secondsInDay).
  withColumn("s_bedtime", (col("s_bedtime")) + secondsInDay)
toModel.createOrReplaceTempView("to_model_table")
1) didn’t label each row in the
DataFrame yet
2) Naive Bayes can’t handle
our data in the current form
3) too many useless features
Reducing your “feature space”
Spark’s ChiSqSelector algorithm will work here
We want labels and features to inspect
val selector = new ChiSqSelector()
.setNumTopFeatures(10)
.setFeaturesCol("features")
.setLabelCol("good_sleep")
.setOutputCol("selected_features")
val model = selector.fit(preparedData)
val topFeatureIndexes = model.selectedFeatures
for (i <- 1 to topFeatureIndexes.length - 1) {
// Get col names based on feature indexes
println(preparedData.columns(topFeatureIndexes(i)))
}
Using ChiSq selector to get the top features
Feature selection tries to identify relevant features for use in model construction. It reduces the size of the feature space, which can improve both speed and
statistical learning behavior. ChiSqSelector implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses
the Chi-Squared test of independence to decide which features to choose. It supports three selection methods: numTopFeatures, percentile, fpr:
numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html#chisqselector
Transform values into a “features” column and
only select columns we identified as influential
Earlier we did...
toModel.createOrReplaceTempView("to_model_table")
val onlyInterestingColumns = sqlContext.sql("SELECT label, " +
  colNames.toString() + " FROM to_model_table")
val theAssembler = new VectorAssembler()
.setInputCols(onlyInterestingColumns.columns)
.setOutputCol("features")
val thePreparedData = theAssembler.transform(onlyInterestingColumns)
Top ten influential features (most to least influential)
Feature Description from Jawbone API docs
s_count Number of primary sleep entries logged
s_awake_time Time the user woke up
s_quality Proprietary formula, don't know
s_asleep_time Time when the user fell asleep
s_bedtime Seconds the device is in sleep mode
s_deep Seconds of main “sound sleep”
s_light Seconds of “light sleeps” during the sleep period
m_workout_time Length of logged workouts in seconds
n_light Seconds of light sleep during the nap
n_sound Seconds of sound sleep during the nap
1) didn’t label each row in the
DataFrame yet
2) Naive Bayes can’t handle
our data in the current form
3) too many useless features
And after all that...we can generate predictions!
val Array(trainingSleepData, testSleepData) = thePreparedData.randomSplit(Array(0.7, 0.3))
val sleepModel = new NaiveBayes().fit(trainingSleepData)
val predictions = sleepModel.transform(testSleepData)
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test set accuracy for labelled sleep data = " + accuracy)
Test set accuracy for labelled sleep data = 0.81 ...
Testing it with new data
val somethingNew = sqlContext.createDataFrame(Seq(
  // Good sleep: high workout time, achieved a good amount of deep sleep,
  // went to bed after midnight and woke at almost noon!
  (0, Vectors.dense(0, 1, 42600, 100, 87659, 85436, 16138, 22142, 4073, 0)),
  // Bad sleep: woke up early (5 AM), didn't get much of a deep sleep,
  // didn't workout, bedtime 10.20 PM
  (0, Vectors.dense(0, 0, 18925, 0, 80383, 80083, 6653, 17568, 0, 0))
)).toDF("label", "features")
sleepModel.transform(somethingNew).show()
Sensible model created with outcomes we’d expect
Go to bed earlier, exercise more
I could have looked closer into removing the s_ variables so they’re all m_ and diet information; an exercise for the reader
Algorithms are producing these outcomes
without domain specific knowledge
Last example: “does weighing more result in a higher heart rate?”
Will get the average of all the heart rates logged on a day when
weight was measured
Lower heart rate day = weight was more?
Higher rate day = weight was less?
Maybe MLlib again? But all that preparation work...
How deeply involved with Spark do we usually
need to get?
More data preparation needed, but there’s a twist
Here I use data from two tables: weights, activities
+----------+------+
|      Date|weight|
+----------+------+
|2017-04-09| 220.4|
|2017-04-08| 219.9|
|2017-04-07| 221.0|
+----------+------+
only showing top 3 rows
becomes
Times are removed as we only care about dates
Include only heart beat readings when we have
weight(s) measured: join on date used
+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.3|                  77.0|
|2017-02-09| 215.9|                  97.0|
|2017-02-09| 215.9|                 104.0|
|2017-02-09| 215.9|                  88.0|
+----------+------+----------------------+
...
Average the rate and weight readings by day
+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.7|                  77.0|
+----------+------+----------------------+
...
Should become this:
+----------+----------+--------------------------+
|      Date|avg weight|avg_heart_beats_per_minute|
+----------+----------+--------------------------+
|2017-02-13|     220.5|                        78|
+----------+----------+--------------------------+
...
DataFrame now looks like this...
+----------+---------------------------+-----------+
|      Date|avg(heart_beats_per_minute)|avg(weight)|
+----------+---------------------------+-----------+
|2016-04-25|                  85.933...|  196.46...|
|2017-01-06|                 93.8125...|      216.0|
|2016-05-03|                  83.647...|  198.35...|
|2016-07-26|                  84.411...|  192.69...|
+----------+---------------------------+-----------+
Something we can quickly plot!
Bokeh used again, no more analysis required
Used the same functions as earlier (groupBy, formatting dates) and
also a join. Same plotting with different column names. No distinct
correlation identified so moved on
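The join and per-day averaging described above isn't shown in the deck; a minimal sketch of how it could look, assuming weightData and heartData DataFrames whose Date columns were prepared as earlier (both names are assumptions):

import org.apache.spark.sql.functions.avg

// Keep only heart rate readings on dates where a weight was also logged,
// then average both per day – ready to hand to the plotting script
val joined = weightData.join(heartData, "Date")
val averagedByDay = joined.groupBy("Date")
  .agg(avg("heart_beats_per_minute"), avg("weight"))
averagedByDay.coalesce(1).write.option("header", "true").format("csv").save("weight_vs_heart")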
Still lots of questions we could answer with Spark using this data
● Any impact on mpg when the driver weighs much less than before?
● Which fuel provider gives me the best mpg?
● Which visited places have a positive effect on subject’s weight?
● Analytics doesn’t need to be complicated: Spark’s good for the heavy lifting
● Sometimes best to just plot as you go – saves plenty of time
● Other harder things to worry about: writing a distributed machine learning algorithm shouldn’t be one of them!
“Which tools can I use to answer my questions?” This question becomes easier
Infrastructure when you’re ready to scale beyond your laptop
● Setting up a huge HA cluster: a talk on its own
● Who sets up then maintains the machines? Automate it all?
● How many machines do you need? RAM/CPUs?
● Who ensures all software is up to date (CVEs?)
● Access control lists?
● Hosting costs/providers?
● Reliability, fault tolerance, backup procedures...
Still got to think about...
● Use GPUs to train models faster
● DeepLearning4J?
● Writing your own kernels/C/JNI code (or a Java API like CUDA4J/Aparapi?)
● Use RDMA to reduce network transfer times
● Zero copy: RoCE or InfiniBand?
● Tune the JDK, the OS, the hardware
● Continuously evaluate performance: Spark itself, use -Xhealthcenter, your own metrics, various libraries...
● Go tackle something huge – join the alien search
● Combine Spark Streaming with MLlib to gain insights fast (see the sketch after this slide)
● More informed decision making
And if you want to really show off with Spark
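A minimal sketch of the Streaming-plus-MLlib idea above, assuming the SparkSession (spark) and the sleepModel trained earlier are available; the socket source, port number and comma-separated feature rows are all assumptions for illustration:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
// Hypothetical feed: one comma-separated row of feature values per line
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val incoming = spark.createDataFrame(
      rdd.map(line => Tuple1(Vectors.dense(line.split(",").map(_.toDouble))))
    ).toDF("features")
    // Score each micro-batch with the model trained earlier
    sleepModel.transform(incoming).select("features", "prediction").show()
  }
}
ssc.start()
ssc.awaitTermination()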
● Know more about Spark: what it can and can’t do (new project ideas?)
● Know more about machine learning in Spark
● Know that machine learning’s still hard but in different ways
Data preparation, handling junk, knowing what to look for
Getting the data in the first place
Writing the algorithms to be used in Spark?
Recap – you should now...
● Built-in Spark functions are aplenty – try and stick to these
● You can plot your results by saving to a csv/json and using your existing favourite plotting libraries easily
● DataFrames (or Datasets) combined with ML = powerful APIs
● Filter your data – decide how to handle nulls! (see the sketch after this slide)
● Pick and use a suitable ML algorithm
● Plot results
Points to take home...
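For the null-handling point above, a minimal sketch using the DataFrame na functions (the column names and fill values are assumptions based on the Jawbone features used earlier):

// Drop rows where any column is null, or fill specific columns with defaults
val noNulls = sleepData.na.drop()
val filled = sleepData.na.fill(Map(
  "m_steps" -> 0,    // assume no recorded steps means zero
  "s_duration" -> 0  // treat a missing sleep duration as unrecorded (filtered out later)
))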
Final points to consider...
Where would Spark fit in to your systems? A replacement or supplementary?
Give it a try with your own data and you might be surprised with the outcome
It’s free and open source with a very active community!
Contact me directly: aroberts@uk.ibm.com
Questions?
● Automatic: log into the Automatic Dashboard https://dashboard.automatic.com/, on the bottom right, click export, choose what data you want to export (e.g. All)
● Fuelly: (obtained Gas Cubby), log into the Fuelly Dashboard http://www.fuelly.com/dashboard, select your vehicle in Your Garage, scroll down to vehicle logs, select Export Fuel-ups or Export Services, select duration of export
● Jawbone: sign into your account at https://jawbone.com/, click on your name on the top right, choose Settings, click on the Accounts tab, scroll down to Download UP Data, choose which year you'd like to download data for
How did I access the data to process?
● Withings: log into the Withings Dashboard https://healthmate.withings.com, click Measurement table, click the tab corresponding to the data you want to export, click download. You can go here to download all data instead: https://account.withings.com/export/
● Apple: launch the Health app, navigate to the Health Data tab, select your account in the top right area of your screen, select Export Health Data
● Remember to remove any sensitive personal information before sharing/showing/storing said data elsewhere! I am dealing with “cleansed” datasets with no SPI
More Related Content

What's hot

Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
 
Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR OverviewKhalid Salama
 
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021Sandesh Rao
 
The convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on HadoopThe convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on HadoopDataWorks Summit
 
Bigdata Machine Learning Platform
Bigdata Machine Learning PlatformBigdata Machine Learning Platform
Bigdata Machine Learning PlatformMk Kim
 
Gain Insights with Graph Analytics
Gain Insights with Graph Analytics Gain Insights with Graph Analytics
Gain Insights with Graph Analytics Jean Ihm
 
Deep Learning for Recommender Systems with Nick pentreath
Deep Learning for Recommender Systems with Nick pentreathDeep Learning for Recommender Systems with Nick pentreath
Deep Learning for Recommender Systems with Nick pentreathDatabricks
 
The Future of Data Warehousing, Data Science and Machine Learning
The Future of Data Warehousing, Data Science and Machine LearningThe Future of Data Warehousing, Data Science and Machine Learning
The Future of Data Warehousing, Data Science and Machine LearningModusOptimum
 
Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1) Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1) Jean Ihm
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingJan Wiegelmann
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureData Science Milan
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Pivotal: Data Scientists on the Front Line: Examples of Data Science in ActionPivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Pivotal: Data Scientists on the Front Line: Examples of Data Science in ActionEMC
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...Jean Ihm
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopDataWorks Summit
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineJan Wiegelmann
 
How To Visualize Graphs
How To Visualize GraphsHow To Visualize Graphs
How To Visualize GraphsJean Ihm
 
Delivering Data Science to the Business
Delivering Data Science to the BusinessDelivering Data Science to the Business
Delivering Data Science to the BusinessDataWorks Summit
 
Challenges of Deep Learning in the Automotive Industry and Autonomous Driving
Challenges of Deep Learning in the Automotive Industry and Autonomous DrivingChallenges of Deep Learning in the Automotive Industry and Autonomous Driving
Challenges of Deep Learning in the Automotive Industry and Autonomous DrivingJan Wiegelmann
 

What's hot (20)

Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR Overview
 
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
 
The convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on HadoopThe convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on Hadoop
 
Bigdata Machine Learning Platform
Bigdata Machine Learning PlatformBigdata Machine Learning Platform
Bigdata Machine Learning Platform
 
Gain Insights with Graph Analytics
Gain Insights with Graph Analytics Gain Insights with Graph Analytics
Gain Insights with Graph Analytics
 
Deep Learning for Recommender Systems with Nick pentreath
Deep Learning for Recommender Systems with Nick pentreathDeep Learning for Recommender Systems with Nick pentreath
Deep Learning for Recommender Systems with Nick pentreath
 
The Future of Data Warehousing, Data Science and Machine Learning
The Future of Data Warehousing, Data Science and Machine LearningThe Future of Data Warehousing, Data Science and Machine Learning
The Future of Data Warehousing, Data Science and Machine Learning
 
Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1) Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1)
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous Driving
 
AI meets Big Data
AI meets Big DataAI meets Big Data
AI meets Big Data
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Pivotal: Data Scientists on the Front Line: Examples of Data Science in ActionPivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / Pipeline
 
How To Visualize Graphs
How To Visualize GraphsHow To Visualize Graphs
How To Visualize Graphs
 
Delivering Data Science to the Business
Delivering Data Science to the BusinessDelivering Data Science to the Business
Delivering Data Science to the Business
 
Challenges of Deep Learning in the Automotive Industry and Autonomous Driving
Challenges of Deep Learning in the Automotive Industry and Autonomous DrivingChallenges of Deep Learning in the Automotive Industry and Autonomous Driving
Challenges of Deep Learning in the Automotive Industry and Autonomous Driving
 

Similar to Here are a few other tips for getting started with Spark:- Start simple - Don't try to solve a huge complex problem right away. Try analyzing a small dataset first to get familiar with Spark.- Use the built-in examples - Spark comes with many example programs that demonstrate core concepts. Run them to see Spark in action.- Leverage notebooks - Tools like Zeppelin, Jupyter, and Databricks notebooks make it easy to interactively explore data with Spark. - Profile and tune - As you scale up, use Spark's built-in monitoring and profiling tools to optimize performance. - Leverage Spark packages - There are packages for common tasks like SQL, streaming, ML

Java and the GPU - Everything You Need To Know
Java and the GPU - Everything You Need To KnowJava and the GPU - Everything You Need To Know
Java and the GPU - Everything You Need To KnowAdam Roberts
 
Using GPUs to Achieve Massive Parallelism in Java 8
Using GPUs to Achieve Massive Parallelism in Java 8Using GPUs to Achieve Massive Parallelism in Java 8
Using GPUs to Achieve Massive Parallelism in Java 8Dev_Events
 
Highly successful performance tuning of an informix database
Highly successful performance tuning of an informix databaseHighly successful performance tuning of an informix database
Highly successful performance tuning of an informix databaseIBM_Info_Management
 
IBM Connect 2016 - Logging Wars: A Cross Product Tech Clash Between Experts -...
IBM Connect 2016 - Logging Wars: A Cross Product Tech Clash Between Experts -...IBM Connect 2016 - Logging Wars: A Cross Product Tech Clash Between Experts -...
IBM Connect 2016 - Logging Wars: A Cross Product Tech Clash Between Experts -...Chris Miller
 
Why z/OS is a great platform for developing and hosting APIs
Why z/OS is a great platform for developing and hosting APIsWhy z/OS is a great platform for developing and hosting APIs
Why z/OS is a great platform for developing and hosting APIsTeodoro Cipresso
 
DESY's new data taking and analysis infrastructure for PETRA III
DESY's new data taking and analysis infrastructure for PETRA IIIDESY's new data taking and analysis infrastructure for PETRA III
DESY's new data taking and analysis infrastructure for PETRA IIIUlf Troppens
 
Accelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsAccelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsIBM
 
Making People Flow in Cities Measurable and Analyzable
Making People Flow in Cities Measurable and AnalyzableMaking People Flow in Cities Measurable and Analyzable
Making People Flow in Cities Measurable and AnalyzableWeiwei Yang
 
Informix REST API Tutorial
Informix REST API TutorialInformix REST API Tutorial
Informix REST API TutorialBrian Hughes
 
Witness the Evolution of Teamwork
Witness the Evolution of TeamworkWitness the Evolution of Teamwork
Witness the Evolution of TeamworkMatt Holitza
 
Plan ahead and act proficiently for reporting - Lessons Learned
Plan ahead and act proficiently for reporting - Lessons LearnedPlan ahead and act proficiently for reporting - Lessons Learned
Plan ahead and act proficiently for reporting - Lessons LearnedEinar Karlsen
 
Codemotion Rome 2015 Bluemix Lab Tutorial
Codemotion Rome 2015 Bluemix Lab TutorialCodemotion Rome 2015 Bluemix Lab Tutorial
Codemotion Rome 2015 Bluemix Lab Tutorialgjuljo
 
Exposing auto-generated Swagger 2.0 documents from Liberty!
Exposing auto-generated Swagger 2.0 documents from Liberty!Exposing auto-generated Swagger 2.0 documents from Liberty!
Exposing auto-generated Swagger 2.0 documents from Liberty!Arthur De Magalhaes
 
Academic Discussion Group Workshop 2018 November 10 st 2018 Nimbix CAPI SNAP...
Academic Discussion  Group Workshop 2018 November 10 st 2018 Nimbix CAPI SNAP...Academic Discussion  Group Workshop 2018 November 10 st 2018 Nimbix CAPI SNAP...
Academic Discussion Group Workshop 2018 November 10 st 2018 Nimbix CAPI SNAP...Ganesan Narayanasamy
 
Analyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGIS
Analyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGISAnalyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGIS
Analyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGISIBM Cloud Data Services
 
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892Torsten Steinbach
 
2449 rapid prototyping of innovative io t solutions
2449   rapid prototyping of innovative io t solutions2449   rapid prototyping of innovative io t solutions
2449 rapid prototyping of innovative io t solutionsEric Cattoir
 
OpenWhisk Introduction
OpenWhisk IntroductionOpenWhisk Introduction
OpenWhisk IntroductionIoana Baldini
 
Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...DataWorks Summit
 

Similar to Here are a few other tips for getting started with Spark:- Start simple - Don't try to solve a huge complex problem right away. Try analyzing a small dataset first to get familiar with Spark.- Use the built-in examples - Spark comes with many example programs that demonstrate core concepts. Run them to see Spark in action.- Leverage notebooks - Tools like Zeppelin, Jupyter, and Databricks notebooks make it easy to interactively explore data with Spark. - Profile and tune - As you scale up, use Spark's built-in monitoring and profiling tools to optimize performance. - Leverage Spark packages - There are packages for common tasks like SQL, streaming, ML (20)

Java and the GPU - Everything You Need To Know
Java and the GPU - Everything You Need To KnowJava and the GPU - Everything You Need To Know
Java and the GPU - Everything You Need To Know
 
Using GPUs to Achieve Massive Parallelism in Java 8
Using GPUs to Achieve Massive Parallelism in Java 8Using GPUs to Achieve Massive Parallelism in Java 8
Here are a few other tips for getting started with Spark:
● Start simple: don't try to solve a huge, complex problem right away. Analyse a small dataset first to get familiar with Spark.
● Use the built-in examples: Spark ships with many example programs that demonstrate core concepts. Run them to see Spark in action.
● Leverage notebooks: tools like Zeppelin, Jupyter and Databricks notebooks make it easy to explore data interactively with Spark.
● Profile and tune: as you scale up, use Spark's built-in monitoring and profiling tools to optimise performance.
● Leverage Spark packages: there are packages for common tasks like SQL, streaming and ML.

  • 1. DIY ANALYTICS WITH APACHE SPARK ADAM ROBERTS London, 22nd June 2017: originally presented at Geecon
  • 2. Important disclaimers Copyright © 2017 by Internatonal Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Informaton in these presentatons (including informaton relatng to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of inital publicaton and could include unintentonal technical or typographical errors. IBM shall have no responsibility to update this information. THIS document is distributed "AS IS" without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this informaton, including but not limited to, loss of data, business interrupton, loss of profit or loss of opportunity. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operatng environments may vary. References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall consttute legal or other guidance or advice to any individual partcipant or their specific situation. It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identficaton and interpretaton of any relevant laws and regulatory requirements that may affect the customer’s business and any actons the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law. Information within this presentation is accurate to the best of the author's knowledge as of the 4th of June 2017
  • 3. Informaton concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connecton with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilites of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warrantes of merchantability and fitness for a partcular purpose. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™, Global Business Services ®, Global Technology Services ®, Informaton on Demand, ILOG, LinuxONE™, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytcs™, PureApplicaton®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Ratonal®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of Internatonal Business Machines Corporaton, registered in many jurisdictons worldwide. Other product and service names might be trademarks of IBM or other companies. Oracle and Java are registered trademarks of Oracle and/or its afiliates. Other names may be trademarks of their respectve owners: and a current list of IBM trademarks is available on the Web at "Copyright and trademark informaton" at www.ibm.com/legal/copytrade.shtml. Apache Spark, Apache Cassandra, Apache Hadoop, Apache Maven, Apache Kafka and any other Apache project mentoned here and the Apache product logos including the Spark logo are trademarks of The Apache Software Foundaton.
  • 4. ● Showing you how to get started from scratch: going from “I’ve heard about Spark” to “I can use it for...” ● Worked examples aplenty: lots of code ● Not intended to be scientifically accurate! Sharing ideas ● Useful reference material ● Slides will be hosted Stick around for...
  • 5. ✔ Doing stuff yourself (within your timeframe and rules) ✔ Findings can be subject to bias: yours don’t have to be ✔ Trust the data instead Motivation!
  • 6. ✔ Finding aliens with the SETI institute ✔ Genomics projects (GATK, Bluemix Genomics) ✔ IBM Watson services Cool projects involving Spark
  • 7. ✔ Powerful machine(s) ✔ Apache Spark and a JDK ✔ Scala (recommended) ✔ Optional: visualisation library for Spark output e.g. Python with ✔ bokeh ✔ pandas ✔ Optional but not covered here: a notebook bundled with Spark like Zeppelin, or use Jupyter Your DIY analytics toolkit Toolbox from wikimedia: Tanemori derivative work: ‫י‬‫ק‬‫נ‬‫א‬'‫ג‬‫י‬‫ק‬‫יו‬
  • 8. Why listen to me? ● Worked on Apache Spark since 2014 ● Helping IBM customers use Spark for the first time ● Resolving problems, educating service teams ● Testing on lots of IBM platforms since Spark 1.2: x86, Power, Z systems, all Java 8 deliverables... ● Fixing bugs in Spark/Java: contributing code and helping others to do so ● Working with performance tuning pros ● Code provided here has an emphasis on readability!
  • 9. ● What is it (why the hype)? ● How to answer questions with Spark ● Core Spark functions (the “bread and butter” stuff), plotting, correlations, machine learning ● Built-in utility functions to make our lives easier (labels, features, handling nulls) ● Examples using data from wearables: two years of activity What I'll be covering today
  • 10. Ask me later if you're interested in... ● Spark on IBM hardware ● IBM SDK for Java specifics ● Notebooks ● Spark using GPUs/GPUs from Java ● Performance tuning ● Comparison with other projects ● War stories fixing Spark/Java bugs
  • 11. ● You know how to write Java or Scala ● You’ve heard about Spark but never used it ● You have something to process! What I assume...
  • 12. This talk won’t make you a superhero!
  • 13. ● Know more about Spark – what it can/can’t do ● Know more about machine learning in Spark ● Know that machine learning’s still hard but in different ways But you will...
  • 14. Open source project (the most active for big data) offering distributed... ● Machine learning ● Graph processing ● Core operations (map, reduce, joins) ● SQL syntax with DataFrames/Datasets
  • 15. ✔ Build it yourself from source (requiring Git, Maven, a JDK) or ✔ Download a community built binary or ✔ Download our free Spark development package (includes IBM's SDK for Java)
  • 16. Things you can process... ● File formats you could use with Hadoop ● Anything there’s a Spark package for ● json, csv, parquet... Things you can use with it... ● Kafka for streaming ● Hive tables ● Cassandra as a database ● Hadoop (using HDFS with Spark) ● DB2!
  • 17. “What’s so good about it then?”
  • 18. ● Offers scalability and resiliency ● Auto-compression, fast serialisation, caching ● Python, R, Scala and Java APIs: eligible for Java based optimisations ● Distributed machine learning!
  • 19. “Why isn’t everyone using it?”
  • 20. ● Can you get away with using spreadsheet software? ● Have you really got a large amount of data? ● Data preparation is very important! How will you properly handle negative, null, or otherwise strange values in your data? ● Will you benefit from massive concurrency? ● Is the data in a format you can work with? ● Needs transforming first (and is it worth it)? Not every problem is a Spark one!
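Data preparation comes up again later in this deck; as a taster, here is a minimal sketch (not from the original slides) of handling nulls and junk values with Spark's DataFrame na helpers, assuming the allAccidents DataFrame and column names used later on:
import org.apache.spark.sql.functions.col

val cleanedAccidents = allAccidents
  .na.drop(Seq("age_of_driver"))                      // drop rows where this column is null
  .na.fill("-1", Seq("age_of_vehicle"))               // or substitute a sentinel value instead
  .filter(col("age_of_driver").cast("int") >= 0)      // throw away clearly bad readings
How you fill or drop is a judgement call for your own data; the point is that Spark already has the plumbing for it.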
  • 21. ● Not really real-time streaming (“micro-batching”) ● Debugging in a largely distributed system with many moving parts can be tough ● Security: not really locked down out of the box (extra steps required by knowledgeable users: whole disk encryption or using other projects, SSL config to do...) Implementation details...
  • 22. Getting something up and running quickly
  • 23. Run any Spark example in “local mode” first (from “spark”) bin/run-example org.apache.spark.examples.SparkPi 100 Then run it on a cluster you can set up yourself: Add hostnames in conf/slaves sbin/start-all.sh bin/run-example --master <your_master:7077> ... Check for running Java processes: looking for workers/executors coming and going Spark UI (default port 8080 on the master) See: http://spark.apache.org/docs/latest/spark-standalone.html lib is only with the IBM package Running something simple
  • 24. And you can use Spark's Java/Scala APIs with bin/spark-shell (a REPL!) bin/spark-submit java/scala -cp "$SPARK_HOME/jars/*" PySpark not covered in this presentation – but fun to experiment with and lots of good docs online for you
  • 25. Increasing the number of threads available for Spark processing in local mode (5.2gb text file) – actually works? --master local[1] real 3m45.328s --master local[4] real 1m31.889s time { echo "--master local[1]" $SPARK_HOME/bin/spark-submit --master local[1] --class MyClass WordCount.jar } time { echo "--master local[4]" $SPARK_HOME/bin/spark-submit --master local[4] --class MyClass WordCount.jar }
  • 26. “Anything else good about Spark?”
  • 27. ● Resiliency by replication and lineage tracking ● Distribution of processing via (potentially many) workers that can spawn (potentially many) executors ● Caching! Keep data in memory, reuse later ● Versatility and interoperability APIs include Spark core, ML, DataFrames and Datasets, Streaming and Graphx ... ● Read up on RDDs and ML material by Andrew Ng, Spark Summit videos, deep dives on Catalyst/Tungsten if you want to really get stuck in! This is a DIY talk
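Since caching comes up repeatedly, a minimal sketch (not from the original slides) of what it looks like, assuming df is any DataFrame you intend to reuse and that it has the vehicle_type column used later:
val cached = df.cache()                              // marks it for caching; nothing happens yet
println(cached.count())                              // first action reads the data and fills the cache
cached.groupBy("vehicle_type").count().show()        // subsequent actions reuse the cached rows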
  • 28. Recap – we know what it is now...and want to do some analytics!
  • 29. ● Data I’ll process here is for educational purposes only: road_accidents.csv ● Kaggle is a good place to practice – lots of datasets available for you ● Data I'm using is licensed under the Open Government License for public sector information
  • 30. "accident_index","vehicle_reference","vehicle_type","towing_and_articulation", "vehicle_manoeuvre","vehicle_location”,restricted_lane","junction_location","skidding_and_overturning","hit_object_in_ca rriageway","vehicle_leaving_carriageway","hit_object_off_carriageway","1st_point_of_impact","was_vehicle_left_hand_dri ve?","journey_purpose_of_driver","sex_of_driver","age_of_driver","age_band_of_driver","engine_capacity_(cc)","propulsio n_code","age_of_vehicle","driver_imd_decile","driver_home_area_type","vehicle_imd_decile","NUmber_of_Casualities_un ique_to_accident_ind ex","No_of_Vehicles_involved_unique_to_accident_index","location_easting_osgr","location_north ing_osgr","longitude","latitude","police_force","accident_severity","number_of_vehicles","number_of_casualties","date","da y_of_week","time","local_authority_(district)","local_authority_(highway)","1st_road_class","1st_road_number","road_type", "speed_limit","junction_detail","junction_control"," 2nd_road_class","2nd_road_number","pedestrian_crossing- human_control","pedestrian_crossing-physical_facilities", "light_conditions","weather_conditions","road_surface_conditions","special_conditions_at_site","carriageway_hazards"," urban_or_rural_area","did_police_officer_attend_scene_of_accident","lsoa_of_accident_location","casualty_reference","ca sualty_class","sex_of_casualty","age_of_casualty","age_band_of_casualty","casualty_severity","pedestrian_location","pe destrian_movement","car_passenger","bus_or_coach_passenger","pedestrian_road_maintenance_worker","casualty_type ","casualty_home_area_type","casualty_imd_decile" Features of the data (“columns”)
  • 32. Spark way to figure this out? groupBy* vehicle_type sort** the results on count vehicle_type maps to a code First place: car Distant second: pedal bike Close third: van/goods HGV <= 3.5 T Distant last: electric motorcycle Type of vehicle involved in the most accidents?
  • 33. Different column name this time, weather_conditions maps to a code again First place: fine with no high winds Second: raining, no high winds Distant third: fine, with high winds Distant last: snowing, high winds groupBy* weather_conditions sort** the results on count weather_conditions maps to a code What weather should I be avoiding?
  • 34. First place: going ahead (!) Distant second: turning right Distant third: slowing or stopping Last: reversing Spark way... groupBy* manoeuvre sort** the results on count manoeuvre maps to a code Which manoeuvres should I be careful with?
  • 35. “Why * and **?” org.apache.spark functions that can run in a distributed manner
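For reference, a minimal sketch of that groupBy/sort pattern, assuming the allAccidents DataFrame that is created a couple of slides later:
import org.apache.spark.sql.functions.desc

allAccidents
  .groupBy("vehicle_type")     // one row per vehicle type code
  .count()                     // adds a "count" column
  .sort(desc("count"))         // most accident-prone codes first
  .show(5)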
  • 36. Spark code example – I'm using Scala ● Forced mutability consideration (val or var) ● Not mandatory to declare types (or “return ...”) ● Check out “Scala for the Intrigued” on YouTube ● JVM based Scala main method I’ll be using object AccidentsExample { def main(args: Array[String]) : Unit = { } } Which age group gets in the most accidents?
  • 37. Spark entrypoint
val session = SparkSession.builder().appName("Accidents").master("local[*]")
Creating a DataFrame: API we’ll use to interact with data as though it’s in an SQL table
val sqlContext = session.getOrCreate().sqlContext
val allAccidents = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  load(myHome + "/datasets/road_accidents.csv")
allAccidents.show would give us a table like...
accident_index vehicle_reference vehicle_type towing_and_articulation
201506E098757 2 9 0
201506E098766 1 9 0
  • 38. Group our data and save the result ...
val myAgeDF = groupCountSortAndShow(allAccidents, "age_of_casualty", true)
myAgeDF.coalesce(1).
  write.option("header", "true").
  format("csv").
  save("victims")
Runtime.getRuntime().exec("python plot_me.py")
def groupCountSortAndShow(df: DataFrame, columnName: String, toShow: Boolean): DataFrame = {
  val ourSortedData = df.groupBy(columnName).count().sort("count")
  if (toShow) ourSortedData.show()
  ourSortedData
}
  • 39. “Hold on... what’s that getRuntime().exec stuff?!”
  • 40. It’s calling my Python code to plot the CSV file import glob, os, pandas from bokeh.plotting import figure, output_file, show path = r'victims' all_files = glob.glob(os.path.join(path, "*.csv")) df_from_each_file = (pandas.read_csv(f) for f in all_files) df = pandas.concat(df_from_each_file, ignore_index=True) plot = figure(plot_width=640,plot_height=640,title='Accident victims by age', x_axis_label='Age of victim', y_axis_label='How many') plot.title.text_font_size = '16pt' plot.xaxis.axis_label_text_font_size = '16pt' plot.yaxis.axis_label_text_font_size = '16pt' plot.scatter(x=df.age_of_casualty, y=df['count']) output_file('victims.html') show(plot)
  • 41. Bokeh gives us a graph like this
  • 42. “What else can I do?”
  • 43. You’ve got some JSON files...
{"id":"2","name":"Louder Bill","average_rating":"4.1","genre":"ambient"}
{"id":"3","name":"Prey Fury","average_rating":"2","genre":"pop"}
{"id":"4","name":"Unbranded Newsroom","average_rating":"4","genre":"rap"}
{"id":"5","name":"Bugle Infantry","average_rating":"5", "genre": "doom_metal"}
{"id":"1","name":"Into Latch","average_rating":"4.9","genre":"doom_metal"}
val bandsDF = sqlContext.read.json(myHome + "/datasets/bands.json")
bandsDF.createOrReplaceTempView("bands")
import org.apache.spark.sql.functions._
“Best doom metal band please”
sqlContext.sql("SELECT name, average_rating from bands WHERE " + "genre == 'doom_metal'").sort(desc("average_rating")).show(1)
+--------------------+--------------+
|                name|average_rating|
+--------------------+--------------+
|      Bugle Infantry|             5|
+--------------------+--------------+
only showing top 1 row
Randomly generated band names as of May the 18th 2017, zero affiliation on my behalf or IBM’s for any of these names...entirely coincidental if they do exist
  • 44. “Great, but you mentioned some data collected with wearables and machine learning!”
  • 45. Anonymised data gathered from Automatic, Apple Health, Withings, Jawbone Up ● Car journeys ● Sleeping activity (start and end time) ● Daytime activity (calories consumed, steps taken) ● Weight and heart rate ● Several CSV files ● Anonymised by subject gatherer before uploading anywhere! Nothing identifiable
  • 46. Exploring the datasets: driving activity val autoData = sqlContext.read.format("com.databricks.spark.csv"). option("header", "true"). option("inferSchema", "true"). load(myHome + "/datasets/geecon/automatic.csv"). withColumnRenamed("End Location Name", "Location"). withColumnRenamed("End Time", "Time")
  • 47. Checking our data is sensible...
val colsWeCareAbout = Array("Distance (mi)", "Duration (min)", "Fuel Cost (USD)")
for (col <- colsWeCareAbout) { summarise(autoData, col) }
def summarise(df: DataFrame, columnName: String) {
  averageByCol(df, columnName)
  minByCol(df, columnName)
  maxByCol(df, columnName)
}
def averageByCol(df: DataFrame, columnName: String) {
  println("Printing the average " + columnName)
  df.agg(avg(df.col(columnName))).show()
}
def minByCol(df: DataFrame, columnName: String) {
  println("Printing the minimum " + columnName)
  df.agg(min(df.col(columnName))).show()
}
def maxByCol(df: DataFrame, columnName: String) {
  println("Printing the maximum " + columnName)
  df.agg(max(df.col(columnName))).show()
}
Average distance (in miles): 6.88, minimum: 0.01, maximum: 187.03
Average duration (in minutes): 14.87, minimum: 0.2, maximum: 186.92
Average fuel cost (in USD): 0.58, minimum: 0.0, maximum: 14.35
  • 48. Looks OK - what’s the rate of Mr X visiting a certain place? Got a favourite gym day? Slacking on certain days? ● Using Spark to determine chance of the subject being there ● Timestamps (the “Time” column) need to become days of the week instead ● The start of a common theme: data preparation!
  • 49. Explore the data first
autoData.show()
|Vehicle|Start Location Name|Start Time|Location|Time|Distance (mi)|Duration (min)|Fuel Cost (USD)|Average MPG|Fuel Volume (gal)|Hard Accelerations|Hard Brakes|Duration Over 70 mph (secs)|Duration Over 75 mph (secs)|Duration Over 80 mph (secs)|Start Location Accuracy (meters)|End Location Accuracy (meters)|Tags|
... (the sample rows show two very short trips in a 2005 Nissan Sentra, starting and ending at "PokeStop 12" on 4/3/2016) ...
  • 50. Timestamp fun: 4/03/2016 15:06 is no good!
val preparedAutoData = sqlContext.sql(
  "SELECT TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)) as Date, Location, " +
  "date_format(TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)), 'EEEE') as Day FROM auto_data")
preparedAutoData.show()
+----------+-----------+------+
|      Date|   Location|   Day|
+----------+-----------+------+
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|   Michaels|Sunday|
...
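The same preparation can be expressed with the DataFrame API instead of an SQL string; a sketch, assuming the autoData DataFrame and its "Time" and "Location" columns shown above:
import org.apache.spark.sql.functions.{col, date_format, to_date, unix_timestamp}

val preparedViaApi = autoData
  .withColumn("Date", to_date(unix_timestamp(col("Time"), "MM/dd/yyyy").cast("timestamp")))
  .withColumn("Day", date_format(col("Date"), "EEEE"))   // full day name, e.g. Sunday
  .select("Date", "Location", "Day")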
  • 51. def printChanceLocationOnDay( sqlContext: SQLContext, day: String, location: String) { val allDatesAndDaysLogged = sqlContext.sql( "SELECT Date, Day " + "FROM prepared_auto_data " + "WHERE Day = '" + day + "'").distinct() allDatesAndDaysLogged.show() Scala function: give us all of the rows where the day is what we specified +----------+------+ | Date| Day| +----------+------+ |2016-10-17|Monday| |2016-10-24|Monday| |2016-04-25|Monday| |2017-03-27|Monday| |2016-08-15|Monday| ...
  • 52. Rows where the location and day matches our query (passed in as parameters)
val visits = sqlContext.sql(
  "SELECT * FROM prepared_auto_data " +
  "WHERE Location = '" + location + "' AND Day = '" + day + "'")
visits.show()
+----------+--------+------+
|      Date|Location|   Day|
+----------+--------+------+
|2016-04-04|     Gym|Monday|
|2016-11-14|     Gym|Monday|
|2017-01-09|     Gym|Monday|
|2017-02-06|     Gym|Monday|
...
var rate = Math.floor(
  (Double.valueOf(visits.count()) /
   Double.valueOf(allDatesAndDaysLogged.count())) * 100)
println(rate + "% rate of being at the location '" + location + "' on " + day +
  ", activity logged for " + allDatesAndDaysLogged.count() + " " + day + "s")
  • 53. ● 7% rate of being at the location 'Gym' on Monday, activity logged for 51 Mondays ● 1% rate of being at the location 'Gym' on Tuesday, activity logged for 51 Tuesdays ● 2% rate of being at the location 'Gym' on Wednesday, activity logged for 49 Wednesdays ● 6% rate of being at the location 'Gym' on Thursday, activity logged for 47 Thursdays ● 7% rate of being at the location 'Gym' on Saturday, activity logged for 41 Saturdays ● 9% rate of being at the location 'Gym' on Sunday, activity logged for 41 Sundays
val days = Array("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
for (day <- days) { printChanceLocationOnDay(sqlContext, day, "Gym") }
  • 54. Which feature(s) are closely related to another - e.g. the time spent asleep? Dataset has these features from Jawbone ● s_duration (the sleep time as well...) ● m_active_time ● m_calories ● m_distance ● m_steps ● m_total_calories ● n_bedtime (hmm) ● n_awake_time How about correlations?
  • 55. Very strong positive correlation for n_bedtime and s_asleep_time
Correlation between goal_body_weight and s_asleep time: -0.02
val shouldBeLow = sleepData.stat.corr("goal_body_weight", "s_duration")
println("Correlation between goal body weight and sleep duration: " + shouldBeLow)
val compareToCol = "s_duration"
for (col <- sleepData.columns) {
  if (!col.equals(compareToCol)) { // don’t compare to itself...
    val corr = sleepData.stat.corr(col, compareToCol)
    if (corr > 0.8) {
      println("Very strong positive correlation for " + col + " and " + compareToCol)
    } else if (corr >= 0.5) {
      println("Positive correlation for " + col + " and " + compareToCol)
    }
  }
}
And something we know isn’t related?
  • 56. “...can Spark help me to get a good sleep?”
  • 57. Need to define a good sleep first 8 hours for this test subject If duration is > 8 hours good sleep = true, else false I’m using 1 for true and 0 for false We will label this data soon so remember this Then we’ll determine the most influential features on the value being true or false. This can reveal the interesting stuff!
  • 58. Sanity check first: any good sleeps for Mr X?
val onlyDurations = sleepData.select("s_duration")
val NUM_HOURS = 8
val THRESHOLD = 60 * 60 * NUM_HOURS
// Don't care if the sleep duration wasn't even recorded or it's 0
val onlyRecordedSleeps = onlyDurations.filter($"s_duration" > 0)
val onlyGoodSleeps = onlyDurations.filter($"s_duration" >= THRESHOLD)
println("Found " + onlyRecordedSleeps.count() + " valid recorded " +
  "sleep times and " + onlyGoodSleeps.count() + " were " + NUM_HOURS +
  " or more hours in duration")
Found 538 valid recorded sleep times and 129 were 8 or more hours in duration
  • 59. We will use machine learning: but first... 1) What do we want to find out? Main contributing factors to a good sleep 2) Pick an algorithm 3) Prepare the data 4) Separate into training and test data 5) Build a model with the training data (in parallel using Spark!) 6) Use that model on the test data 7) Evaluate the model 8) Experiment with parameters until reasonably accurate e.g. N iterations
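Steps 3 to 7 map naturally onto Spark ML's Pipeline API; a hedged sketch (not from the original slides) with a hypothetical "labelled" DataFrame and feature columns, just to show the shape of the workflow:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler

// Assumes a DataFrame "labelled" with a numeric "label" column already prepared
val assembler = new VectorAssembler()
  .setInputCols(Array("m_steps", "m_workout_time"))   // hypothetical choice of features
  .setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(assembler, new NaiveBayes()))
val Array(train, test) = labelled.randomSplit(Array(0.7, 0.3))
val model = pipeline.fit(train)                       // step 5: build the model
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(model.transform(test))                    // steps 6 and 7: predict and evaluate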
  • 60. Which algorithms might be of use?
Clustering algorithms such as K-means (unsupervised learning (no labels, cheap)) ● Produce n clusters from data to determine which cluster a new item can be categorised as ● Identify anomalies: transaction fraud, erroneous data
Recommendation algorithms such as Alternating Least Squares ● Movie recommendations on Netflix? ● Recommended purchases on Amazon? ● Similar songs with Spotify? ● Recommended videos on YouTube?
Classification algorithms such as Logistic regression and Naive Bayes ● Create a model that we can use to predict where to plot the next item in a sequence (above or below our line of best fit) ● Healthcare: predict adverse drug reactions based on known interactions with similar drugs ● Spam filter (binomial classification)
  • 61. What does “Naive Bayes” have to do with my sleep quality? Using evidence provided, guess what a label will be (1 or 0) for us: easy to use with some training data 0 = the label (category 0 or 1) e.g. 0 = low scoring athlete, 1 = high scoring 1:x = the score for a sporting event 1 2:x = the score for a sporting event 2 3:x = the score for a sporting event 3 bayes_data.txt (libSVM format)
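The file itself is plain text in label index:value form; a few made-up lines just to illustrate the layout the slide describes (the real bayes_data.txt values are not reproduced here):
0 1:12.0 2:31.0 3:24.0
1 1:89.0 2:95.0 3:77.0
0 1:20.0 2:15.0 3:33.0
1 1:91.0 2:88.0 3:84.0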
  • 62. Read it in, split it, fit it, transform and evaluate – all on one slide with Spark!
val bayesData = sqlContext.read.format("libsvm").load("bayes_data.txt")
val Array(trainingData, testData) = bayesData.randomSplit(Array(0.7, 0.3))
val model = new NaiveBayes().fit(trainingData)
val predictions = model.transform(testData)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test set accuracy = " + accuracy)
Test set accuracy = 0.82
https://spark.apache.org/docs/2.1.0/mllib-naive-bayes.html
Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass to the training data, it computes the conditional probability distribution of each feature given label, and then it applies Bayes’ theorem to compute the conditional probability distribution of label given an observation and use it for prediction.
  • 63. Naive Bayes correctly classifies the data (giving it the right labels) Feed some new data in for the model...
  • 64. “Can I just use Naive Bayes on all of the sleep data?”
  • 65. 1) didn’t label each row in the DataFrame yet 2) Naive Bayes can’t handle our data in the current form 3) too many useless features
  • 66. Possibilities – bear in mind that DataFrames are immutable, can't modify elements directly...
1) Spark has a .map function, how about that? “map is a transformation that passes each dataset element through a function and returns a new RDD representing the results” - http://spark.apache.org/docs/latest/programming-guide.html ● Removes all other columns in my case... (new DataFrame with just the labels!)
2) Running a user defined function on each row? ● Maybe, but can Spark’s internal SQL optimiser “Catalyst” see and optimise it? Probably slow
Labelling each row according to our “good sleep” criteria
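For completeness, a sketch of what option 2 could look like (a user defined function, assuming the THRESHOLD value from earlier and a numeric s_duration column); the when/otherwise approach on the next slide keeps the logic visible to Catalyst, so it is generally the better choice:
import org.apache.spark.sql.functions.{col, udf}

// 1 = good sleep, 0 = not, mirroring the labelling rule defined earlier
val goodSleep = udf((duration: Double) => if (duration > THRESHOLD) 1 else 0)
val labelledViaUdf = sleepData.withColumn("good_sleep", goodSleep(col("s_duration").cast("double")))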
  • 67. Preparing the labels Preparing the features is easier val labelledSleepData = sleepData. withColumn("s_duration", when(col("s_duration") > THRESHOLD, 1). otherwise(0)) val assembler = new VectorAssembler() .setInputCols(sleepData.columns) .setOutputCol("features") val preparedData = assembler.transform(labelledSleepData). withColumnRenamed("s_duration", "good_sleep") “If duration is > 8 hours good sleep = true, else false I’m using 1 for true and 0 for false”
  • 69. 1) didn’t label each row in the DataFrame yet 2) Naive Bayes can’t handle our data in the current form 3) too many useless features
  • 70. Trying to fit a model to the DataFrame now leads to...
  • 71. s_asleep_time and n_bedtime (integers) API docs: “Time user fell asleep. Seconds to/from midnight. If negative, subtract from midnight. If positive, add to midnight” Solution in this example? Change to positives only Add the number of seconds in a day to whatever s_asleep_time's value is. Think it through properly when you try this if you’re done experimenting and want something reliable to use! The problem...
  • 72. New DataFrame where negative values are handled
val preparedSleepAsLabel = preparedData.withColumnRenamed("good_sleep", "label")
val secondsInDay = 24 * 60 * 60
val toModel = preparedSleepAsLabel.
  withColumn("s_asleep_time", (col("s_asleep_time")) + secondsInDay).
  withColumn("s_bedtime", (col("s_bedtime")) + secondsInDay)
toModel.createOrReplaceTempView("to_model_table")
  • 73. 1) didn’t label each row in the DataFrame yet 2) Naive Bayes can’t handle our data in the current form 3) too many useless features
  • 74. Reducing your “feature space” Spark’s ChiSqSelector algorithm will work here We want labels and features to inspect
  • 75. val selector = new ChiSqSelector() .setNumTopFeatures(10) .setFeaturesCol("features") .setLabelCol("good_sleep") .setOutputCol("selected_features") val model = selector.fit(preparedData) val topFeatureIndexes = model.selectedFeatures for (i <- 1 to topFeatureIndexes.length - 1) { // Get col names based on feature indexes println(preparedData.columns(topFeatureIndexes(i))) } Using ChiSq selector to get the top features Feature selection tries to identify relevant features for use in model construction. It reduces the size of the feature space, which can improve both speed and statistical learning behavior. ChiSqSelector implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose. It supports three selection methods: numTopFeatures, percentile, fpr: numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html#chisqselector
  • 76. Transform values into a “features” column and only select columns we identified as influential
Earlier we did... toModel.createOrReplaceTempView("to_model_table")
val onlyInterestingColumns = sqlContext.sql("SELECT label, " + colNames.toString() + " FROM to_model_table")
val theAssembler = new VectorAssembler()
  .setInputCols(onlyInterestingColumns.columns)
  .setOutputCol("features")
val thePreparedData = theAssembler.transform(onlyInterestingColumns)
  • 77. Top ten influential features (most to least influential) – feature / description from Jawbone API docs:
s_count – Number of primary sleep entries logged
s_awake_time – Time the user woke up
s_quality – Proprietary formula, don't know
s_asleep_time – Time when the user fell asleep
s_bedtime – Seconds the device is in sleep mode
s_deep – Seconds of main “sound sleep”
s_light – Seconds of “light sleeps” during the sleep period
m_workout_time – Length of logged workouts in seconds
n_light – Seconds of light sleep during the nap
n_sound – Seconds of sound sleep during the nap
  • 78. 1) didn’t label each row in the DataFrame yet 2) Naive Bayes can’t handle our data in the current form 3) too many useless features
  • 79. And after all that...we can generate predictions!
val Array(trainingSleepData, testSleepData) = thePreparedData.randomSplit(Array(0.7, 0.3))
val sleepModel = new NaiveBayes().fit(trainingSleepData)
val predictions = sleepModel.transform(testSleepData)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test set accuracy for labelled sleep data = " + accuracy)
Test set accuracy for labelled sleep data = 0.81
...
  • 80. Testing it with new data val somethingNew = sqlContext.createDataFrame(Seq( // Good sleep: high workout time, achieved a good amount of deep sleep, went to bed after midnight and woke at almost noon! (0, Vectors.dense(0, 1, 42600, 100, 87659, 85436, 16138, 22142, 4073, 0)), // Bad sleep, woke up early (5 AM), didn't get much of a deep sleep, didn't workout, bedtime 10.20 PM (0, Vectors.dense(0, 0, 18925, 0, 80383, 80083, 6653, 17568, 0, 0)) )).toDF("label","features") sleepModel.transform(somethingNew).show()
  • 81. Sensible model created with outcomes we’d expect Go to bed earlier, exercise more I could have looked closer into removing the s_ variables so they’re all m_ and diet information; exercise for the reader Algorithms are producing these outcomes without domain specific knowledge
  • 82. Last example: “does weighing more result in a higher heart rate?” Will get the average of all the heart rates logged on a day when weight was measured Lower heart rate day = weight was more? Higher rate day = weight was less? Maybe MLlib again? But all that preparation work... How deeply involved with Spark do we usually need to get?
  • 83. More data preparation needed, but there’s a twist Here I use data from two tables: weights, activities Times are removed as we only care about dates, so the weights table becomes
+----------+------+
|      Date|weight|
+----------+------+
|2017-04-09| 220.4|
|2017-04-08| 219.9|
|2017-04-07| 221.0|
+----------+------+
only showing top 3 rows
  • 84. Include only heart beat readings when we have weight(s) measured: join on date used
+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.3|                  77.0|
|2017-02-09| 215.9|                  97.0|
|2017-02-09| 215.9|                 104.0|
|2017-02-09| 215.9|                  88.0|
+----------+------+----------------------+
...
  • 85. Average the rate and weight readings by day
+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.7|                  77.0|
+----------+------+----------------------+
...
Should become this:
+----------+----------+--------------------------+
|      Date|avg weight|avg_heart_beats_per_minute|
+----------+----------+--------------------------+
|2017-02-13|     220.5|                        78|
+----------+----------+--------------------------+
...
  • 86. DataFrame now looks like this... Something we can quickly plot!
+----------+---------------------------+------------------+
|      Date|avg(heart_beats_per_minute)|       avg(weight)|
+----------+---------------------------+------------------+
|2016-04-25|                  85.933...|         196.46...|
|2017-01-06|                 93.8125...|             216.0|
|2016-05-03|                  83.647...|         198.35...|
|2016-07-26|                  84.411...|         192.69...|
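The join and per-day averaging described over the last few slides could look roughly like this; a sketch, assuming weights and activities DataFrames with the column names shown in the tables above:
import org.apache.spark.sql.functions.avg

val joined = activities.select("Date", "heart_beats_per_minute")
  .join(weights.select("Date", "weight"), "Date")        // keep only dates with both readings
val dailyAverages = joined.groupBy("Date")
  .agg(avg("heart_beats_per_minute"), avg("weight"))     // one averaged row per day
dailyAverages.coalesce(1).write.option("header", "true").format("csv").save("weight_vs_heart_rate")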
  • 87. Bokeh used again, no more analysis required
  • 88. Used the same functions as earlier (groupBy, formatting dates) and also a join. Same plotting with different column names. No distinct correlation identified so moved on Still lots of questions we could answer with Spark using this data ● Any impact on mpg when the driver weighs much less than before? ● Which fuel provider gives me the best mpg? ● Which visited places have a positive effect on subject’s weight?
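As an illustration of how little code those follow-up questions need, here is a hedged sketch of one of them using only the autoData columns already shown ("Location", "Average MPG"); the Fuelly fuel-up data would be queried in the same way once loaded:
import org.apache.spark.sql.functions.{avg, desc}

autoData.groupBy("Location")
  .agg(avg("Average MPG").as("avg_mpg"))   // average MPG per end location
  .sort(desc("avg_mpg"))
  .show(5)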
  • 89. ● Analytics doesn’t need to be complicated: Spark’s good for the heavy lifting ● Sometimes best to just plot as you go – saves plenty of time ● Other harder things to worry about Writing a distributed machine learning algorithm shouldn’t be one of them!
  • 90. “Which tools can I use to answer my questions?” This question becomes easier
  • 91. Infrastructure when you’re ready to scale beyond your laptop ● Setting up a huge HA cluster: a talk on its own ● Who sets up then maintains the machines? Automate it all? ● How many machines do you need? RAM/CPUs? ● Who ensures all software is up to date (CVEs?) ● Access control lists? ● Hosting costs/providers? ● Reliability, fault tolerance, backup procedures... Still got to think about...
  • 92. ● Use GPUs to train models faster ● DeepLearning4J? ● Writing your own kernels/C/JNI code (or a Java API like CUDA4J/Aparapi?) ● Use RDMA to reduce network transfer times ● Zero copy: RoCE or InfiniBand? ● Tune the JDK, the OS, the hardware ● Continuously evaluate performance: Spark itself, use ● -Xhealthcenter, your own metrics, various libraries... ● Go tackle something huge – join the alien search ● Combine Spark Streaming with MLlib to gain insights fast ● More informed decision making And if you want to really show off with Spark
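On the Streaming plus MLlib point, a rough sketch (not from the original slides) of scoring incoming data with the sleepModel fitted earlier; the socket source, port and comma separated feature layout are all hypothetical:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkContext, Seconds(10))   // assumes the existing SparkContext
ssc.socketTextStream("localhost", 9999).foreachRDD { batch =>
  if (!batch.isEmpty()) {
    // Each line is assumed to be comma separated feature values, in the order the model expects
    val incoming = batch.map(_.split(",").map(_.toDouble))
      .map(values => Tuple1(Vectors.dense(values)))
    val batchDF = sqlContext.createDataFrame(incoming).toDF("features")
    sleepModel.transform(batchDF).select("features", "prediction").show()
  }
}
ssc.start()
ssc.awaitTermination()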
  • 93. ● Know more about Spark: what it can and can’t do (new project ideas?) ● Know more about machine learning in Spark ● Know that machine learning’s still hard but in different ways Data preparation, handling junk, knowing what to look for Getting the data in the first place Writing the algorithms to be used in Spark? Recap – you should now...
  • 94. ● Built-in Spark functions are aplenty – try and stick to these ● You can plot your results by saving to a csv/json and using your existing favourite plotting libraries easily ● DataFrame (or Datasets) combined with ML = powerful APIs ● Filter your data – decide how to handle nulls! ● Pick and use a suitable ML algorithm ● Plot results Points to take home...
  • 95. Final points to consider... Where would Spark fit in to your systems? A replacement or supplementary? Give it a try with your own data and you might be surprised with the outcome It’s free and open source with a very active community! Contact me directly: aroberts@uk.ibm.com
  • 97. ● Automatic: log into the Automatic Dashboard https://dashboard.automatic.com/, on the bottom right, click export, choose what data you want to export (e.g. All) ● Fuelly: (Obtained Gas Cubby), log into the Fuelly Dashboard http://www.fuelly.com/dashboard, select your vehicle in Your Garage, scroll down to vehicle logs, select Export Fuel-ups or Export Services, select duration of export ● Jawbone: sign into your account at https://jawbone.com/, click on your name on the top right, choose Settings, click on the Accounts tab, scroll down to Download UP Data, choose which year you'd like to download data for How did I access the data to process?
  • 98. ● Withings: log into the Withings Dashboard https://healthmate.withings.com click Measurement table, click the tab corresponding to the data you want to export, click download. You can go here to download all data instead: https://account.withings.com/export/ ● Apple: launch the Health app, navigate to the Health Data tab, select your account in the top right area of your screen, select Export Health Data ● Remember to remove any sensitive personal information before sharing/showing/storing said data elsewhere! I am dealing with “cleansed” datasets with no SPI