The road to AI is paved with pragmatic intentions

Jean Georges "JG" Perrin
Senior Enterprise Architect | Lifetime IBM Champion at The NPD Group
August 22, 2018
CONFIDENTIAL © 2018
JGP • Jean Georges Perrin
• @jgperrin
• Chapel Hill, NC
• I 🏗 SW • Since 1983
• #Knowledge = 𝑓(∑(#SmallData, #BigData), #DataScience) & #Software
• #IBMChampion x10 • #KeepLearning
• @ http://jgp.net
Who art thou?
• Experience with Spark?
• Who is familiar with Data Quality?
• Who has already implemented Data Quality in Spark?
• Who is expecting to be an AI guru after this session?
• Who just came for the free food?
• Who is expecting to provide better insights faster after this session?
Agenda
• What is Spark?
• What can I do with Spark?
• What is a Spark app, anyway?
• What's AI?
• Why is Spark a great environment for AI?
• Meet Cactar, the Mongolian warlord of data quality
• Why does data quality matter?
• Your first AI app with Spark
• And finally, a little surprise…
Analytics operating system

An analytics operating system?
[Diagram: a single machine stacks Hardware, OS, and Apps; a cluster stacks per-node Hardware and OS under a Distributed OS, then an Analytics OS, with the Apps on top.]
Spark use cases
• NCEatery.com
• Restaurant analytics
• 1.57×10²¹ datapoints analyzed (that’s about one zetta datapoints)
• (@ Lumeris)
• General compute
• Distributed data transfer
• IBM
• DSX (Data Science Experience)
• Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
• CERN
• Analysis of the science experiments in the LHC - Large Hadron Collider
Apache Spark
[Diagram: Apache Spark's components — Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph).]
[Diagram: a Spark cluster — five nodes, each with its own Hardware and OS, under a Unified API exposing Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), and GraphX to Your Application.]
[Diagram: the same five-node cluster seen from the application — Your Application works with a DataFrame through the Unified API (Spark SQL, Spark Streaming, Machine Learning & Deep Learning, GraphX).]
[Diagram: the DataFrame as the common abstraction under Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), and GraphX.]
http://bit.ly/spark-clego
What’s #AI?
Popular beliefs
• Robot with human-like behavior
• HAL from 2001
• Isaac Asimov
• Potential ethical problems
• Lots of mathematics
• Heavy calculations
• Algorithms
• Self-driving cars
Current state of the art: General AI vs. Narrow AI
In 2018…
I am an expert in general AI
ARTIFICIAL INTELLIGENCE is Machine Learning
Machine learning
• Common algorithms
• Linear and logistic regressions
• Classification and regression trees
• K-nearest neighbors (KNN)
• Deep learning
• Subset of ML
• Artificial neural networks (ANNs)
• Super CPU-intensive, hence the use of GPUs
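For instance, the K-nearest-neighbors idea listed above fits in a few lines of plain Java. This is a toy sketch on hypothetical data (the guest counts and "small"/"large" labels are invented for illustration, not from the deck):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Toy 1-dimensional KNN classifier: classify a point by majority vote
// among the labels of its k closest training examples.
public class KnnSketch {
    static String classify(double[] xs, String[] labels, double x, int k) {
        Integer[] idx = new Integer[xs.length];
        for (int i = 0; i < xs.length; i++) idx[i] = i;
        // Sort training points by distance to the query point x.
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(xs[i] - x)));
        // Count the labels of the k nearest points.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++)
            votes.merge(labels[idx[i]], 1, Integer::sum);
        // Return the majority label.
        return Collections.max(votes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Hypothetical dinner parties, labeled by size.
        double[] guests = {1, 2, 3, 4, 20, 24, 28, 31};
        String[] size   = {"small", "small", "small", "small",
                           "large", "large", "large", "large"};
        System.out.println(classify(guests, size, 5, 3));   // near the small cluster
        System.out.println(classify(guests, size, 25, 3));  // near the large cluster
    }
}
```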
There are two kinds of data scientists:
1) Those who can extrapolate from incomplete data.

-The Internet
and my personal dedication to Sam Christie
DATA Engineer
• Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
• Match architecture with business needs.
• Develop processes for data modeling, mining, and pipelines.
• Improve data reliability and quality.

DATA Scientist
• Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepare data for predictive models.
• Explore data to find hidden gems and patterns.
• Tell stories to key stakeholders.

Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
[Diagram: the DATA Engineer and DATA Scientist skill sets overlap at SQL.]
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
Programming languages
[Chart: RedMonk programming language rankings, Jan-14 through Jan-18 — Scala, Java, Python, R.]
[Chart: IEEE Spectrum top programming languages, 2014–2018 — Scala, Java, Python, R, SQL.]
xkcd
As goes the old adage:
Garbage in, garbage out
If Everything Was As Simple…
[Chart: dinner revenue per number of guests.]
…as a Visual Representation
[Chart: the same data, with Anomaly #1 and Anomaly #2 highlighted.]

I Love It When a Plan Comes Together
Data is like a box of chocolates, you never know what you're gonna get.
- Jean "Gump" Perrin, June 2017
Data from everywhere
• Databases: RDBMS, NoSQL
• Files: CSV, XML, JSON, Excel, Photos, Video
• Machines: Services, REST, Streaming, IoT…
CACTAR is not a Mongolian warlord
What is data quality?
Attributes (CACTAR):
• Consistency,
• Accuracy,
• Completeness,
• Timeliness,
• Accessibility,
• Reliability.
To allow:
• Operations,
• Decision making,
• Planning,
• Machine Learning,
• Artificial Intelligence.
Ouch, bad data story
Hubble was blind because of a 2.2 µm error.
Legend says it's a metric-to-imperial/US conversion.
Now it hurts
• Challenger exploded in 1986 because of a defective O-ring
• Root causes were:
• Invalid data for the O-ring parts
• Lack of data lineage & reporting
#1 Scripts and the like
• Source data are cleaned by scripts (shell, Python, Java app…)
• I/O intensive
• Storage-space intensive
• No parallelization
#2 Use Spark SQL
• All in memory!
• But limited to SQL and built-in functions
#3 Use UDFs
• User-Defined Functions
• SQL can be extended with UDFs
• UDFs benefit from the cluster architecture and distributed processing
Enough!
Let me code!
SPRU your UDF
• Service: build your code; it might already exist!
• Plumbing: connect your existing business logic to Spark via a UDF
• Register: the UDF in Spark
• Use: the UDF is available in Spark SQL and via callUDF()
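The Service step itself is not shown (the deck's samples start at #1.2), but the plumbing slide tells us it exposes MinimumPriceDataQualityService.checkMinimumPrice, returning the price when it is OK and -1 otherwise. A minimal sketch, assuming a hypothetical minimum price of 20 — the output tables only tell us the real threshold lies somewhere between 15 and 23:

```java
// Hypothetical sketch of the Service step (sample #1.1 is not in the deck).
// Plain business logic, no Spark dependency — which is the point of SPRU:
// the rule can exist, and be unit-tested, outside Spark.
public class MinimumPriceDataQualityService {
    // Assumed threshold for illustration; the exact value is not stated.
    private static final double MINIMUM_PRICE = 20.0;

    // If price is ok, return price; if price is ko, return -1.
    public static double checkMinimumPrice(double price) {
        return price >= MINIMUM_PRICE ? price : -1.0;
    }
}
```

In the repo this class would live in net.jgp.labs.sparkdq4ml.dq.service, matching the import in code sample #1.2.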
Code sample #1.2 - plumbing

package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF1;
import net.jgp.labs.sparkdq4ml.dq.service.*;

public class MinimumPriceDataQualityUdf
    implements UDF1<Double, Double> {
  public Double call(Double price) throws Exception {
    return MinimumPriceDataQualityService.checkMinimumPrice(price);
  }
}

If the price is ok, returns the price; if the price is ko, returns -1.
/jgperrin/net.jgp.labs.sparkdq4ml
Code sample #1.3 - register

SparkSession spark = SparkSession
    .builder().appName("DQ4ML").master("local").getOrCreate();
spark.udf().register(
    "minimumPriceRule",
    new MinimumPriceDataQualityUdf(),
    DataTypes.DoubleType);

/jgperrin/net.jgp.labs.sparkdq4ml
Code sample #1.4 - use

String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

Using CSV, but it could be Hive, JDBC, you name it…
/jgperrin/net.jgp.labs.sparkdq4ml
Dataset with anomalies
+-----+-----+
|guest|price|
+-----+-----+
|    1|23.24|
|    2|30.89|
|    2|33.74|
|    3|34.89|
|    3|29.91|
|    3| 38.0|
|    4| 40.0|
|    5|120.0|
|    6| 50.0|
|    6|112.0|
|    8| 60.0|
|    8|127.0|
|    8|120.0|
|    9|130.0|
+-----+-----+
Highlight anomalies
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
|    1| 23.1|        23.1|
|    2| 30.0|        30.0|
|    2| 33.0|        33.0|
|    3| 34.0|        34.0|
|   24|142.0|       142.0|
|   24|138.0|       138.0|
|   25|  3.0|        -1.0|
|   26| 10.0|        -1.0|
|   25| 15.0|        -1.0|
|   26|  4.0|        -1.0|
|   28| 10.0|        -1.0|
|   28|158.0|       158.0|
|   30|170.0|       170.0|
|   31|180.0|       180.0|
+-----+-----+------------+
Cleansed dataset
+-----+-----+
|guest|price|
+-----+-----+
|    1| 23.1|
|    2| 30.0|
|    2| 33.0|
|    3| 34.0|
|    3| 30.0|
|    4| 40.0|
|   19|110.0|
|   20|120.0|
|   22|131.0|
|   24|142.0|
|   24|138.0|
|   28|158.0|
|   30|170.0|
|   31|180.0|
+-----+-----+
Data can now be used for ML
• Convert/Adapt dataset to Features and Label
• Required for Linear Regression in MLlib
• Needs a column called label of type double
• Needs a column called features of type VectorUDT
Code sample #2 - register & use

spark.udf().register(
    "vectorBuilder",
    new VectorBuilder(),
    new VectorUDT());
df = df.withColumn("label", df.col("price"));
df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));

// ... Lots of complex ML code goes here ...
double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);

/jgperrin/net.jgp.labs.sparkdq4ml
Prediction for 40 guests…
+-----+-----+-----+--------+------------------+
|guest|price|label|features|        prediction|
+-----+-----+-----+--------+------------------+
|    1| 23.1| 23.1|   [1.0]|24.563807596513133|
|    2| 30.0| 30.0|   [2.0]|29.595283312577884|
|    2| 33.0| 33.0|   [2.0]|29.595283312577884|
|    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
|    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
|    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
|    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
|   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
|   16|102.0|102.0|  [16.0]|100.03594333748444|
|   20|120.0|120.0|  [20.0]|120.16184620174346|
|   22|131.0|131.0|  [22.0]|130.22479763387295|
|   24|142.0|142.0|  [24.0]|140.28774906600245|
+-----+-----+-----+--------+------------------+
Prediction for 40.0 guests is 220.79136052303852
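Since the model is a plain linear regression in one variable, the 40-guest figure can be sanity-checked by hand from any two rows of the table: the slope is the gap between consecutive one-guest predictions, and extrapolating from guest = 1 reproduces the printed value. A quick plain-Java sketch, using only numbers copied from the table above:

```java
// Sanity check on the linear model: recover the slope from the predictions
// for 1 and 2 guests, then extrapolate to 40 guests.
public class PredictionCheck {
    public static void main(String[] args) {
        double p1 = 24.563807596513133; // prediction for 1 guest (table above)
        double p2 = 29.595283312577884; // prediction for 2 guests
        double slope = p2 - p1;         // ~5.03 dollars per additional guest
        double p40 = p1 + slope * 39;   // ~220.79, matching the slide
        System.out.println("Prediction for 40.0 guests is " + p40);
    }
}
```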
(the complex ML code)

LinearRegression lr = new LinearRegression()
    .setMaxIter(40)
    .setRegParam(1)
    .setElasticNetParam(1);
LinearRegressionModel model = lr.fit(df);

Double feature = 40.0;
Vector features = Vectors.dense(40.0);
double p = model.predict(features);

Define the algorithm and its (hyper)parameters.
Create a model from our data.
Apply the model to a new dataset: predict.
/jgperrin/net.jgp.labs.sparkdq4ml
Surprise!
Using Spark to analyse the World Cup 2018
ScoreStatisticsApp:
Counting the goals and other statistics from World Cup 2018

Goals by country
+-------+----------+
|country|sum(score)|
+-------+----------+
|Belgium|        16|
| France|        14|
|Croatia|        14|
|England|        12|
| Russia|        11|
+-------+----------+
only showing top 5 rows

/jgperrin/net.jgp.labs.spark.football
Using Spark to analyse all the World Cups
HistoricScoreStatisticsApp:
Analyzing the attendance for all World Cups and other stats

Attendance per year
+----+----------+
|Year|Attendance|
+----+----------+
|1930|   1181098|
|1934|    726000|
|1938|    751400|
|1950|   2090492|
|1954|   1537214|
|1958|   1639620|

/jgperrin/net.jgp.labs.spark.football
Demo
Using Spark to predict the next World Cup winner
• Analysis of World Cup soccer tournaments from 1930 to 2018
• Numerous datasets from Kaggle and data.world
• Potentially billions of data points
• Heavy usage of Java
/jgperrin/net.jgp.labs.spark.football
Conclusion
Key takeaways
• Build your data lake in memory (and disk).
• Store first, ask questions later.
• Sample your data (to spot anomalies and more).
• Build (and reuse!) business rules with the business people.
• Use Spark dataframes, SQL, and UDFs to build a consistent & coherent dataset.
• Rely on UDFs for prepping ML formats.
• Use Java.
Going further
• Contact me @jgperrin
• Hands-on tutorial at All Things Open (October in Raleigh, NC)
• Join the Spark User mailing list
• Get help from Stack Overflow
• fb.com/TriangleSpark
• Buy my book on Spark with Java in MEAP (ok, really shameless plug here)
Going even further
Spark with Java (MEAP)
by Jean Georges Perrin (@jgperrin)
published by Manning
https://www.manning.com/books/spark-with-java
sparkwjava-B108 sparkwithjava
One free book 40% off