An introduction to data engineering & data science using Apache Spark and Java.
Get Spark in Action, 2nd edition, at http://jgp.ai/sia.
In this presentation, I start by loading a few CSV files into Spark (ingestion) and displaying them with the help of a new tool I built, dṛṣṭi.
As you would expect, I then clean the data, join it, transform it, and keep visualizing it through dṛṣṭi.
I use Delta Lake as a cache for my data, explain what imputation is, and show how I use imputation on my datasets to fill in the missing data points.
Finally, I use Spark to build simple linear regressions that predict/forecast data.
dṛṣṭi is open source (Apache 2 license) and is available at https://github.com/jgperrin/ai.jgp.drsti.
All the labs are available at https://github.com/jgperrin/ai.jgp.drsti-spark.
1. Jean-Georges Perrin • @jgperrin
It's painful
how much
data rules the world
All Things Open Meetup
Raleigh Convention Center • Raleigh, NC
September 15th 2021
2. The opinions expressed in this presentation and on the
following slides are solely those of the presenter and not
necessarily those of The NPD Group. The NPD Group does
not guarantee the accuracy or reliability of the information
provided herein.
30. A (Big) Data Scenario: building a pipeline
Ingestion → Bronze zone (Raw Data)
Application of Data Quality rules → Silver zone (Pure Data)
Transformation → Gold zone (Rich Data, the "Cache")
Publication → Actionable Data
32. // Combining datasets
// ...
df.write()
.format("delta")
.mode("overwrite")
.save("./data/tmp/airtrafficmonth");
/jgperrin/ai.jgp.drsti-spark
Lab #400
Saving to Delta Lake
38. Building my model
// Features are a vector - let's build one
String[] inputCols = { "year" };
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
df = assembler.transform(df);
// Build a linear regression
LinearRegression lr = new LinearRegression()
.setMaxIter(10)
.setRegParam(0.5)
.setElasticNetParam(0.8)
.setLabelCol("pax");
// Split training & test data
int threshold = 2019;
Dataset<Row> trainingData = df.filter(col("year").leq(threshold));
Dataset<Row> testData = df.filter(col("year").gt(threshold));
LinearRegressionModel model = lr.fit(trainingData);
// Use my model on the years to forecast
Integer[] l = new Integer[] { 2020, 2021, 2022, 2023, 2024, 2025, 2026 };
List<Integer> data = Arrays.asList(l);
Dataset<Row> futuresDf = spark.createDataset(data, Encoders.INT()).toDF().withColumnRenamed("value", "year");
assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
futuresDf = assembler.transform(futuresDf);
df = df.unionByName(futuresDf, true);
df = model.transform(df);
/jgperrin/ai.jgp.drsti-spark
Lab #500
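With regularization turned off, fitting a linear regression on a single feature such as "year" boils down to ordinary least squares. A minimal, Spark-free sketch of that computation (the class `TinyOls` is a hypothetical helper for illustration, not part of the labs; Spark's LinearRegression additionally applies the regParam/elasticNetParam penalties shown above):

```java
// Toy ordinary least-squares fit for one feature, illustrating what
// LinearRegression computes when regularization is off.
public class TinyOls {
  // Returns {intercept, slope} minimizing the sum of squared errors.
  static double[] fit(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      sx += x[i];
      sy += y[i];
      sxx += x[i] * x[i];
      sxy += x[i] * y[i];
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    return new double[] { intercept, slope };
  }

  // Applies the fitted model to a new value, like model.transform() does per row.
  static double predict(double[] model, double x) {
    return model[0] + model[1] * x;
  }
}
```

Fitting on the years up to the threshold and then predicting beyond it is exactly the `lr.fit(trainingData)` / `model.transform(df)` pattern above, done at scale.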
43. // Extract 2021 data
Dataset<Row> df2021 = df.filter(expr(
"month >= TO_DATE('2021-01-01') and month <= TO_DATE('2021-12-31')"));
int monthCount = (int) df2021.count();
// Calculate imputation data
df2021 = df2021
.agg(sum("pax").as("pax"),
sum("internationalPax").as("internationalPax"),
sum("domesticPax").as("domesticPax"));
int pax = DataframeUtils.maxAsInt(df2021, "pax") / (12 - monthCount);
int intPax = DataframeUtils.maxAsInt(df2021, "internationalPax") / (12 - monthCount);
int domPax = DataframeUtils.maxAsInt(df2021, "domesticPax") / (12 - monthCount);
// Create a new dataframe, from scratch, with the additional data
List<String> data = new ArrayList<>();
for (int i = monthCount + 1; i <= 12; i++) {
data.add("2021-" + i + "-01");
}
Dataset<Row> dfImputation2021 = spark
.createDataset(data, Encoders.STRING()).toDF()
.withColumn("month", col("value").cast(DataTypes.DateType))
.withColumn("pax", lit(pax))
.withColumn("internationalPax", lit(intPax))
.withColumn("domesticPax", lit(domPax))
.drop("value");
/jgperrin/ai.jgp.drsti-spark
Lab #600
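Note that the month strings built by `"2021-" + i + "-01"` are not zero-padded ("2021-8-01"); Spark's cast to DateType tolerates that, but a padded ISO format is safer if the strings are ever parsed elsewhere. A small Spark-free sketch of the generation step (the class `ImputationMonths` is a hypothetical helper for illustration, assuming `monthCount` months of 2021 already have data):

```java
import java.util.ArrayList;
import java.util.List;

public class ImputationMonths {
  // Builds first-of-month ISO date strings for the 2021 months with no data yet.
  static List<String> remainingMonths(int monthCount) {
    List<String> data = new ArrayList<>();
    for (int i = monthCount + 1; i <= 12; i++) {
      // Zero-pad the month so the string is a valid ISO-8601 date everywhere
      data.add(String.format("2021-%02d-01", i));
    }
    return data;
  }
}
```

The resulting list is what `spark.createDataset(data, Encoders.STRING())` turns into the imputation dataframe above.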
44. LinearRegression lr = new LinearRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8).setLabelCol("pax");
LinearRegressionModel model2019 = lr.fit(df.filter(col("year").leq(2019)));
df = model2019
.transform(df)
.withColumnRenamed("prediction", "prediction2019");
LinearRegressionModel model2021 = lr.fit(df.filter(col("year").leq(2021)));
df = model2021
.transform(df)
.withColumnRenamed("prediction", "prediction2021");
Pretty much the same code as lab #500, except for renaming the prediction columns.
The same linear regression is reused to train both models, but the resulting models are different!
/jgperrin/ai.jgp.drsti-spark
Lab #610
45. Same model
Step 1 (learning phase): the trainer fits a model on dataset #1.
Steps 2..n (predictive phase): the same model transforms dataset #2 into predicted data.
It's all about the base model
47. There are two kinds of
data scientists:
1) Those who can
extrapolate from
incomplete data.
48. DATA Engineer
Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
• Match architecture with business needs.
• Develop processes for data modeling, mining, and pipelines.
• Improve data reliability and quality.
DATA Scientist
Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepare data for predictive models.
• Explore data to find hidden gems and patterns.
• Tell stories to key stakeholders.
Source: adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
51. Call for action
• We always need more data
• Air Traffic @ https://github.com/jgperrin/ai.jgp.drsti-spark
• COVID-19 @ https://github.com/jgperrin/net.jgp.books.spark.ch99
• Go try & contribute to dṛṣṭi at http://jgp.ai/drsti
• Follow me on Twitter @jgperrin & YouTube /jgperrin
52. Key takeaways
• Spark is very fun & powerful for any data application:
• Data engineering
• Data science
• New vocabulary & concepts around Apache Spark: dataframe, the analytics operating system
• Machine learning & AI work better with Big Data
• Data is fluid (and that's really painful)