Jean-Georges Perrin • @jgperrin
It's painful
how much
data rules the world
All Things Open Meetup
Raleigh Convention Center • Raleigh, NC
September 15th 2021
The opinions expressed in this presentation and on the
following slides are solely those of the presenter and not
necessarily those of The NPD Group. The NPD Group does
not guarantee the accuracy or reliability of the information
provided herein.
Jean-Georges "jgp" Perrin
Software since 1983 >$0 1995
Big Data since 1984 >$0 2006
AI since 1994 >$0 2010
x13
It’s a story
about
data
April 4th, 1980
Air & Space
Source:
NASA
Find & process the
data, not in Excel
Display the data in
a palatable form
Source:
Pexels
Sources:
Bureau of Transportation Statistics: https://www.transtats.bts.gov/TRAFFIC/
+----------+----------------+-----------+-----+
|month |internationalPax|domesticPax|pax |
+----------+----------------+-----------+-----+
|2000-01-01|5394 |41552 |46946|
|2000-02-01|5249 |43724 |48973|
|2000-03-01|6447 |52984 |59431|
|2000-04-01|6062 |50349 |56411|
|2000-05-01|6342 |52320 |58662|
+----------+----------------+-----------+-----+
only showing top 5 rows
root
|-- month: date (nullable = true)
|-- internationalPax: integer (nullable = true)
|-- domesticPax: integer (nullable = true)
|-- pax: integer (nullable = true)
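A minimal sketch (assuming df holds the joined passenger dataframe) of the calls that produce output like the above:

df.show(5, false);   // first 5 rows, values untruncated
df.printSchema();    // the column / type / nullability tree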
दृिष्ट • dṛṣṭi
Open source, React & IBM Carbon-based
data visualization framework
Download at https://jgp.ai/drsti
Apply light data quality
Create a session
Create a schema
SparkSession spark = SparkSession.builder()
.appName("CSV to Dataset")
.master("local[*]")
.getOrCreate();
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("month", DataTypes.DateType, false),
DataTypes.createStructField("pax", DataTypes.IntegerType, true) });
Dataset<Row> internationalPaxDf = spark.read().format("csv")
.option("header", true)
.option("dateFormat", "MMMM yyyy")
.schema(schema)
.load("data/bts/International USCarrier_Traffic_20210902163435.csv");
internationalPaxDf = internationalPaxDf
.withColumnRenamed("pax", "internationalPax")
.filter(col("month").isNotNull())
.filter(col("internationalPax").isNotNull());
Dataset<Row> domesticPaxDf = spark.read().format("csv")
.option("header", true)
.option("dateFormat", "MMMM yyyy")
.schema(schema)
.load("data/bts/Domestic USCarrier_Traffic_20210902163435.csv");
domesticPaxDf = domesticPaxDf
.withColumnRenamed("pax", "domesticPax")
.filter(col("month").isNotNull())
.filter(col("domesticPax").isNotNull());
/jgperrin/ai.jgp.drsti-spark
Lab #300
Ingest international passengers
Ingest domestic passengers
Dataset<Row> df = internationalPaxDf
.join(domesticPaxDf,
internationalPaxDf.col("month").equalTo(domesticPaxDf.col("month")),
"outer")
.withColumn("pax", expr("internationalPax + domesticPax"))
.drop(domesticPaxDf.col("month"))
.filter(col("month").$less(lit("2020-01-01").cast(DataTypes.DateType)))
.orderBy(col("month"))
.cache();
df = DrstiUtils.setHeader(df, "month", "Month of");
df = DrstiUtils.setHeader(df, "pax", "Passengers");
df = DrstiUtils.setHeader(df, "internationalPax", "International Passengers");
df = DrstiUtils.setHeader(df, "domesticPax", "Domestic Passengers");
DrstiChart d = new DrstiLineChart(df);
d.setTitle("Air passenger traffic per month");
d.setXScale(DrstiK.SCALE_TIME);
d.setXTitle("Period from " + DataframeUtils.min(df, "month") + " to " + DataframeUtils.max(df, "month"));
d.setYTitle("Passengers (000s)");
d.render();
/jgperrin/ai.jgp.drsti-spark
Lab #300
All my data processing
Add meta data directly to the dataframe
Configure dṛṣṭi directly on the server
Aren’t you glad we
are using Java?
[Diagram: from the classic stack (apps on an OS on hardware), to apps on a distributed OS spanning many OS/hardware nodes, to an analytics OS layered on top of the distributed OS]
An analytics operating system?
[Diagram: domestic and international passengers (CSV) are ingested into dataframes, combined through an outer join into a passengers dataframe, enhanced, then exported as enhanced data (CSV) plus visualization metadata (JSON) feeding the dṛṣṭi visualization. Server-side processing runs in Spark, followed by transfer, then visualization]
Applying to our air traffic app
[Diagram: Apache Spark and its libraries: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph)]
[Diagram: your application sits on the unified API (Spark SQL, Spark Streaming, Spark MLlib for machine learning & artificial intelligence, Spark GraphX), which Spark distributes over nodes 1..8, each with its own OS and hardware]
[Diagram: the same stack seen logically: your application manipulates a dataframe through the unified API, and the dataframe is distributed across the nodes]
[Diagram: the dataframe at the center of Spark SQL, Spark Streaming, Spark MLlib (machine learning & artificial intelligence), and Spark GraphX]
Source:
Pexels
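One way to picture the unified API (an illustration, not from the deck): the same dataframe can be driven through Spark SQL or through the functional API, and both run through the same optimizer.

df.createOrReplaceTempView("traffic");
Dataset<Row> viaSql = spark.sql(
    "SELECT month, pax FROM traffic WHERE pax > 50000");
Dataset<Row> viaApi = df
    .select("month", "pax")
    .filter(col("pax").gt(50000));
// viaSql and viaApi describe the same computation over the same dataframe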
SparkSession spark = SparkSession.builder()
.appName("CSV to Dataset")
.master("local[*]")
.getOrCreate();
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("month", DataTypes.DateType, false),
DataTypes.createStructField("pax", DataTypes.IntegerType, true) });
Dataset<Row> internationalPaxDf = spark.read().format("csv")
.option("header", true)
.option("dateFormat", "MMMM yyyy")
.schema(schema)
.load("data/bts/International USCarrier_Traffic_20210902163435.csv");
internationalPaxDf = internationalPaxDf
.withColumnRenamed("pax", "internationalPax")
.filter(col("month").isNotNull())
.filter(col("internationalPax").isNotNull());
Dataset<Row> domesticPaxDf = spark.read().format("csv")
.option("header", true)
.option("dateFormat", "MMMM yyyy")
.schema(schema)
.load("data/bts/Domestic USCarrier_Traffic_20210902163435.csv");
domesticPaxDf = domesticPaxDf
.withColumnRenamed("pax", "domesticPax")
.filter(col("month").isNotNull())
.filter(col("domesticPax").isNotNull());
/jgperrin/ai.jgp.drsti-spark
Lab #310
Dataset<Row> df = internationalPaxDf
.join(domesticPaxDf,
internationalPaxDf.col("month")
.equalTo(domesticPaxDf.col("month")),
"outer")
.withColumn("pax", expr("internationalPax + domesticPax"))
.drop(domesticPaxDf.col("month"))
.filter(
col("month").$less(lit("2020-01-01").cast(DataTypes.DateType)))
.orderBy(col("month"))
.cache();
Dataset<Row> dfQuarter = df
.withColumn("year", year(col("month")))
.withColumn("q", ceil(month(col("month")).$div(3)))
.withColumn("period", concat(col("year"), lit("-Q"), col("q")))
.groupBy(col("period"))
.agg(sum("pax").as("pax"),
sum("internationalPax").as("internationalPax"),
sum("domesticPax").as("domesticPax"))
.drop("year")
.drop("q")
.orderBy(col("period"));
/jgperrin/ai.jgp.drsti-spark
Lab #310
New code for quarter
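For example, May 2000 is month 5, so q = ceil(5 / 3) = 2 and the period becomes "2000-Q2".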
SparkSession spark = SparkSession.builder()
.appName("CSV to Dataset")
.master("local[*]")
.getOrCreate();
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("month", DataTypes.DateType, false),
DataTypes.createStructField("pax", DataTypes.IntegerType, true) });
Dataset<Row> internationalPaxDf = spark.read().format("csv")
.option("header", true)
.option("dateFormat", "MMMM yyyy")
.schema(schema)
.load("data/bts/International USCarrier_Traffic_20210902163435.csv");
internationalPaxDf = internationalPaxDf
.withColumnRenamed("pax", "internationalPax")
.filter(col("month").isNotNull())
.filter(col("internationalPax").isNotNull());
Dataset<Row> domesticPaxDf = spark.read().format("csv")
.option("header", true)
.option("dateFormat", "MMMM yyyy")
.schema(schema)
.load("data/bts/Domestic USCarrier_Traffic_20210902163435.csv");
domesticPaxDf = domesticPaxDf
.withColumnRenamed("pax", "domesticPax")
.filter(col("month").isNotNull())
.filter(col("domesticPax").isNotNull());
/jgperrin/ai.jgp.drsti-spark
Lab #320
Dataset<Row> df = internationalPaxDf
.join(domesticPaxDf,
internationalPaxDf.col("month")
.equalTo(domesticPaxDf.col("month")),
"outer")
.withColumn("pax", expr("internationalPax + domesticPax"))
.drop(domesticPaxDf.col("month"))
.filter(
col("month").$less(lit("2020-01-01").cast(DataTypes.DateType)))
.orderBy(col("month"))
.cache();
Dataset<Row> dfYear = df
.withColumn("year", year(col("month")))
.groupBy(col("year"))
.agg(sum("pax").as("pax"),
sum("internationalPax").as("internationalPax"),
sum("domesticPax").as("domesticPax"))
.orderBy(col("year"));
/jgperrin/ai.jgp.drsti-spark
Lab #320
New code for year
A (Big) Data Scenario
Building a pipeline
[Diagram: Ingestion lands Raw Data in the Bronze zone; applying Data Quality rules yields Pure Data in the Silver zone; Transformation produces Rich Data in the Gold zone; Publication delivers Actionable Data. The persisted zones act as a "cache"]
// Combining datasets
// ...
df.write()
.format("delta")
.mode("overwrite")
.save("./data/tmp/airtrafficmonth");
/jgperrin/ai.jgp.drsti-spark
Lab #400
Saving to Delta Lake
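The "delta" format only resolves if Delta Lake is on the classpath. A minimal session sketch, assuming Delta Lake 1.x on Spark 3.x (the lab's actual build may be configured differently):

SparkSession spark = SparkSession.builder()
    .appName("Delta Lake I/O")
    .master("local[*]")
    // Delta's SQL extension and catalog, required by Delta Lake 1.x
    .config("spark.sql.extensions",
        "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate();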
Dataset<Row> df = spark.read().format("delta")
.load("./data/tmp/airtrafficmonth")
.orderBy(col("month"));
Dataset<Row> dfYear = df
.withColumn("year", year(col("month")))
.groupBy(col("year"))
.agg(sum("pax").as("pax"),
...
/jgperrin/ai.jgp.drsti-spark
Lab #430
Reading from Delta Lake
Can we project future traffic?
Source:
Comedy Central
Do you remember January 2020?
And March?
Source:
Pexels
• Make a model for 2000-2019
• See the projection
• Use 2020-2021 data & imputation for
the rest of 2021
• See the projection
What now?
Source:
Pexels
[Chart: yearly passenger counts; pax is the label, year is the feature]
Use my model
Split training & test data
String[] inputCols = { "year" };
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
df = assembler.transform(df);
LinearRegression lr = new LinearRegression()
.setMaxIter(10)
.setRegParam(0.5)
.setElasticNetParam(0.8)
.setLabelCol("pax");
int threshold = 2019;
Dataset<Row> trainingData = df.filter(col("year").$less$eq(threshold));
Dataset<Row> testData = df.filter(col("year").$greater(threshold));
LinearRegressionModel model = lr.fit(trainingData);
Integer[] l = new Integer[] { 2020, 2021, 2022, 2023, 2024, 2025, 2026 };
List<Integer> data = Arrays.asList(l);
Dataset<Row> futuresDf = spark.createDataset(data, Encoders.INT()).toDF().withColumnRenamed("value", "year");
assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
futuresDf = assembler.transform(futuresDf);
df = df.unionByName(futuresDf, true);
df = model.transform(df);
Features are a vector - let’s build one
Build a linear regression
Building my model
/jgperrin/ai.jgp.drsti-spark
Lab #500
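The lab carves out testData but never scores it. A hedged follow-up (not in the deck): evaluate the fit on the held-out years with MLlib's RegressionEvaluator.

// import org.apache.spark.ml.evaluation.RegressionEvaluator;
RegressionEvaluator evaluator = new RegressionEvaluator()
    .setLabelCol("pax")
    .setPredictionCol("prediction")
    .setMetricName("rmse");
double rmse = evaluator.evaluate(model.transform(testData));
System.out.println("RMSE on held-out years: " + rmse);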
Something happened in 2020…
Source:
Pexels
[Chart: real data for 2021 with imputation, the model trained on 2000-2019, and the model trained on 2000-2021; pax is the label, year the feature]
Connect now to dṛṣṭi
http://172.25.177.2:3000
Dataset<Row> df2021 = df.filter(expr(
"month >= TO_DATE('2021-01-01') and month <= TO_DATE('2021-12-31')"));
int monthCount = (int) df2021.count();
df2021 = df2021
.agg(sum("pax").as("pax"),
sum("internationalPax").as("internationalPax"),
sum("domesticPax").as("domesticPax"));
int pax = DataframeUtils.maxAsInt(df2021, "pax") / (12 - monthCount);
int intPax = DataframeUtils.maxAsInt(df2021, "internationalPax") / (12 - monthCount);
int domPax = DataframeUtils.maxAsInt(df2021, "domesticPax") / (12 - monthCount);
List<String> data = new ArrayList<>();
for (int i = monthCount + 1; i <= 12; i++) {
data.add("2021-" + i + "-01");
}
Dataset<Row> dfImputation2021 = spark
.createDataset(data, Encoders.STRING()).toDF()
.withColumn("month", col("value").cast(DataTypes.DateType))
.withColumn("pax", lit(pax))
.withColumn("internationalPax", lit(intPax))
.withColumn("domesticPax", lit(domPax))
.drop("value");
Extract 2021 data
/jgperrin/ai.jgp.drsti-spark
Lab #600
Calculate imputation data
Create a new dataframe from scratch with the
additional data
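Presumably (an assumption, the glue code is elided in the deck) the imputed rows are then appended to the real monthly data before re-aggregating and retraining:

// Same columns on both sides, so a plain union by name works
df = df.unionByName(dfImputation2021);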
LinearRegression lr = new LinearRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8).setLabelCol("pax");
LinearRegressionModel model2019 = lr.fit(df.filter(col("year").$less$eq(2019)));
df = model2019
.transform(df)
.withColumnRenamed("prediction", "prediction2019");
LinearRegressionModel model2021 = lr.fit(df.filter(col("year").$less$eq(2021)));
df = model2021
.transform(df)
.withColumnRenamed("prediction", "prediction2021");
Pretty much the same code as lab #500,
except for the renamed prediction columns
/jgperrin/ai.jgp.drsti-spark
Lab #610
Reusing the same linear regression trainer
for both models,
but each fitted model is different!
[Diagram: step 1 (learning phase): a trainer fits a model on dataset #1; steps 2..n (predictive phase): that same model transforms dataset #2 into predicted data]
It's all about the base model
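A consequence of the diagram: the base model is trained once and reused. A minimal sketch (hypothetical path and newData, not from the deck) of persisting it between the learning and predictive phases:

// Step 1: learn once, then save the fitted model
model.write().overwrite().save("./data/tmp/paxModel");
// Steps 2..n: reload the model and predict on new data
LinearRegressionModel reloaded =
    LinearRegressionModel.load("./data/tmp/paxModel");
Dataset<Row> predictedData = reloaded.transform(newData);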
Scientist & Engineer
There are two kinds of
data scientists:
1) Those who can
extrapolate from
incomplete data.
DATA Engineer
• Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
• Match architecture with business needs.
• Develop processes for data modeling, mining, and pipelines.
• Improve data reliability and quality.
Both: clean, massage, and organize data.
DATA Scientist
• Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepare data for predictive models.
• Explore data to find hidden gems and patterns.
• Tell stories to key stakeholders.
Source:
Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
[Diagram: the same data engineer vs. data scientist split, applied to tools, with SQL in the overlap]
Source:
Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
IBM Watson Studio
Conclusion
Call to action
• We always need more data
• Air Traffic @ https://github.com/jgperrin/ai.jgp.drsti-spark
• COVID-19 @ https://github.com/jgperrin/net.jgp.books.spark.ch99
• Go try & contribute to dṛṣṭi at http://jgp.ai/drsti
• Follow me on Twitter @jgperrin & YouTube /jgperrin
Key takeaways
• Spark is very fun & powerful for any data application:
• Data engineering
• Data science
• New vocabulary & concepts around Apache Spark: dataframe, analytics
operating system
• Machine learning & AI work better with Big Data
• Data is fluid (and it’s really painful)
Thank you! http://jgp.ai/sia
See you next month
for All Things Open!