An introduction to data engineering & data science using Apache Spark and Java.
Get Spark in Action, 2nd edition, at http://jgp.ai/sia.
In this presentation, I start by loading a few CSV files into Spark (ingestion) and displaying them with the help of a new tool I built, dṛṣṭi.
As you would expect, I then clean the data, join it, transform it, and keep visualizing it through dṛṣṭi.
I use Delta Lake as a cache for my data, explain what imputation is, and show how I use imputation on my datasets to fill in the missing data points.
Finally, I use Spark to build simple linear regressions that predict/forecast data.
dṛṣṭi is open source (Apache 2 license) and is available at https://github.com/jgperrin/ai.jgp.drsti.
All the labs are available at https://github.com/jgperrin/ai.jgp.drsti-spark.
1. Jean-Georges Perrin • @jgperrin
It's painful
how much
data rules the world
All Things Open Meetup
Raleigh Convention Center • Raleigh, NC
September 15th 2021
2. The opinions expressed in this presentation and on the
following slides are solely those of the presenter and not
necessarily those of The NPD Group. The NPD Group does
not guarantee the accuracy or reliability of the information
provided herein.
30. A (Big) Data Scenario: building a pipeline
Ingestion → Bronze zone (Raw Data)
Application of Data Quality rules → Silver zone (Pure Data)
Transformation → Gold zone (Rich Data, the "Cache")
Publication → Actionable Data
32. // Combining datasets
// ...
df.write()
.format("delta")
.mode("overwrite")
.save("./data/tmp/airtrafficmonth");
/jgperrin/ai.jgp.drsti-spark
Lab #400
Saving to Delta Lake
38. Building my model
// Features are a vector - let's build one
String[] inputCols = { "year" };
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
df = assembler.transform(df);
// Build a linear regression
LinearRegression lr = new LinearRegression()
.setMaxIter(10)
.setRegParam(0.5)
.setElasticNetParam(0.8)
.setLabelCol("pax");
// Split training & test data
int threshold = 2019;
Dataset<Row> trainingData = df.filter(col("year").leq(threshold));
Dataset<Row> testData = df.filter(col("year").gt(threshold));
LinearRegressionModel model = lr.fit(trainingData);
// Use my model on the years to forecast
Integer[] l = new Integer[] { 2020, 2021, 2022, 2023, 2024, 2025, 2026 };
List<Integer> data = Arrays.asList(l);
Dataset<Row> futuresDf = spark.createDataset(data, Encoders.INT()).toDF().withColumnRenamed("value", "year");
assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
futuresDf = assembler.transform(futuresDf);
df = df.unionByName(futuresDf, true);
df = model.transform(df);
/jgperrin/ai.jgp.drsti-spark
Lab #500
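With regularization turned off, fitting a linear regression on a single feature such as "year" boils down to ordinary least squares. A minimal, Spark-free sketch of that computation (the class `TinyOls` is a hypothetical helper for illustration, not part of the labs; Spark's LinearRegression additionally applies the regParam/elasticNetParam penalties shown above):

```java
// Toy ordinary least-squares fit for one feature, illustrating what
// LinearRegression computes when regularization is off.
public class TinyOls {
  // Returns {intercept, slope} minimizing the sum of squared errors.
  static double[] fit(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      sx += x[i];
      sy += y[i];
      sxx += x[i] * x[i];
      sxy += x[i] * y[i];
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    return new double[] { intercept, slope };
  }

  // Applies the fitted model to a new value, like model.transform() does per row.
  static double predict(double[] model, double x) {
    return model[0] + model[1] * x;
  }
}
```

Fitting on the years up to the threshold and then predicting beyond it is exactly the `lr.fit(trainingData)` / `model.transform(df)` pattern above, done at scale.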
43. // Extract 2021 data
Dataset<Row> df2021 = df.filter(expr(
"month >= TO_DATE('2021-01-01') and month <= TO_DATE('2021-12-31')"));
int monthCount = (int) df2021.count();
// Calculate imputation data
df2021 = df2021
.agg(sum("pax").as("pax"),
sum("internationalPax").as("internationalPax"),
sum("domesticPax").as("domesticPax"));
int pax = DataframeUtils.maxAsInt(df2021, "pax") / (12 - monthCount);
int intPax = DataframeUtils.maxAsInt(df2021, "internationalPax") / (12 - monthCount);
int domPax = DataframeUtils.maxAsInt(df2021, "domesticPax") / (12 - monthCount);
// Create a new dataframe, from scratch, with the additional data
List<String> data = new ArrayList<>();
for (int i = monthCount + 1; i <= 12; i++) {
data.add("2021-" + i + "-01");
}
Dataset<Row> dfImputation2021 = spark
.createDataset(data, Encoders.STRING()).toDF()
.withColumn("month", col("value").cast(DataTypes.DateType))
.withColumn("pax", lit(pax))
.withColumn("internationalPax", lit(intPax))
.withColumn("domesticPax", lit(domPax))
.drop("value");
/jgperrin/ai.jgp.drsti-spark
Lab #600
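Note that the month strings built by `"2021-" + i + "-01"` are not zero-padded ("2021-8-01"); Spark's cast to DateType tolerates that, but a padded ISO format is safer if the strings are ever parsed elsewhere. A small Spark-free sketch of the generation step (the class `ImputationMonths` is a hypothetical helper for illustration, assuming `monthCount` months of 2021 already have data):

```java
import java.util.ArrayList;
import java.util.List;

public class ImputationMonths {
  // Builds first-of-month ISO date strings for the 2021 months with no data yet.
  static List<String> remainingMonths(int monthCount) {
    List<String> data = new ArrayList<>();
    for (int i = monthCount + 1; i <= 12; i++) {
      // Zero-pad the month so the string is a valid ISO-8601 date everywhere
      data.add(String.format("2021-%02d-01", i));
    }
    return data;
  }
}
```

The resulting list is what `spark.createDataset(data, Encoders.STRING())` turns into the imputation dataframe above.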
44. LinearRegression lr = new LinearRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8).setLabelCol("pax");
LinearRegressionModel model2019 = lr.fit(df.filter(col("year").leq(2019)));
df = model2019
.transform(df)
.withColumnRenamed("prediction", "prediction2019");
LinearRegressionModel model2021 = lr.fit(df.filter(col("year").leq(2021)));
df = model2021
.transform(df)
.withColumnRenamed("prediction", "prediction2021");
Pretty much the same code as lab #500, except for renaming the prediction columns.
The same linear regression is reused to train both models, but the resulting models are different!
/jgperrin/ai.jgp.drsti-spark
Lab #610
45. Same model
Step 1 (learning phase): the trainer fits a model on dataset #1.
Steps 2..n (predictive phase): the same model transforms dataset #2 into predicted data.
It's all about the base model
47. There are two kinds of
data scientists:
1) Those who can
extrapolate from
incomplete data.
48. DATA Engineer
Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
• Match architecture with business needs.
• Develop processes for data modeling, mining, and pipelines.
• Improve data reliability and quality.
DATA Scientist
Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepare data for predictive models.
• Explore data to find hidden gems and patterns.
• Tell stories to key stakeholders.
Source: adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
51. Call for action
• We always need more data
• Air Traffic @ https://github.com/jgperrin/ai.jgp.drsti-spark
• COVID-19 @ https://github.com/jgperrin/net.jgp.books.spark.ch99
• Go try & contribute to dṛṣṭi at http://jgp.ai/drsti
• Follow me on Twitter @jgperrin & YouTube /jgperrin
52. Key takeaways
• Spark is very fun & powerful for any data application:
• Data engineering
• Data science
• New vocabulary & concepts around Apache Spark: dataframe, the analytics operating system
• Machine learning & AI work better with Big Data
• Data is fluid (and that's really painful)