Building a Versatile Analytics Pipeline
On Top Of Apache Spark
Misha Chernetsov, Grammarly
Spark Summit 2017
June 6, 2017
Data Team Lead @ Grammarly
Building Analytics Pipelines (5 years)
Coding on JVM (12 years), Scala + Spark (3 years)
About Me: Misha Chernetsov
@chernetsov
Tool that helps us better understand:
● Who are our users?
● How do they interact with the product?
● How do they get in, engage, pay, and how long do they stay?
Analytics @ Consumer Product Company
We want our decisions to be
data-driven
Everyone: product managers, marketing, engineers, support...
Analytics @ Consumer Product Company
data
analytics
report
Analytics @ Consumer Product Company
Example Report 1 – Daily Active Users
[Chart: number of unique active users by day, plotted against calendar day. Dummy data.]
Example Report 2 – Comparison of Cohort Retention Over Time
[Chart: dummy data.]
Example Report 3 – Payer Conversions By Traffic Source
[Chart: number of users who bought a subscription per calendar day, split by traffic source type (where the user came from): Ads, Email, Social. Dummy data.]
● Landing page visit
○ URL with UTM tags
○ Referrer
● Subscription purchased
○ Is first in subscription
Example: Data
Everything is an Event
Example: Data
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
…
}
{
"eventName": "subscribe",
"period": "12 months",
…
}
Example: Data
[The same two events, annotated with the steps applied to them: enrich and/or join, slice by, plot.]
capture enrich index query
Analytics @ Consumer Product Company
Use 3rd Party?
1. Integrated Event Analytics
2. UI over your DB
Reports are not tailored for your needs, limited capability.
Pre-aggregation / enriching is still on you.
Hard to achieve accuracy and trust.
Build Step 1: Capture
● Always up, resilient
● Spikes / back pressure
● Buffer for delayed processing
Kafka
Capture
REST
{
"eventName": "page-visit",
"url": "...?utm_medium=paid",
…
}
Save To Long-Term Storage
[Diagram: Kafka → micro-batch stream → Cassandra (long-term storage).]
val rdd = KafkaUtils.createRDD[K, V](...)
rdd.saveToCassandra("raw")
Build Step 2: Enrich
Enrichment 1: User Attribution
Enrichment 1: User Attribution
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
"fingerprint": "abc",
…
}
{
"eventName": "subscribe",
"userId": 123,
"fingerprint": "abc",
…
}
Enrichment 1: User Attribution
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
"fingerprint": "abc",
"attributedUserId": 123,
…
}
{
"eventName": "subscribe",
"userId": 123,
"fingerprint": "abc",
…
}
[Timeline for fingerprint = abc (all events from a given browser): the earlier events are non-authenticated (userId = null), the later ones are authenticated (userId = 123).]
Enrichment 1: User Attribution
[Same timeline after enrichment: the non-authenticated events get attributedUserId = 123.]
Enrichment 1: User Attribution
[Timeline for fingerprint = abc with two authenticated users, userId = 756 and userId = 123: heuristics are needed to attribute the non-authenticated events.]
Enrichment 1: User Attribution
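import scala.collection.mutable.ArrayBuffer
rdd.mapPartitions { iterator =>
  // Simplified: treats the partition as one browser's events in time order
  val events = iterator.buffered // lets us peek at the next event with head
  val buffer = new ArrayBuffer[Event]() // Event: assumed name of the element type
  // Buffer the leading non-authenticated events
  while (events.hasNext && events.head.userId.isEmpty) buffer += events.next()
  // The first authenticated event, if there is one, supplies the userId
  val userId = if (events.hasNext) events.head.userId else None
  buffer.iterator.map(_.setAttributedUserId(userId)) ++ events
}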
Enrichment 1: User Attribution
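The attribution step can be tried without a cluster. The following is a minimal sketch on a plain Scala iterator, assuming one browser's events arrive in time order. The `Event` case class and the `attribute` function are illustrative names, not the talk's actual code.

```scala
import scala.collection.mutable.ArrayBuffer

object AttributionSketch {
  // Hypothetical event shape; real events carry many more fields.
  final case class Event(
      name: String,
      userId: Option[Long],
      attributedUserId: Option[Long] = None)

  /** Attributes leading anonymous events to the first authenticated user that follows them. */
  def attribute(events: Iterator[Event]): Iterator[Event] = {
    val it = events.buffered // BufferedIterator: head peeks without consuming
    val anonymous = ArrayBuffer.empty[Event]
    // Hold back events until the first one that carries a userId
    while (it.hasNext && it.head.userId.isEmpty) anonymous += it.next()
    // No authenticated event at all means there is nothing to attribute
    val userId = if (it.hasNext) it.head.userId else None
    anonymous.iterator.map(_.copy(attributedUserId = userId)) ++ it
  }
}
```

The `anonymous` buffer is unbounded, which is exactly the weakness the next slides address.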
Can grow big and
OOM your worker for
outliers who use
Grammarly without
ever registering
By default we
should operate
in User Memory
(small fraction).
Spark & Memory
User Memory: 100% - spark.memory.fraction = 25%
Spark Memory: spark.memory.fraction = 75% (the Spark 1.6 default; Spark 2.x lowers the default to 0.6)
Let’s get into
Spark Memory
and use its
safety features.
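As a worked example of the split, here is a small sketch of the arithmetic. It assumes the unified memory manager's behavior as I understand it: about 300 MB of the heap is reserved first, and `spark.memory.fraction` applies to the remainder. `MemoryRegions` and `split` are illustrative names.

```scala
object MemoryRegions {
  // Heap set aside before the fraction is applied (assumed: 300 MB)
  val ReservedBytes: Long = 300L * 1024 * 1024

  /** Returns (sparkMemory, userMemory) in bytes for a given executor heap. */
  def split(heapBytes: Long, memoryFraction: Double): (Long, Long) = {
    val usable = heapBytes - ReservedBytes
    val spark = (usable * memoryFraction).toLong
    (spark, usable - spark)
  }
}
```

With a 4 GiB heap and a fraction of 0.75, this gives 2847 MiB of Spark Memory and 949 MiB of User Memory.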
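rdd.mapPartitions { iterator =>
  val events = iterator.buffered // lets us peek at the next event with head
  val buffer = new SpillableBuffer() // custom collection backed by Spark Memory
  // Buffer the leading non-authenticated events
  while (events.hasNext && events.head.userId.isEmpty) buffer.append(events.next())
  // The first authenticated event, if there is one, supplies the userId
  val userId = if (events.hasNext) events.head.userId else None
  buffer.map(_.setAttributedUserId(userId)) ++ events
}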
Enrichment 1: User Attribution
Can safely grow in
mem while enough
free Spark Mem. Spills
to disk otherwise.
Spark Memory Manager & Spillable Collection
[Animation: the collection grows in memory and asks for ×2 more when it fills up; when the request is not granted, it spills to disk.]
trait SizeTracker {
  // Call on every append. Periodically estimates size and saves samples.
  def afterUpdate(): Unit = { … }
  // Extrapolates the current size from the saved samples.
  def estimateSize(): Long = { … }
}
SizeTracker
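To make the sample-and-extrapolate idea concrete, here is a simplified sketch, not Spark's implementation. It measures the real size only at growing intervals and extrapolates in between. `SampledSize`, `measure` and the 1.1 growth rate are assumptions chosen for illustration.

```scala
object SizeTrackerSketch {
  /** Estimates a collection's size without measuring it on every update.
    * `measure` stands in for an exact but expensive size computation. */
  final class SampledSize(measure: () => Long, growthRate: Double = 1.1) {
    private var numUpdates = 0L
    private var nextSampleAt = 1L
    private var lastSample: (Long, Long) = (0L, 0L) // (size, numUpdates)
    private var bytesPerUpdate = 0.0

    /** Call after every append; samples only at growing intervals. */
    def afterUpdate(): Unit = {
      numUpdates += 1
      if (numUpdates == nextSampleAt) takeSample()
    }

    private def takeSample(): Unit = {
      val size = measure()
      val (prevSize, prevUpdates) = lastSample
      // Growth per update between the last two samples
      bytesPerUpdate =
        math.max(0.0, (size - prevSize).toDouble / (numUpdates - prevUpdates))
      lastSample = (size, numUpdates)
      nextSampleAt = math.ceil(numUpdates * growthRate).toLong
    }

    /** Last measured size plus extrapolated growth since that sample. */
    def estimateSize(): Long = {
      val (size, updates) = lastSample
      (size + bytesPerUpdate * (numUpdates - updates)).toLong
    }
  }
}
```

For a collection that grows linearly the estimate is exact; for uneven element sizes it drifts until the next sample corrects it.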
trait Spillable[C] {
  // Writes the in-memory collection to disk
  def spill(inMemCollection: C): Unit
  // Call on every append to the collection
  def maybeSpill(currentMemory: Long, inMemCollection: C): Unit = {
    // try to acquire x2 memory if needed
    …
  }
}
Spillable
public long acquireExecutionMemory(long required, …)
public void releaseExecutionMemory(long size, …)
TaskMemoryManager
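The three pieces above fit together roughly as follows. This is a self-contained toy, not the talk's SpillableBuffer and not Spark's internal classes: `MemoryPool` stands in for the task memory manager, sizes are crude estimates, and strings are assumed to contain no line breaks.

```scala
import java.io.{BufferedWriter, File, FileWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

object SpillSketch {
  /** Toy stand-in for a memory manager: a fixed pool of bytes. */
  final class MemoryPool(capacity: Long) {
    private var used = 0L
    /** Grants as much of `required` as is still free and returns the granted amount. */
    def acquire(required: Long): Long = {
      val granted = math.min(required, capacity - used)
      used += granted
      granted
    }
    def release(size: Long): Unit = used -= math.min(size, used)
  }

  /** String buffer that spills to a temp file when the pool refuses more memory. */
  final class SpillableStringBuffer(pool: MemoryPool, initialReservation: Long = 64L) {
    private val inMemory = ArrayBuffer.empty[String]
    private var estimatedBytes = 0L
    private var reserved = pool.acquire(initialReservation)
    private var spillFiles = List.empty[File]

    def append(s: String): Unit = {
      inMemory += s
      estimatedBytes += 2L * s.length // rough estimate: 2 bytes per char
      maybeSpill()
    }

    private def maybeSpill(): Unit =
      if (estimatedBytes > reserved) {
        // Try to double the reservation before giving up on memory
        reserved += pool.acquire(2 * estimatedBytes - reserved)
        if (estimatedBytes > reserved) spill()
      }

    private def spill(): Unit = {
      val file = File.createTempFile("spill", ".txt")
      file.deleteOnExit()
      val out = new BufferedWriter(new FileWriter(file))
      try inMemory.foreach { s => out.write(s); out.newLine() }
      finally out.close()
      spillFiles = file :: spillFiles
      inMemory.clear()
      estimatedBytes = 0L
      pool.release(reserved) // hand the memory back to the pool
      reserved = 0L
    }

    def spillCount: Int = spillFiles.size

    /** Elements in insertion order: spilled files first, then what is in memory.
      * Each spill file is read back whole, which keeps the sketch short. */
    def iterator: Iterator[String] = {
      val spilled = spillFiles.reverse.iterator.flatMap { f =>
        val src = Source.fromFile(f)
        try src.getLines().toList finally src.close()
      }
      spilled ++ inMemory.iterator
    }
  }
}
```

A real implementation would estimate sizes by sampling and would stream spill files back lazily instead of loading them into a list.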
● Be safe with outliers
● Get outside User Memory (25%), use Spark Memory (75%)
● Spark APIs: Could be a bit friendlier and high level
Custom Spillable Collection
Enrichment 2: Calculable Props
Enrichment Phase 2: Calculable Props
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
"fingerprint": "abc",
"attributedUserId": 123,
…
}
{
"eventName": "subscribe",
"userId": 123,
"fingerprint": "abc",
…
}
Enrichment Phase 2: Calculable Props
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
"fingerprint": "abc",
"attributedUserId": 123,
…
}
{
"eventName": "subscribe",
"userId": 123,
"fingerprint": "abc",
"firstUtmMedium": "ad",
…
}
val firstUtmMedium: CalcProp[String] =
  (E \ "url").as[Url]
    .map(_.param("utm_medium"))
    .forEvent("page-visit")
    .first
Enrichment Phase 2: Calculable Props Engine & DSL
● Type-safe, functional, composable
● Familiar: similar to Scala collections API
● Batch & Stream (incremental)
Enrichment Phase 2: Calculable Props Engine & DSL
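One way such a DSL could be built is sketched below, under stated assumptions: the engine itself is not shown in the slides, so `Extractor`, `mapOption`, `queryParam` and the fold-based `CalcProp` are hypothetical. The point is that a property defined as a fold over events works the same way in batch (fold the whole history) and in streaming (apply `update` per event).

```scala
object CalcPropSketch {
  final case class Event(name: String, props: Map[String, String])

  /** A property folded over a user's events one at a time. */
  final case class CalcProp[A](update: (Option[A], Event) => Option[A]) {
    /** Batch mode: fold the full, time-ordered history. */
    def batch(events: Seq[Event]): Option[A] =
      events.foldLeft(Option.empty[A])(update)
  }

  /** Per-event extractor that composes like a Scala collection. */
  final case class Extractor[A](run: Event => Option[A]) {
    def map[B](f: A => B): Extractor[B] =
      Extractor[B](e => run(e).map(f))
    def mapOption[B](f: A => Option[B]): Extractor[B] =
      Extractor[B](e => run(e).flatMap(f))
    def forEvent(name: String): Extractor[A] =
      Extractor[A](e => if (e.name == name) run(e) else None)
    /** Keeps the first defined value and ignores later events. */
    def first: CalcProp[A] =
      CalcProp[A]((state, e) => state.orElse(run(e)))
  }

  object E {
    def \(key: String): Extractor[String] =
      Extractor[String](_.props.get(key))
  }

  /** Naive query-string lookup; no URL decoding. */
  def queryParam(url: String, key: String): Option[String] = {
    val query = url.dropWhile(_ != '?').drop(1).takeWhile(_ != '#')
    query.split('&').iterator
      .map(_.split("=", 2))
      .collectFirst { case Array(k, v) if k == key => v }
  }

  val firstUtmMedium: CalcProp[String] =
    (E \ "url").mapOption(queryParam(_, "utm_medium")).forEvent("page-visit").first
}
```

In streaming mode the stored state for a user is just the current `Option[A]`, and each new event is applied with `firstUtmMedium.update(state, event)`.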
Enrichment Pipeline with Spark
[Diagram: Raw Kafka feeds the Spark pipeline. Stream stages: Save Raw Kafka, User Attr., Calc Props. Batch stages: User Attr., Calc Props. Intermediate topic: User-attributed Kafka. Output: Enriched and Queryable.]
Spark Pipeline
[Diagram: three micro-batch stages and two batch stages connected through Kafka and Cassandra, ending in Parquet on AWS S3.]
● Connectors for everything
● Great for batch
○ Shuffle with spilling
○ Failure recovery
● Great for streaming
○ Fast
○ Low overhead
Spark Pipeline
Multiple Output Destinations
[Diagram: a single job with Kafka and Cassandra on the input side and Kafka and Cassandra on the output side.]
val rdd: RDD[T]
rdd.sendToKafka("topic_x")
rdd.saveToCassandra("table_foo")
rdd.saveToCassandra("table_bar")
Multiple Output Destinations: Try 1
rdd.saveToCassandra(...)
rdd.foreachPartition(...)
sc.runJob(...)
Multiple Output Destinations: Try 1
Multiple Output Destinations: Try 1 = 3 Jobs
[Diagram: each output runs as its own job and re-reads the source: Kafka → Kafka, Kafka → Table 1, Kafka → Table 2.]
val rdd: RDD[T]
rdd.cache()
rdd.sendToKafka("topic_x")
rdd.saveToCassandra("table_foo")
rdd.saveToCassandra("table_bar")
Multiple Output Destinations: Try 2
Multiple Output Destinations: Try 2 = Read Once, 3 Jobs
[Diagram: still three jobs, but only the first reads the source: Kafka → Kafka, Cache → Table 1, Cache → Table 2.]
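import java.io.{BufferedWriter, FileOutputStream, OutputStreamWriter}
rdd.foreachPartition { iterator =>
  // One writer per partition, created on the executor
  val writer = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream(...))
  )
  try {
    iterator.foreach { el =>
      writer.write(el.toString)
      writer.newLine()
    }
  } finally {
    writer.close() // flushes the buffer, which makes sure this writes
  }
}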
● Buffer
● Non-blocking
● Idempotent / Dedupe
Writer
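val andWriteToX = rdd.mapPartitions { iterator =>
  val writer = new XWriter()
  iterator.map { el =>
    writer.write(el) // side effect: write while the element passes through
    el // hand the element on to the next stage
  }.closing(() => writer.close()) // closing: custom helper, runs once the iterator is exhausted
}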
AndWriter
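The pass-through idea can be shown on plain iterators. This sketch assumes nothing about the real writers: `ClosingIterator`, `RecordingWriter` and `andWriteTo` are hypothetical names, and the writer only records what it receives.

```scala
import scala.collection.mutable.ArrayBuffer

object AndWriterSketch {
  /** Wraps an iterator and runs `onClose` exactly once when it is exhausted. */
  final class ClosingIterator[A](underlying: Iterator[A], onClose: () => Unit)
      extends Iterator[A] {
    private var closed = false
    def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !closed) { closed = true; onClose() }
      more
    }
    def next(): A = underlying.next()
  }

  /** Stand-in sink; a real one would buffer and write to Kafka, Cassandra, etc. */
  final class RecordingWriter[A] {
    val written = ArrayBuffer.empty[A]
    var closed = false
    def write(a: A): Unit = { written += a }
    def close(): Unit = { closed = true }
  }

  /** Writes each element as a side effect and forwards it unchanged. */
  def andWriteTo[A](it: Iterator[A], writer: RecordingWriter[A]): Iterator[A] =
    new ClosingIterator(it.map { el => writer.write(el); el }, () => writer.close())
}
```

Chaining two such stages and consuming the result once, for example `andWriteTo(andWriteTo(source, w1), w2).foreach(_ => ())`, feeds both writers in a single pass. The sketch does not close writers if the stream fails midway; production code would need to handle that.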
val rdd: RDD[T]
rdd.andSaveToCassandra("table_foo")
  .andSaveToCassandra("table_bar")
  .sendToKafka("topic_x")
Multiple Output Destinations: Try 3
[Diagram: a single job reads the source once and writes to every destination, Kafka and Cassandra.]
● Kafka
● Cassandra
● HDFS
Important! Each andWriter will consume resources
● Memory (buffers)
● IO
And Writer
Build Step 3: Index
Index
● Parquet on AWS S3
● Custom partitioning: By eventName and time interval
● Append changes, compact + merge on the fly when querying
● Randomized names to maximize S3 parallelism
● Use s3a for max performance, and tweak for S3 read-after-write consistency
● Support flexible schema, even with conflicts!
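As an illustration of the partitioning and naming bullets above, here is a hypothetical key layout. The slides do not show the real scheme, so the hourly interval, the `eventName=` / `hour=` directory names and the 4-digit hex prefix are all assumptions.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.temporal.ChronoUnit
import scala.util.Random

object IndexPathSketch {
  /** Partition directory: one per eventName and hour (interval length assumed). */
  def partitionDir(eventName: String, ts: Instant): String = {
    val hour = ts.truncatedTo(ChronoUnit.HOURS).atOffset(ZoneOffset.UTC)
    f"eventName=$eventName/hour=${hour.getYear}%04d-${hour.getMonthValue}%02d-" +
      f"${hour.getDayOfMonth}%02d-${hour.getHour}%02d"
  }

  /** File name with a random hex prefix, so that appended files get varied names. */
  def fileName(rnd: Random): String =
    f"${rnd.nextInt(0x10000)}%04x-part.parquet"
}
```

Because files are only ever appended under a partition directory, a reader can list one directory per event name and interval and merge what it finds.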
Some Stats
● Thousands of events per second
● Terabytes of compressed data
Build Step 4: Query
● DataFrames
● Spark SQL Scala DSL / pure SQL
● Zeppelin
Hardcore Query
● Plot by day
● Unique visitors
● Filter by country
● Split by traffic source (top 20)
● Time from 2 weeks ago to today
Casual Query
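To pin down what this casual query computes, here is a sketch on plain Scala collections rather than on the real index. The `Visit` shape and `uniqueVisitors` are illustrative names, and the top-20 cut on traffic sources is left out for brevity.

```scala
import java.time.LocalDate

object CasualQuerySketch {
  // Hypothetical, already-enriched visit record
  final case class Visit(day: LocalDate, userId: Long, country: String, source: String)

  /** Unique visitors per (day, traffic source) for one country within [from, to]. */
  def uniqueVisitors(
      visits: Seq[Visit],
      country: String,
      from: LocalDate,
      to: LocalDate): Map[(LocalDate, String), Int] =
    visits
      .filter(v => v.country == country && !v.day.isBefore(from) && !v.day.isAfter(to))
      .groupBy(v => (v.day, v.source))
      .map { case (key, vs) => key -> vs.map(_.userId).distinct.size }
}
```

Even this small query combines a filter, a time range, a split and a distinct count, which is why the slides go on to compare ways of expressing it.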
Option 1: SQL
Quickly gets complex
Too expensive
to build, extend
and support
Option 2: UI
SEGMENT “eventName”
WHERE foo = “bar” AND x.y IN (“a”, “b”, “c”)
UNIQUE
BY m IS NOT NULL
TIME from 2 months ago to today
STEP 1 month SPAN 1 week
Option 3: Custom Query Language
Expressions
● Segment, Funnel, Retention
● UI & as DataFrame in Zeppelin
● Spark <= 1.6 – Scala Parser Combinators
● Reuse most complex part of expression parser
● Relatively extensible
Option 3: Custom Query Language
● Custom versatile analytics is doable and enjoyable
● Spark is a great platform to build analytics on top of
○ Enrichment Pipeline: Batch / Streaming, Query, ML
● Would be cool to see even deep internals slightly more extensible
Conclusion
We are hiring!
olivia@grammarly.com
https://www.grammarly.com/jobs
Thank you!
Questions?
