The document discusses using Apache Spark to build a batch processing pipeline for aggregating music streaming user data. It presents the problem of aggregating a user's top liked genres for a music streaming app, then covers building a basic Spark pipeline to ingest new data, transform it into aggregations, and update existing aggregations. It goes on to discuss using Delta Lake and incremental state updates to optimize the pipeline, and closes with approaches for continuously refining the aggregations.
5. MEET RISKIFIED
Riskified is an AI platform
powering the eCommerce
revolution. We have an
unparalleled ability to recognize
legitimate customers and keep
them moving toward conversion.
The world’s largest brands trust us
to increase revenue, manage risk,
and improve customer
interactions.
1M+ transactions daily
600+ total employees, 130 in R&D
$229M in funding to date
Enterprise focus: clients include many publicly traded companies
7. • BlueNote - an imagined music streaming app
• We’re asked to present a per-user aggregation
in the app’s UserInfo page
• There are many metrics required in this project.
We will focus on “top liked genres”
THE PROBLEM SPACE
16. We gain:
• Faster response times
• Scalability
• Cleaner separation of concerns
However, we pay with:
• Data latency
• Diminished agility
• Storage & compute costs
IN PRAISE OF AGGREGATIONS
Aggregation trade-offs
18. SPARK 101
Apache Spark is an
open-source unified analytics
engine for large-scale data
processing. Spark provides an
interface for programming
entire clusters with implicit data
parallelism and fault tolerance.
— Wikipedia
19. • Spark itself is written in Scala
• Spark Applications can be written in Scala,
Java, Python or R
• Dataset vs. DataFrame
• Various data formats, with Parquet as the default
◦ Columnar format
◦ Compression
◦ Nested data structures
SPARK 101
21. • Bring in new raw data to work with
• Transform the new raw data into a useful aggregation
• Allow the updated aggregation to be accessible for apps
A BASIC PIPELINE
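The transform stage's logic can be sketched independently of any specific stack. A minimal plain-Python sketch of turning raw play events into a per-user "top liked genres" aggregation (the record shape, field names like `user_id` and `liked`, and the top-3 cutoff are all illustrative assumptions, not from the deck):

```python
from collections import Counter

def transform(plays):
    """Turn raw play events into a per-user top-liked-genres aggregation.

    `plays` is a list of {"user_id": ..., "genre": ..., "liked": bool}
    records -- a hypothetical stand-in for newly ingested raw data.
    """
    per_user = {}
    for play in plays:
        if play["liked"]:
            per_user.setdefault(play["user_id"], Counter())[play["genre"]] += 1
    # Keep only the top liked genres per user, ready to be served to the app.
    return {
        user: [genre for genre, _ in counts.most_common(3)]
        for user, counts in per_user.items()
    }

plays = [
    {"user_id": "u1", "genre": "jazz", "liked": True},
    {"user_id": "u1", "genre": "jazz", "liked": True},
    {"user_id": "u1", "genre": "rock", "liked": True},
    {"user_id": "u1", "genre": "pop", "liked": False},
    {"user_id": "u2", "genre": "pop", "liked": True},
]
print(transform(plays))  # {'u1': ['jazz', 'rock'], 'u2': ['pop']}
```

In the real pipeline this computation would run as a Spark job over the ingested batch, but the aggregation itself is just this kind of group-and-count.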
43. • ACID transactions over HDFS
• Updating Parquet files is slow
• Delta allows us to postpone optimization
• Integrates perfectly with Spark
• We are already using S3
How can Delta Lake help?
INCREMENTAL STATE
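A sketch of what this looks like with Delta Lake's MERGE, which updates existing aggregation rows in place rather than rewriting whole Parquet files. This assumes the `delta-spark` package, an active `SparkSession` named `spark`, and hypothetical S3 paths and column names:

```python
# Hedged sketch: requires a Spark cluster with delta-spark configured.
from delta.tables import DeltaTable

# Hypothetical path: the batch of freshly computed per-user aggregations.
new_aggs = spark.read.parquet("s3://bluenote/new-aggregations/")

# Hypothetical path: the Delta table holding the current aggregation state.
target = DeltaTable.forPath(spark, "s3://bluenote/user-genre-aggregations/")

(target.alias("current")
    .merge(new_aggs.alias("incoming"), "current.user_id = incoming.user_id")
    .whenMatchedUpdateAll()     # overwrite state for users we already know
    .whenNotMatchedInsertAll()  # insert rows for first-time users
    .execute())
```

Because Delta wraps this in an ACID transaction, readers never see a half-updated aggregation, which is what lets the team postpone deeper optimization work.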
47. Semigroup is a neat, simple typeclass that allows us to describe how
two instances of type A combine into a single A:
(A, A) => A
The Semigroup instance for
Map comes from here!
Semigroup for the win
INCREMENTAL STATE
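The combine operation `(A, A) => A` is exactly what lets two partial aggregation states merge into one. A minimal plain-Python sketch of the idea, combining two genre-count maps value-wise the way a `Semigroup` instance for `Map` would (the genre names and the `combine` helper are illustrative, not from the deck):

```python
def combine(a, b):
    """Semigroup-style combine for dicts of counts: (A, A) => A.

    Values under the same key are themselves combined (here: added),
    mirroring how a Semigroup instance for Map merges value-wise.
    """
    merged = dict(a)
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged

state = {"jazz": 4, "rock": 1}  # existing aggregation state
fresh = {"rock": 2, "pop": 1}   # newly computed increment
print(combine(state, fresh))  # {'jazz': 4, 'rock': 3, 'pop': 1}
```

Because `combine` is associative, increments can be folded into the state in any grouping, which is what makes incremental updates safe to batch and reorder.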
59. • Proper data: what feeds our process?
• The population stage
• Testability
• Stack agnosticism
• S3 could be GCS or other
• Final DB can be anything
• Bring your own scheduler
AFTERTHOUGHTS