In this talk we will dive deep into data pre-processing, the data preparation part of a Data Scientist's work. Why is data pre-processing such an important topic for aspiring Data Scientists and Machine Learning Engineers to pay attention to? How do you process terabytes of static and moving (i.e. streaming) schemaless data? How do you ensure horizontal scalability up to petabytes when you expect such growth? We'll share several insights into how Apache Spark, a fast and general engine for large-scale data processing, and Apache Kafka helped us deal with 80% of our Data Scientists' work. Why do you need such high-caliber tools as Spark or Kafka, when is it viable to use them, and how can you avoid them? What are the pitfalls of distributed processing with Spark and Kafka? How can Google Cloud Platform help and cut costs by up to 90%? We'll share what we've learned along the (hard) way.