Scalable AutoML for Time Series Forecasting using Ray | Databricks
Time series forecasting is widely used in real-world applications, such as network quality analysis in telcos, log analysis for data center operations, and predictive maintenance for high-value equipment.
SnapLogic: iPaaS (Elastic Integration Cloud and Data Integration) | Surendar S
This document provides useful and meaningful concepts about SnapLogic, and will be especially helpful for beginner- and intermediate-level SnapLogic learners.
How to Define and Share your Event APIs using AsyncAPI and Event API Products... | Hosted by Confluent
Defining asynchronous APIs and sharing them with your developer community is the most effective way for internal app developers and partners to create new services using real-time event streams. But how do you do it? What specification do you use to define the APIs? What are the best practices for sharing them with the developer community? What framework can you use to code? And what's next? How do you manage the lifecycle of these APIs? In this talk, Fran Mendez, founder of AsyncAPI, and Jonathan Schabowsky, Solace CTO Architect, will introduce you to the AsyncAPI specification and show you two different methods to define and share your event APIs, quickly get up to speed, and more. You will learn how to create a Kafka application using asynchronous APIs in minutes!
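An AsyncAPI document is just structured data. As a hedged illustration (the channel name and payload fields below are made up for this sketch, not taken from the talk), here is the minimal shape of an AsyncAPI 2.x document built as a plain Python dict:

```python
import json

# Minimal shape of an AsyncAPI 2.x document, built as a plain dict.
# The channel name and payload fields are illustrative, not from the talk.
spec = {
    "asyncapi": "2.0.0",
    "info": {"title": "User Signup Events", "version": "1.0.0"},
    "channels": {
        "user/signedup": {
            "subscribe": {  # consumers receive events published on this channel
                "message": {
                    "payload": {
                        "type": "object",
                        "properties": {
                            "userId": {"type": "string"},
                            "signedUpAt": {"type": "string", "format": "date-time"},
                        },
                    }
                }
            }
        }
    },
}

print(json.dumps(spec, indent=2))
```

The same structure is usually written as YAML in practice; tooling in the AsyncAPI ecosystem can generate docs and code stubs from it.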
Engineering products for scale, speed and agility | Atul Narkhede
How can you ensure product scalability and performance while racing to meet market needs?
Software product development companies today work in a high-speed, dynamic and challenging environment. Starting with an idea, you need to build a Minimum Viable Product that you can take to the market for feedback, then incorporate user feedback, while still being ready to launch before the competition. In this situation, how can you ensure that your products are reliable, scalable and secure? The secret is in following the best practices of product engineering.
Watch the audio-visual recording of this talk at http://bit.ly/UMaCEq
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr... | DataStax Academy
In this in-depth workshop you will gain hands-on experience using Spark and Cassandra inside the DataStax Enterprise Platform. The focus of the workshop will be working through data analytics exercises to understand the major developer considerations. You will also gain an understanding of the internals behind the integration that allow for large-scale data loading and analysis. It will also review some of the major machine learning libraries in Spark as an example of data analysis.
The workshop will start with a review of the basics of how Spark and Cassandra are integrated. Then we will work through a series of exercises that show how to perform large-scale data analytics with Spark and Cassandra. A major part of the workshop will be understanding effective data modeling techniques in Cassandra that allow for fast parallel loading of the data into Spark to perform large-scale analytics on that data. The exercises will also look at how to use the open source Spark Notebook to run interactive data analytics with the DataStax Enterprise Platform.
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring | Databricks
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
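The actual Spark Listener interface lives on the JVM (`org.apache.spark.scheduler.SparkListener` in Scala/Java); the class and event names below are illustrative Python, not the real Spark API. They sketch the callback pattern the talk describes: the scheduler posts events to a listener bus, and every registered listener gets a callback it can use for monitoring.

```python
# Schematic of the listener pattern that Spark Listeners follow.
# All names here are hypothetical stand-ins for the JVM-side Spark API.

class TaskEndEvent:
    """A minimal stand-in for a task-completion event."""
    def __init__(self, task_id, duration_ms):
        self.task_id = task_id
        self.duration_ms = duration_ms

class MetricsListener:
    """Accumulates task runtimes as events arrive, like a monitoring listener."""
    def __init__(self):
        self.total_ms = 0
        self.task_count = 0

    def on_task_end(self, event):
        self.total_ms += event.duration_ms
        self.task_count += 1

class ListenerBus:
    """The scheduler posts events; every registered listener gets a callback."""
    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def post_task_end(self, event):
        for listener in self.listeners:
            listener.on_task_end(event)

bus = ListenerBus()
metrics = MetricsListener()
bus.add_listener(metrics)
for task_id, ms in enumerate([120, 80, 200]):
    bus.post_task_end(TaskEndEvent(task_id, ms))
print(metrics.task_count, metrics.total_ms)  # 3 400
```

In real Spark you would subclass `SparkListener`, override the callbacks you care about, and register the instance with the SparkContext; the flow of events to callbacks is the same.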
Slide deck from the presentation given to the Azure Singapore user group on Monitoring Kubernetes with Prometheus and Grafana, 19 August 2021.
It covers Prometheus architecture, installation using the Prometheus Operator, Service Monitors, Pod Monitors, and alert rules. The live demo included Prometheus and Grafana integrations for Spring Boot and .NET Core applications. Monitoring for infrastructure and messaging platforms using RabbitMQ is also covered.
YouTube video recording - https://youtu.be/t8uenUoI4Mw
https://www.meetup.com/en-AU/mssgug/events/279925499
Server-Sent Events using Reactive Kafka and Spring WebFlux | Gagan Solur Ven... | Hosted by Confluent
Server-Sent Events (SSE) is a server push technology in which clients receive automatic server updates over a secure HTTP connection. SSE suits apps such as live stock updates that use one-way data communication, and it also helps replace long polling by maintaining a single connection and keeping a continuous event stream going through it. We used a simple Kafka producer to publish messages onto Kafka topics, and developed a reactive Kafka consumer by leveraging Spring WebFlux to read data from a Kafka topic in a non-blocking manner and send data to clients registered with the Kafka consumer without closing any HTTP connections. This implementation lets us send data in a fully asynchronous, non-blocking manner and handle a massive number of concurrent connections. We'll cover:
• Push data to external or internal apps in near real time
• Push data to files and securely copy them to any cloud service
• Handle multiple third-party app integrations
WSO2Con ASIA 2016: API Driven Innovation Within the Enterprise | WSO2
85% of enterprises have a digital transformation strategy in place, but only 30% have really executed on it. What about you? In this session, we will explore how enterprises can embark on their digital transformation journey, leveraging APIs and a service-based architecture. Isabelle shares how several customers have achieved their business goals and describes the technical approach they took to do so.
Michal Malohlava's presentation on Building Your Own Recommendation Engine 03.17.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Evolving the Engineering Culture to Manage Kafka as a Service | Kate Agnew, O... | Hosted by Confluent
Embracing open source software for critical platform operations is a tough organizational evolution for a company of any size. This is particularly daunting for technology teams accustomed to a fully supported managed service. Come learn about how we are using OSS to modernize Health Care at UnitedHealth Group as a roadmap to adopt and offer OSS in your own organization!
Over the last three years, Kafka as a Service within UnitedHealth Group has gone from non-existent to being centrally managed and utilized by over 200 internal application teams as an essential component to our ecosystem. In this session, I will share how to tactically implement a Kafka as a Service platform offering within any organization with a very lean team and how to get broad adoption from engineers and leadership.
I'll discuss the engineering cultural changes needed, both on the DevOps team as well as more broadly, to adopt OSS. Spoiler: Documentation is the key to success. I will talk about some of our "aha" moments, including the importance of internal Terms of Service and how to encourage teams to "Google first." I will include things that haven't worked as well, such as requiring manual review of all topic creation PRs (this doesn't scale!).
Attendees will learn how to both stand up their own OSS offering as well as how to be a good internal consumer of other such offerings. Come ready to learn and laugh about my journey to offering OSS to thousands of people!
Google Charts is a JavaScript API for quickly creating beautiful charts and graphs that are powerful, simple to use, and best of all free. This talk explores how you can incorporate Google Charts into your Android apps using a WebView and very little code.
Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr... | Databricks
This talk is a case study of how Apache Spark and the Spark-Solr library are used at Flipp to drive search relevancy. Flipp is a Toronto-based digital flyer and ecommerce company that helps shoppers save money on weekly shopping. Our customers can browse through our 5+ million products from brick-and-mortar retailers across North America, which makes search a very challenging function in our app. How do you show the most relevant, personalized search results for a query?
The talk will focus on using user signals such as Click-Through Rate (CTR) and impressions to increase search relevancy. I will also talk about how PySpark is used to create the Flipp Search ETL platform for collecting user signals and reading product data from Solr. I will explain the problem scenario in which keyword search and basic relevancy algorithms become ineffective when dealing with a large product database. The solutions will cover the following implementations being used at Flipp to drive relevancy:
– Utilizing user clicks and popularity data to derive and index normalized item weights, implementing the Search Crowd Curation models in Apache Solr.
– How around 5+ million items are classified into Google categories in real time using Keras and Apache Spark to power product category curation in Solr.
– How to create a crowd-sourced query intent categorizer in Solr using the Spark-Solr library.
– The use of offline and online metrics at Flipp for evaluating changes in search relevancy.
– Future plans for incorporating Kafka Connect with Structured Streaming to perform real-time product indexing in Solr with the Spark-Solr library.
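The abstract doesn't spell out the Crowd Curation math. As one hedged sketch of the general idea, click and impression counts can be smoothed into a CTR and normalized into index-time item weights; the priors and max-normalization below are assumptions for illustration, not Flipp's actual model:

```python
def normalized_item_weights(signals, prior_clicks=1.0, prior_impressions=20.0):
    """signals: {item_id: (clicks, impressions)}.
    Returns CTR-based weights in (0, 1], smoothed with pseudo-counts so
    low-traffic items don't dominate, then normalized by the max smoothed CTR."""
    smoothed_ctr = {
        item: (clicks + prior_clicks) / (impressions + prior_impressions)
        for item, (clicks, impressions) in signals.items()
    }
    top = max(smoothed_ctr.values())
    return {item: ctr / top for item, ctr in smoothed_ctr.items()}

weights = normalized_item_weights({
    "sku-1": (50, 1000),   # popular, well-sampled item
    "sku-2": (2, 10),      # tiny sample: smoothing tempers its raw 20% CTR
    "sku-3": (0, 500),     # shown a lot, never clicked
})
```

Weights like these could then be indexed per item and used as a boost factor at query time.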
Google Cloud Bangla session. I gave a talk all about Google Firebase, its features, and its technical benefits. After the session, I ran a small workshop so that people could get real-time hands-on experience.
This post talks about the various architectural decisions, and the reasons driving them, that were taken to build a REST API that needs to deliver large amounts of reporting data.
End-to-End Data Pipelines with Apache Spark | Burak Yavuz
This presentation is about building a data product backed by Apache Spark. The source code for the demo can be found at http://brkyvz.github.io/spark-pipeline
Stateful Microservices with Apache Kafka and Spring Cloud Stream with Jan Svo... | Hosted by Confluent
You have been building your applications with stateless microservices. You might even be a rockstar using Kafka for inter service communication. Everything works wonderfully but you feel you could do something more. You want your microservices to have a state.
Developing stateful microservices can be hard. I will share my experience with building stateful applications with Kafka and Spring Cloud Stream libraries.
Kafka Streams State Stores and Interactive Queries are the main building blocks. They are used by stream processing applications to store and query data. They can scale and be fault tolerant together with your application instances in your container platform. But there are some limitations and we need to know how to monitor their performance.
This session is targeted for developers who are interested in learning event streaming practices. Demo application code will be available to participants.
Databricks Meetup @ Los Angeles Apache Spark User Group | Paco Nathan
Los Angeles Apache Spark Users Group 2014-12-11 http://meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/218748643/
A look ahead at Spark Streaming in Spark 1.2 and beyond, with case studies, demos, plus an overview of approximation algorithms that are useful for real-time analytics.
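The summary above doesn't list which approximation algorithms the talk covers. A classic example of the genre used in real-time analytics is reservoir sampling, which keeps a fixed-size uniform random sample of an unbounded stream in O(k) memory:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Keep a uniform random sample of k items from a stream of unknown
    length using O(k) memory (Algorithm R). Each item ends up in the
    sample with probability k/n."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(10_000), 5)
```

Sketches in the same family (Count-Min Sketch, HyperLogLog, t-digest) trade exactness for bounded memory the same way, which is what makes them practical inside a streaming job.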
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana... | Lillian Pierson
In this one-hour webinar, you will be introduced to Spark, the data engineering that supports it, and the data science advances that it has spurred. You'll discover the interesting story of its academic origins and then get an overview of the organizations using the technology. After being briefed on some impressive Spark case studies, you'll come to know of the next-generation Spark 2.0 (to be released in just a few months). We will also tell you about the tremendous impact that learning Spark can have upon your current salary, and the best ways to get trained in this ground-breaking new technology.
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform | Yao Yao
Yao Yao, Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing, with libraries for SQL, streaming, and advanced analytics.
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Any startup has to have a clear go-to-market strategy from the beginning. Similarly, any data science project has to have a go-to-production strategy from its first days, so it could go beyond proof-of-concept. Machine learning and artificial intelligence in production would result in hundreds of training pipelines and machine learning models that are continuously revised by teams of data scientists and seamlessly connected with web applications for tenants and users.
In this demo-based talk we will walk through the best practices for simplifying machine learning operations across the enterprise and providing a serverless abstraction for data scientists and data engineers, so they could train, deploy and monitor machine learning models faster and with better quality.
Spark Development Lifecycle at Workday - ApacheCon 2020 | Pavel Hardak
Presented by Eren Avsarogullari and Pavel Hardak (ApacheCon 2020)
https://www.linkedin.com/in/erenavsarogullari/
https://www.linkedin.com/in/pavelhardak/
Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting various data processing use cases such as data ingestion, preparation (cleaning, transformation & publishing) and discovery. At Workday, we extend the Spark OSS repo and build custom Spark releases layering our custom patches on top of the Spark OSS patches. Custom Spark release development introduces challenges when supporting multiple Spark versions against a single repo and dealing with large numbers of customers, each of which can execute their own long-running Spark applications. When building custom Spark releases and new Spark features, a dedicated benchmark pipeline is also important to catch performance regressions, by running the standard TPC-H & TPC-DS queries against both Spark versions and monitoring the Spark driver's and executors' runtime behavior before production. At the deployment phase, we also follow a progressive roll-out plan leveraging Feature Toggles to enable or disable new Spark features at runtime. As part of our development lifecycle, Feature Toggles help with various use cases, such as selecting Spark compile-time and runtime versions, running test pipelines against both Spark versions on the build pipeline, and supporting progressive roll-out deployment when dealing with large numbers of customers and long-running Spark applications. On the other hand, the operation-level runtime behavior of executed Spark queries is important for debugging and troubleshooting. The upcoming Spark release introduces a new SQL REST API exposing executed queries' operation-level runtime metrics, and we transform them into queryable Hive tables in order to track operation-level runtime behavior per executed query. In light of this, this session covers the Spark feature development lifecycle at Workday: the custom Spark upgrade model, the benchmark & monitoring pipeline, and the Spark runtime metrics pipeline, through the patterns and technologies used, step by step.
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020 | Eren Avşaroğulları
Presented by Pavel Hardak and Eren Avsarogullari (ApacheCon 2020)
https://www.linkedin.com/in/pavelhardak/
https://www.linkedin.com/in/erenavsarogullari/
Building Data Products with BigQuery for PPC and SEO (SMX 2022) | Christopher Gutknecht
In this data management session, Christopher describes how to build robust and reliable data products in BigQuery and dbt for PPC and SEO use cases. After an introduction to the modern data stack, six principles of reliable data products are presented, followed by these use cases:
- Google Ads Conversion upload
- SEO sitemap efficiency report
- Google Shopping product rating sync
- Large-Scale link checker with advertools
- Inventory-based PPC campaigns with dbt
Here is the referenced selection of gists on GitHub: https://gist.github.com/ChrisGutknecht
Cherokee Nation 2-day AIAD & DIAD - App in a Day and Dashboard in a Day | Vishal Pawar
Power Apps: A software-as-a-service application platform that enables power users in line-of-business roles to easily build and deploy custom business apps. You will learn how to build Canvas and Model-driven styles of apps.
Common Data Service (CDS): Makes it easier to bring your data together and quickly create powerful apps using a compliant and scalable data service and app platform that's integrated into Power Apps.
Power Automate: A business service for line-of-business specialists and IT pros to build automated workflows intuitively.
Power BI: Self-service business intelligence capabilities, where end users can create reports and dashboards by themselves, without having to depend on information technology staff or database administrators.
.NET development on Linux: see why Microsoft loves Linux and Open Source | Rodrigo Kono
This session is an overview of Microsoft's approach to Linux and open source, including the software development landscape and the benefits for you. You will learn about Microsoft's work with Linux and open source, both on-premises and in the cloud with Azure. You will also learn how to develop with .NET technology, using C# on Linux and running independently of Windows Server.
SPSNYC2019 - What is Common Data Model and how to use it? | Nicolas Georgeault
Are you using PowerApps? Not yet or maybe just the Canvas option? All you need to know about the CDS Database, the way to deploy it and the way to use it to modernize your business applications using both Canvas and Model-Driven Apps.
KNOWAGE evolution in 2022 mainly focuses on: a new data preparation module and data federation in the self-service process, augmented analytics to support every end-user touch point and provide automatic insights, usability and performance improvements for a new effective UI, and a core offering as a SaaS ABI solution.
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma | Spark Summit
Learn about the big data processing ecosystem at Netflix and how Apache Spark sits in this platform. I talk about typical data flows and data pipeline architectures used at Netflix, and address how Spark is helping us gain efficiency in our processes. As a bonus, I'll touch on some unconventional use cases, contrary to typical warehousing/analytics solutions, that are being served by Apache Spark.
This presentation will be useful to those who would like to get acquainted with Apache Spark's architecture and top features, and to see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Presented at the Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Taboola's experience with Apache Spark (presentation @ Reversim 2014) (tsliwowicz)
At Taboola we get a constant feed of data (many billions of user events a day) and use Apache Spark together with Cassandra for both real-time stream processing and offline data processing. We'd like to share our experience with these cutting-edge technologies.
Apache Spark is an open source project: a Hadoop-compatible computing engine that makes big data analysis drastically faster, through in-memory computing, and simpler to write, through easy APIs in Java, Scala and Python. The project was born as part of PhD work in UC Berkeley's AMPLab (part of the BDAS stack, pronounced "Bad Ass") and turned into an incubating Apache project with more active contributors than Hadoop. Surprisingly, Yahoo! is one of the biggest contributors to the project and already has large production clusters of Spark on YARN.
Spark can run as a standalone cluster, under Apache Mesos with ZooKeeper, or on YARN, and can run side by side with Hadoop/Hive on the same data.
One of the biggest benefits of Spark is that the API is very simple and the same analytics code can be used for both streaming data and offline data processing.
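The "same code for streaming and offline" point can be sketched without Spark itself: write one transformation function, then apply it both to a full dataset (offline) and to micro-batches whose partial results are merged (streaming). This is a minimal pure-Python sketch of the idea — the event log and `count_clicks` helper are hypothetical, not Spark's actual RDD/DStream API:

```python
from collections import Counter

# Hypothetical analytics logic, written exactly once.
def count_clicks(events):
    """Count click events per user; 'events' is any iterable of (user, action)."""
    return Counter(user for user, action in events if action == "click")

log = [("u1", "click"), ("u2", "view"), ("u1", "click"), ("u3", "click")]

# Offline: apply to the full historical dataset at once.
offline = count_clicks(log)

# Streaming: apply the SAME function to micro-batches and merge the partials,
# mirroring how Spark reuses one transformation for batch and streaming data.
streaming = Counter()
for batch in [log[:2], log[2:]]:
    streaming += count_clicks(batch)

assert offline == streaming  # identical logic, identical result
```

The design point is that the per-batch logic never knows whether it is running over history or over a live stream; only the driver loop differs.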
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy-based vs in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
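The report benchmarks these reductions in OpenMP and CUDA; the split-reduce-combine shape they all share can be sketched in plain Python. The chunking strategy and helper names below are illustrative assumptions, and threads are used only to show the structure of a parallel reduction, not to reproduce the measured speedups:

```python
from concurrent.futures import ThreadPoolExecutor

def sequential_sum(xs):
    """Baseline: a plain sequential reduction over the vector."""
    total = 0.0
    for x in xs:
        total += x
    return total

def parallel_sum(xs, workers=4):
    """Split into contiguous chunks, reduce each chunk independently,
    then combine the partial sums -- the same tree-reduction shape an
    OpenMP or CUDA reduction uses."""
    chunk = (len(xs) + workers - 1) // workers
    parts = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sequential_sum, parts))

data = [0.5] * 1000
assert sequential_sum(data) == parallel_sum(data) == 500.0
```

The launch-config comparisons in the report correspond to tuning `workers` (threads per block, blocks per grid) in this picture: the reduction result is the same, only the partitioning changes.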
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance data exploration, analysis, and discovery.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies data acquisition with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, so you can focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By combining distributed ledger technology with rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
3. Collaborative Filtering
• Bucketed consumption groups
• Geo: region-based recommendations
• Context: metadata
• Social: Facebook/Twitter API
• User behavior: cookie data
An engine focused on maximizing CTR & post-click engagement.
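As a rough illustration of the collaborative-filtering idea behind such an engine (the interaction log and scoring below are hypothetical, not Taboola's actual algorithm), item-item co-occurrence alone is enough to produce simple recommendations:

```python
from collections import defaultdict

# Hypothetical interaction log: user -> set of items they engaged with.
history = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
}

# Item-item co-occurrence: count items consumed together by the same user.
cooc = defaultdict(lambda: defaultdict(int))
for items in history.values():
    for i in items:
        for j in items:
            if i != j:
                cooc[i][j] += 1

def recommend(user, k=1):
    """Score unseen items by co-occurrence with the user's history."""
    seen = history[user]
    scores = defaultdict(int)
    for i in seen:
        for j, c in cooc[i].items():
            if j not in seen:
                scores[j] += c
    return sorted(scores, key=scores.get, reverse=True)[:k]

# e.g. recommend("u1") -> ["c"], since "c" co-occurs with both "a" and "b"
```

A production engine would blend many more signals (geo, context, social, behavior, as the slide lists) and optimize for CTR and post-click engagement rather than raw co-occurrence, but the scoring skeleton is the same.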
4. Largest Content Discovery and Monetization Network
• 550M monthly unique users
• 240B monthly recommendations
• 10B+ daily user events
• 5TB+ incoming daily data
5. • Using Spark in production since v0.8
• 6 data centers across the globe
• Dedicated Spark & Cassandra (for Spark) cluster consisting of 5000+ cores with 35TB of RAM and ~1PB of local SSD storage, across 2 data centers.
• Data must be processed and analyzed in real time, for example:
– Real-time, per-user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithm calibration
– Real-time analytics
What Does it Mean?
7. • Spark DataFrames: Simple and Fast Analysis of Structured Data
https://spark-summit.org/2015/events/spark-dataframes-simple-and-fast-analysis-of-structured-data/
DataFrames
10. • From DataFrames to Tungsten: A Peek into Spark's Future
https://spark-summit.org/2015/events/keynote-9/
• Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal
https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/
Tungsten
14. • Spark and Spark Streaming at Netflix
https://spark-summit.org/2015/events/spark-and-spark-streaming-at-netflix/
Interesting Users’ Experience - Netflix
16. • How Spark Fits into Baidu's Scale
https://spark-summit.org/2015/events/keynote-10/
Interesting Users’ Experience - Baidu
18. • Recipes for Running Spark Streaming Applications in Production
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/
Databricks Practical Talks – Spark Streaming
36. SparkContext, SQLContext, and ZeppelinContext are automatically created and exposed as the variables 'sc', 'sqlContext', and 'z', respectively, in both the Scala and Python environments.
General Variables In Zeppelin
39. • Connect Zeppelin to the cluster (not standalone)
• Load raw sessions data
• Run code (Python/Scala) for algorithmic analysis
Zeppelin @Taboola - What's next?
Tungsten motivation: single-core CPU speeds have stayed roughly flat for the last ~10 years, so Spark needs to optimize the code it runs rather than wait for faster hardware:
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
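Point (1) can be illustrated with a toy analogy: interpreting an expression tree pays the tree-walking overhead on every row, while generating and compiling the expression once pays it only at compile time. This is a hand-rolled Python sketch of the idea, not Spark's actual bytecode generation:

```python
# A tiny expression "interpreter" vs. generated code, illustrating why
# Tungsten compiles query expressions instead of interpreting them per row.

def interpret(expr, row):
    """Walk a nested tuple tree, e.g. ("add", ("col", 0), ("lit", 5)),
    for every single row -- branch and recursion overhead each time."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "lit":
        return expr[1]
    if op == "add":
        return interpret(expr[1], row) + interpret(expr[2], row)

def codegen(expr):
    """Emit a Python source fragment once, then compile it to a function:
    the tree-walking cost is paid at compile time, not per row."""
    def emit(e):
        if e[0] == "col":
            return f"row[{e[1]}]"
        if e[0] == "lit":
            return repr(e[1])
        return f"({emit(e[1])} + {emit(e[2])})"
    return eval(f"lambda row: {emit(expr)}")

expr = ("add", ("col", 0), ("lit", 5))
compiled = codegen(expr)
assert interpret(expr, [10]) == compiled([10]) == 15
```

Both paths compute the same result; the generated function simply has no interpretive overhead left in its hot loop, which is the effect Tungsten's runtime code generation targets.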