Why would you care? Because PySpark is a cloud-agnostic analytics tool for Big Data processing, "hidden" in:
* AWS Glue - Managed ETL Service
* Amazon EMR - Big Data Platform
* Google Cloud Dataproc - Cloud-native Spark and Hadoop
* Azure HDInsight - Microsoft's implementation of Apache Spark in the cloud
In this #ServerlessTO talk, Jonathan Rioux, Head of Data Science at EPAM Canada and author of the book PySpark in Action (https://www.manning.com/books/pyspark-in-action), will get you acquainted with PySpark, the Python API for Spark.
Event details: https://www.meetup.com/Serverless-Toronto/events/269124392/
Event recording: https://youtu.be/QGxytMbrjGY
As always, BIG thanks to our knowledge sponsor Manning Publications, who generously offered to raffle not 1 but 3 of Jonathan's books!
RSVP for more exciting (online) events at https://www.meetup.com/Serverless-Toronto/events/
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin... (Edureka!)
**PySpark Certification Training: https://www.edureka.co/pyspark-certification-training**
This Edureka PySpark tutorial will give you detailed and comprehensive knowledge of PySpark: how it works and why Python pairs so well with Apache Spark. You will also learn about RDDs, DataFrames, and MLlib.
How does that PySpark thing work? And why does Arrow make it faster? (Rubén Berenguel)
Back in ye olde days of Spark, using Python with Spark was an exercise in patience. Data was constantly being serialised as it moved up and down between Python and Scala. Leveraging Spark SQL and avoiding UDFs made things better, as did the steady improvement of the optimisers (Catalyst and Tungsten). But with Spark 2.3, PySpark has sped up tremendously thanks to the (still experimental) addition of the Arrow serialisers.
In this talk we will learn how PySpark has improved its performance in Apache Spark 2.3 by using Apache Arrow. To do this, we will travel through the internals of Spark to find how Python interacts with the Scala core, and some of the internals of Pandas to see how data moves from Python to Scala via Arrow.
https://github.com/rberenguel/pyspark-arrow-pandas
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E... (Edureka!)
This Edureka Spark SQL tutorial will help you understand how Apache Spark offers SQL power in real time. It also demonstrates a use case on stock market analysis using Spark SQL. Below are the topics covered in this tutorial:
1) Limitations of Apache Hive
2) Spark SQL Advantages Over Hive
3) Spark SQL Success Story
4) Spark SQL Features
5) Architecture of Spark SQL
6) Spark SQL Libraries
7) Querying Using Spark SQL
8) Demo: Stock Market Analysis With Spark SQL
Slides for the Data Syndrome one-hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib, and exploratory data analysis with PySpark, and shows how to use pylab with Spark to create histograms.
Project Tungsten: Bringing Spark Closer to Bare Metal (Databricks)
As part of the Tungsten project, Spark has started an ongoing effort to dramatically improve performance by bringing execution closer to bare metal. In this talk, we'll go over the progress that has been made so far and the areas we're looking to invest in next. We will discuss the architectural changes being made, as well as how Spark users can expect their applications to benefit from this effort. The focus of the talk will be on Spark SQL, but the improvements are general and applicable to multiple Spark technologies.
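To make the "closer to bare metal" idea concrete, here is a toy, pure-Python illustration of one Tungsten theme: laying rows out in a packed binary format with fixed offsets instead of as boxed objects. This is an analogy, not Spark's actual memory layout:

```python
import struct

# One (int64, float64) row costs exactly 16 bytes in this packed layout,
# versus the per-object overhead of a tuple of boxed numbers.
ROW = struct.Struct("<qd")  # little-endian int64 id, float64 value

def pack_rows(rows):
    """Serialize rows into one contiguous byte buffer."""
    buf = bytearray()
    for rid, val in rows:
        buf += ROW.pack(rid, val)
    return bytes(buf)

def unpack_row(buf, i):
    """Random access by offset; no other row is deserialized."""
    return ROW.unpack_from(buf, i * ROW.size)

data = pack_rows([(1, 1.5), (2, 2.5), (3, 3.5)])
middle = unpack_row(data, 1)
```

Fixed-width binary rows give predictable offsets and cache-friendly scans, which is the flavor of benefit Tungsten's off-heap format and code generation aim for.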
This introductory workshop is aimed at data analysts and data engineers new to Apache Spark, and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts and you'll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you'll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect and how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Spark Streaming Programming Techniques You Should Know with Gerard Maas (Spark Summit)
At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can directly operate on RDDs, transform them to Datasets to make use of that abstraction or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, we will show practical uses of the forgotten ‘ConstantInputDStream’ and will explain how to combine Spark Streaming with probabilistic data structures to optimize the use of memory in order to improve the resource usage of long-running streaming jobs. Attendees of this session will come out with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.
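The memory-bounding idea mentioned at the end can be illustrated with a tiny Bloom filter; this is a generic sketch in plain Python, not Spark Streaming API code, and the sizes are arbitrary:

```python
import hashlib

class BloomFilter:
    """Constant-memory membership test: false positives are possible,
    false negatives are not. Long-running streaming jobs use structures
    like this instead of keeping every key ever seen in memory."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # the whole filter is one fixed-size bitmap

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

seen = BloomFilter()
for user in ["alice", "bob"]:
    seen.add(user)
```

Memory stays fixed at `size_bits` no matter how many items stream through, at the cost of a tunable false-positive rate.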
A brief introduction to Spark ML with PySpark for Alpine Academy Spark Workshop #2. This workshop covers basic feature transformation, model training, and prediction. See the corresponding github repo for code examples https://github.com/holdenk/spark-intro-ml-pipeline-workshop
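The Transformer/Estimator pattern the workshop builds on can be sketched in plain Python; the class names only mirror pyspark.ml concepts, and nothing below is the actual Spark API:

```python
class Scaler:
    """Estimator: fit() learns a statistic from data and returns a
    fitted Model, which is itself a transformer."""

    def fit(self, xs):
        peak = max(xs)

        class Model:
            def transform(self, ys):
                return [y / peak for y in ys]

        return Model()

class Pipeline:
    """Chain of estimators: fit each stage, transform, pass data on."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, xs):
        models, data = [], xs
        for stage in self.stages:
            model = stage.fit(data)
            data = model.transform(data)
            models.append(model)
        return models

models = Pipeline([Scaler()]).fit([2.0, 4.0, 8.0])
scaled = models[0].transform([4.0])
```

The payoff of the pattern, in Spark ML as here, is that training and prediction share one declared sequence of stages, so the same pipeline applies cleanly to new data.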
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe... (Richard Seymour)
A tour of PySpark Streaming in Apache Spark, with an example calculating CPU usage using the Docker stats API. Two buzzwordy technologies for the price of one.
Making Nested Columns as First Citizen in Apache Spark SQL (Databricks)
Apple Siri is the world's largest virtual assistant service, powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod. We use large amounts of data to provide our users the best possible personalized experience. Our raw event data is cleaned and pre-joined into a unified dataset for our data consumers to use. To keep the rich hierarchical structure of the data, our schemas are deeply nested structures. In this talk, we will discuss how Spark handles nested structures in Spark 2.4, and we'll show the fundamental design issues in reading nested fields that were not well considered when Spark SQL was designed. These issues result in Spark SQL reading unnecessary data in many operations. Given that Siri's data is deeply nested and enormous, this soon becomes a bottleneck in our pipelines. We will then talk about the various approaches we have taken to tackle this problem. By making nested columns first-class citizens in Spark SQL, we can achieve dramatic performance gains. In some of our production queries, the speed-up is 20x in wall-clock time with 8x less data read. All of our work will be open source, and some has already been merged upstream.
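A toy illustration of why nested-column pruning matters: if every leaf is stored as its own column (as in Parquet), a query touching one nested field need not read its siblings. The paths and the read-tracking here are invented for illustration:

```python
# Toy columnar store: each nested leaf path is its own column.
columns = {
    "user.id":      [1, 2, 3],
    "user.locale":  ["en", "fr", "de"],
    "payload.body": ["...", "...", "..."],
}

reads = []  # record which columns were actually scanned

def scan(path):
    reads.append(path)
    return columns[path]

def query_locales():
    # A pruning-aware reader touches only the leaf it needs;
    # user.id and payload.body are never scanned.
    return scan("user.locale")

locales = query_locales()
```

Without pruning, reading `user.locale` would force materializing the whole `user` struct, which is exactly the unnecessary I/O the talk describes eliminating.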
This presentation will be useful to those who would like to get acquainted with Apache Spark's architecture and top features, and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Presented at the Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Writing Continuous Applications with Structured Streaming Python APIs in Apac... (Databricks)
Abstract:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this talk we will explore the concepts and motivations behind continuous applications, see how the Structured Streaming Python APIs in Apache Spark 2.x enable writing them, examine the programming model behind Structured Streaming, and look at the APIs that support it.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You'll walk away with an understanding of what a continuous application is, an appreciation of the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
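The Structured Streaming model described above (a stream as an unbounded table, with the query result maintained incrementally per micro-batch) can be sketched conceptually in plain Python; this is a mental model, not the Spark API:

```python
from collections import defaultdict

class RunningCount:
    """An incremental 'query': its state is the result table so far,
    updated as each micro-batch of events arrives."""

    def __init__(self):
        self.state = defaultdict(int)

    def process_batch(self, events):
        # Incremental update: only the new events are processed,
        # never the whole history.
        for key in events:
            self.state[key] += 1
        return dict(self.state)  # the current "result table"

query = RunningCount()
query.process_batch(["click", "view", "click"])
result = query.process_batch(["view"])
```

In real Structured Streaming, the engine maintains this state for you and the "query" is ordinary DataFrame/SQL code; the point is the same batch-like API over unbounded input.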
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ... (Edureka!)
This Edureka Spark Streaming tutorial will help you understand how to use Spark Streaming to stream data from Twitter in real time and then process it for sentiment analysis. This tutorial is ideal for beginners as well as professionals who want to learn or brush up on their Apache Spark concepts. Below are the topics covered in this tutorial:
1) What is Streaming?
2) Spark Ecosystem
3) Why Spark Streaming?
4) Spark Streaming Overview
5) DStreams
6) DStream Transformations
7) Caching/ Persistence
8) Accumulators, Broadcast Variables and Checkpoints
9) Use Case – Twitter Sentiment Analysis
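At its core, the sentiment use case boils down to scoring tweets against a lexicon; a minimal sketch, with made-up word lists:

```python
# Tiny illustrative lexicons; real sentiment lexicons are far larger.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "awful", "sad"}

def sentiment(tweet):
    """Score a tweet by counting positive vs negative words."""
    words = tweet.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

labels = [sentiment(t) for t in
          ["I love spark", "awful weather", "just ok"]]
```

In the streaming version, a function like `sentiment` would simply be mapped over each micro-batch of tweets arriving from the Twitter source.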
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ... (Edureka!)
**PySpark Certification Training: https://www.edureka.co/pyspark-certification-training**
This Edureka tutorial on PySpark Training will help you learn about the PySpark API. You will get to know how Python can be used with Apache Spark for Big Data analytics. Edureka's structured training on PySpark will help you master the skills required to become a successful Spark developer using Python, and prepare you for the Cloudera Hadoop and Spark Developer Certification Exam (CCA175).
Machine learning in the enterprise is an iterative process. Data scientists will tweak or replace their learning algorithm on a small data sample until they find an approach that works for the business problem, and then apply the analytics to the full data set. Apache SystemML is a new system that accelerates this kind of exploratory algorithm development for large-scale machine learning problems. SystemML provides a high-level language to quickly implement and run machine learning algorithms on Spark. SystemML's cost-based optimizer takes care of low-level decisions about how to use Spark's parallelism, allowing users to focus on the algorithm and the real-world problem it is trying to solve. This talk will introduce you to SystemML and get you started building declarative analytics with SystemML using a simple Zeppelin notebook running in an Apache Spark environment.
Strata NYC 2015 - What's coming for the Spark community (Databricks)
In the last year Spark has seen substantial growth in adoption as well as the pace and scope of development. This talk will look forward and discuss both technical initiatives and the evolution of the Spark community.
On the technical side, I’ll discuss two key initiatives ahead for Spark. The first is a tighter integration of Spark’s libraries through shared primitives such as the data frame API. The second is across-the-board performance optimizations that exploit schema information embedded in Spark’s newer APIs. These initiatives are both designed to make Spark applications easier to write and faster to run.
On the community side, this talk will focus on the growing ecosystem of extensions, tools, and integrations evolving around Spark. I’ll survey popular language bindings, data sources, notebooks, visualization libraries, statistics libraries, and other community projects. Extensions will be a major point of growth in the future, and this talk will discuss how we can position the upstream project to help encourage and foster this growth.
Jump Start into Apache® Spark™ and Databricks (Databricks)
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy-to-use, unified engine that lets you solve many Data Science and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad... (Provectus)
In this demo-based talk with live coding, we'll present a functional, typeful framework for developing Apache Spark applications. We'll walk through the following key topics:
* turning unmanageable Spark scripts into typeful Spark functions
* serverless deployment of Spark functions into the cloud
* unit testing Spark functions to save cluster resources and developer time
* seamless Spark session management between concurrent Spark jobs in exclusive or shared modes
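The "typeful Spark functions" point can be sketched as keeping job logic a pure function over any iterable, so it unit-tests on a plain list and only meets a cluster in production; this is an illustration of the idea, not Mist's API:

```python
def top_domains(emails, n=2):
    """Pure business logic: count e-mail domains, return the n biggest.
    Takes any iterable of strings, so tests never need a cluster."""
    counts = {}
    for addr in emails:
        domain = addr.rsplit("@", 1)[-1]
        counts[domain] = counts.get(domain, 0) + 1
    # Sort by descending count, then name for a deterministic order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Unit test runs on a plain list, saving cluster resources and time:
sample = ["a@x.com", "b@x.com", "c@y.org"]
result = top_domains(sample)
```

In production the same function could be fed rows collected from an RDD or DataFrame (or re-expressed with Spark operations), while the tested logic stays identical.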
Agile Data Science 2.0 (O'Reilly 2017) defines a methodology and a software stack with which to apply the methods. *The methodology* seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. *The stack* is but an example of one meeting the requirements that it be utterly scalable and utterly efficient in use by application developers as well as data engineers. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Airflow (incubating), MongoDB, Elasticsearch, Apache Parquet, Python/Flask, jQuery. This talk covers the full lifecycle of large data application development and shows how to use lessons from agile software engineering to apply data science with this full stack to build better analytics applications. The system starts with plumbing, moves on to data tables, charts and search, through interactive reports, and builds toward predictions in both batch and real time (defining the role of each), the deployment of predictive systems, and how to iteratively improve predictions that prove valuable.
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ... (Julian Hyde)
A talk given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin.
Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.
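The polyalgebra idea of separating the user's expression from the engine can be sketched with one tiny expression node and two "backends"; the node type and SQL dialect are invented for illustration, and Ibis/Calcite are far richer:

```python
class Filter:
    """One relational-algebra node. The user builds the expression once;
    each backend decides how to execute it."""

    def __init__(self, table, column, value):
        self.table, self.column, self.value = table, column, value

    def to_sql(self):
        # Backend 1: compile to a SQL string for a remote engine
        # (the Impala/Drill role in the talk).
        return f"SELECT * FROM {self.table} WHERE {self.column} > {self.value}"

    def evaluate(self, rows):
        # Backend 2: run the same algebra locally on Python dicts.
        return [r for r in rows if r[self.column] > self.value]

expr = Filter("trades", "price", 100)
sql = expr.to_sql()
local = expr.evaluate([{"price": 90}, {"price": 150}])
```

Because the expression is data rather than engine-specific code, an optimizer can rewrite it and results can be cached independently of which backend ultimately runs it.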
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S... (BigDataEverywhere)
Paco Nathan, Director of Community Evangelism at Databricks
Apache Spark is intended as a fast and powerful general purpose engine for processing Hadoop data. Spark supports combinations of batch processing, streaming, SQL, ML, Graph, etc., for applications written in Scala, Java, Python, Clojure, and R, among others. In this talk, I'll explore how Spark fits into the Big Data landscape. In addition, I'll describe other systems with which Spark pairs nicely, and will also explain why Spark is needed for the work ahead.
Enabling Exploratory Analysis of Large Data with Apache Spark and R (Databricks)
R has evolved to become an ideal environment for exploratory data analysis. The language is highly flexible - there is an R package for almost any algorithm and the environment comes with integrated help and visualization. SparkR brings distributed computing and the ability to handle very large data to this list. SparkR is an R package distributed within Apache Spark. It exposes Spark DataFrames, which was inspired by R data.frames, to R. With Spark DataFrames, and Spark’s in-memory computing engine, R users can interactively analyze and explore terabyte size data sets.
In this webinar, Hossein will introduce SparkR and how it integrates the two worlds of Spark and R. He will demonstrate one of the most important use cases of SparkR: the exploratory analysis of very large data. Specifically, he will show how Spark’s features and capabilities, such as caching distributed data and integrated SQL execution, complement R’s great tools such as visualization and diverse packages in a real world data analysis project with big data.
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data... (Big Data Spain)
This talk describes how open source Hue [1] was built in order to provide a better Hadoop User Experience. The underlying technical details of its architecture, the lessons learned and how it integrates with Impala, Search and Spark under the cover will be explained.
DataScienceLab2017: Serving models built on big data with A... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Serving models built on big data with Apache Spark
Stepan Pushkarev (GM (Kazan) at Provectus / CTO at Hydrosphere.io)
After preparing data and training models on big data with Apache Spark, the question arises of how to use the trained models in real applications. Beyond the model itself, it is important not to forget the entire data pre-processing pipeline, which must reach production in exactly the form the data scientist designed and implemented it. Solutions such as PMML/PFA, based on exporting and importing the model and algorithm, have obvious drawbacks and limitations. In this talk we propose an alternative solution that simplifies the use of models and pipelines in real production applications.
All materials are available at http://datascience.in.ua/report2017
Title:
Real-time, Advanced Analytics and Recommendations using Machine Learning, Natural Language Processing, Graph Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
*Bio*
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer. Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com. Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
*Related Links*
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...Daniel Zivkovic
Serverless Toronto's 6th-anniversary event helps IT pros understand and prepare for the #GenAI tsunami ahead. You'll gain situational awareness of the LLM Landscape, receive condensed insights, and actionable advice about RAG in 2024 from Google AI Lead Mark Ryan and LlamaIndex creator Jerry Liu. We chose #RAG (Retrieval-Augmented Generation) because it is the predominant paradigm for building #LLM (Large Language Model) applications in enterprises today - and that's where the jobs will be shifting. Here is the recording: https://youtu.be/P5xd1ZjD-Os?si=iq8xibj5pJsJ62oW
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Daniel Zivkovic
Two #ModernDataStack talks and one DevOps talk: https://youtu.be/4R--iLnjCmU
1. "From Data-driven Business to Business-driven Data: Hands-on #DataModelling exercise" by Jacob Frackson of Montreal Analytics
2. "Trends in the #DataEngineering Consulting Landscape" by Nadji Bessa of Infostrux Solutions
3. "Building Secure #Serverless Delivery Pipelines on #GCP" by Ugo Udokporo of Google Cloud Canada
We ran out of time for the 4th presenter, so the event will CONTINUE in March... stay tuned! Compliments of #ServerlessTO.
Opinionated re:Invent recap with AWS Heroes & BuildersDaniel Zivkovic
AWS Heroes & Builders from Bosnia, Montenegro, Serbia and Canada share their impressions of re:Invent 2022, the most important announcements, opinions about where #AWS is going next, and how that will impact you: https://youtu.be/KfkQU8QbQ4U
* Dzenan Dzevlan - AWS Community Hero, AWS Authorized Instructor & AWS User Group Bosnia leader
* Goran Opacic - AWS Data Hero, CEO @ Esteh & AWS User Group Belgrade leader
* Dzenana Dzevlan - AWS Community Builder, Production Engineer @ Yahoo & AWS User Group Bosnia leader
* Marin Radjenovic - AWS Community Builder, Cloud Architect @ Crayon & AWS User Group Montenegro leader
* Andrew Brown - AWS Community Hero, GCP Champion Innovator, CEO @ ExamPro & AWS Ontario Virtual User Group leader
TABLE OF CONTENTS
00:00:00 Roundtable discussion
00:55:10 Q&A
00:57:45 Why you should watch this video!
00:59:35 Panelists intro
01:06:11 How it felt to be at #reInvent 2022
01:07:19 Manning Publications raffle
01:08:15 #ServerlessTO past & future
LINKS FROM THE MEETUP CHAT
https://www.linkedin.com/in/dzenanadzevlan/
https://twitter.com/DzenanaDzevlan
https://www.linkedin.com/in/sqlheisenberg/
https://twitter.com/sqlheisenberg
https://www.linkedin.com/in/marinradjenovic/
https://twitter.com/marin_ra
https://medium.com/@marinradjenovic
https://www.linkedin.com/in/goranopacic/
https://twitter.com/goranopacic
https://hachyderm.io/@goranopacic/
https://www.linkedin.com/in/andrew-wc-brown/
https://twitter.com/andrewbrown
https://www.youtube.com/playlist?list=PLBfufR7vyJJ7k25byhRXJldB5AiwgNnWv
AWS Java Panel #2 SnapStart and SpringCloud AWS: https://www.youtube.com/watch?v=nhwgm9J4F9A
Top Announcements of AWS re:Invent 2022: https://aws.amazon.com/blogs/aws/top-announcements-of-aws-reinvent-2022/
AWS Supply Chain https://aws.amazon.com/aws-supply-chain/
Serverless MySQL https://planetscale.com/
MORE EVENTS LIKE THIS
* past interactive lectures at: http://youtube.serverlesstoronto.org/
* upcoming events: https://www.meetup.com/Serverless-Toronto/events/
Google Cloud Next '22 Recap: Serverless & Data editionDaniel Zivkovic
See what's new in #Serverless and #Data at GCP. Our guest, Guillaume Blaquiere - Stack Overflow contributor & #GCP #Developer Expert from France, covered the best #GoogleCloudNext announcements, practically demoed how to benefit from #BigQuery Remote Functions and answered many questions.
The meetup recording with TOC for easy navigation is at https://youtu.be/AuZZTwHIcdY
P.S. For more interactive lectures like this, go to http://youtube.serverlesstoronto.org/ or sign up for our upcoming live events at https://www.meetup.com/Serverless-Toronto/events/
Conversational Document Processing AI with Rui CostaDaniel Zivkovic
Learn how to bridge the gap between #ConversationalAI and #DocumentProcessing with #GCP guru and #OReilly "#GoogleCloud Cookbook" author Rui Costa. Even if #Chatbots and #DocumentManagement #automation are not your "cup of tea", getting access to the #sourcecode of his end-to-end #Serverless solution (with #Dialogflow, #Flutter, #Firebase, #Firestore, #AppEngine, #CloudRun) is priceless: https://forms.gle/domTVAQxUN6AthFz5
Proudly brought to you by #ServerlessTO: http://youtube.serverlesstoronto.org/
How to build unified Batch & Streaming Pipelines with Apache Beam and DataflowDaniel Zivkovic
Apache Beam is a beautiful framework that blurs the line between Batch and Streaming, so check out this interactive tutorial by Patrick Lecuyer - Head of Specialist Customer Engineering at Google Canada. His examples run on GCP Dataflow, but what you'll learn will be portable across clouds, and distributed processing engines like Apache Flink, Apache Samza, Apache Spark, IBM Streams... regardless of where you do your Big Data processing!
The meetup recording with TOC for easy navigation is at https://youtu.be/7pUYKX40RfA.
P.S. For more interactive lectures like this, go to http://youtube.serverlesstoronto.org/ or sign up for our upcoming live events at https://www.meetup.com/Serverless-Toronto/events/
Gojko's 5 rules for super responsive Serverless applicationsDaniel Zivkovic
Gojko Adzic (#AWS Serverless Hero, Trainer, Entrepreneur & Book Author) shares 5 important Architectural ideas to make request processing lightning fast with #Serverless deployments. Video at https://youtu.be/XLLdWYdJ4Vw
P.S. For more interactive lectures like this, go to http://youtube.serverlesstoronto.org/ or sign up for our upcoming live events at https://www.meetup.com/Serverless-Toronto/events/
Retail Analytics and BI with Looker, BigQuery, GCP & Leigha JarettDaniel Zivkovic
Leigha Jarett of GCP explains how to bring Cloud "superpowers" to your Data and modernize your Business Intelligence with Looker, BigQuery and Google Cloud services on an example of Cymbal Direct - one of Google Cloud's demo brands. The meetup recording with TOC for easy navigation is at https://youtu.be/BpzJU_S40ic.
P.S. For more interactive lectures like this, go to http://youtube.serverlesstoronto.org/ or sign up for our upcoming live events at https://www.meetup.com/Serverless-Toronto/events/
The entire AWS Serverless Developer Advocates team recaps the news from Amazon Web Services & answers many serverless questions, so the event felt like a mini re:Invent. The meetup recording with TOC for easy navigation is at https://www.youtube.com/watch?v=Y4vMXsY2Pc4.
Thank you @talia_nassi, @edjgeek, @benjamin_l_s, @julian_wood and @jbesw for visiting our Serverless Toronto community!
P.S. For more interactive lectures like this, go to http://youtube.serverlesstoronto.org/ or sign up for our upcoming live events at https://www.meetup.com/Serverless-Toronto/events/
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersDaniel Zivkovic
#MLOps is a hot buzzword, just like #DevOps before it. It sparked a gold rush for software vendors, so it's hard to choose the best tool for your needs. Vertex AI is a unified MLOps platform for the entire #AI #workflow on #GoogleCloud. It is the 3rd iteration of the Google Cloud #ML platform (since its original launch), and we think they did it right (this time).
That's why #ServerlessTO invited 2 AI/ML gurus from #GCP (Jarek Kazmierczak & Brian Kang) to introduce you to #VertexAI.
The lecture recording with Q&A is at https://youtu.be/X1S7360ip-k
MEETUP "CODE-ALONG" RESOURCES
Vertex workbench - Managed and User-managed Notebooks
https://cloud.google.com/vertex-ai/docs/workbench/managed/quickstarts
Example that the training code was based on - Fashion MNIST dataset
https://www.tensorflow.org/tutorials/keras/classification
Hyperparameter tuning codelab
https://codelabs.developers.google.com/vertex_hyperparameter_tuning
Vertex pipeline codelabs
https://codelabs.developers.google.com/vertex-pipelines-intro
https://codelabs.developers.google.com/vertex-pipelines-custom-model
CI/CD slides
https://github.com/shivajid/MLOpsCICD/blob/master/presentation/AI%20Workshop%20Day4.pdf
CI/CD github example
https://github.com/shivajid/MLOpsCICD
Model monitoring example
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/model_monitoring/model_monitoring.ipynb
Best practices for MLOps
https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
https://cloud.google.com/resources/mlops-whitepaper
Official Vertex AI Github repository
https://github.com/GoogleCloudPlatform/vertex-ai-samples/
MEETUP CHAT LINKS
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/notebook_template.ipynb
https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/master/notebooks/official/custom
https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/master/notebooks/community/sdk
https://cloud.google.com/architecture/ml-on-gcp-best-practices#model-deployment-and-serving
https://www.youtube.com/watch?v=ntBEQdD1IeQ&list=PLd31CCJlr9FrZazLqRg1Lxq7xw9b6VNP6&index=3
Empowering Developers to be Healthcare HeroesDaniel Zivkovic
Learn from Dr. Kevin Maloy in 1hr how to write Healthcare Apps to connect to EHR systems, instead of spending weeks to become fluent in HL7 SMART on FHIR standard. Kevin is a practicing, board-certified Emergency Medicine physician who also codes. The meetup recording (with Q&A) is at https://youtu.be/alB-45nu0lo
Get started with Dialogflow & Contact Center AI on Google CloudDaniel Zivkovic
Google #ConversationalAI expert Lee Boonstra explains how to build Enterprise Chatbots and Telephony (#CcaaS #CallCenter) Agents using #Dialogflow, #CCAI and other #GoogleCloud #Serverless services. Courtesy of #ServerlessTO.
The lecture recording with Q&A is at https://youtu.be/apyr6dgx52Q
Building a Data Cloud to enable Analytics & AI-Driven Innovation - Lak Lakshm...Daniel Zivkovic
Learn how Google Cloud addresses the key challenges when building an Agile Data & AI platform. This lecture is important regardless of which Cloud you are using (or will be), because most businesses face the same 6 challenges:
1. High-quality AI requires a lot of data
2. AI Expertise is in high demand
3. Getting the value of ML requires a modern data platform
4. Activating ML requires surfacing AI into decision UIs
5. Operationalizing ML is hard
6. State-of-the-art changes rapidly
The lecture recording with Q&A is at https://youtu.be/ntBEQdD1IeQ
Smart Cities of Italy: Integrating the Cyber World with the IoTDaniel Zivkovic
Plant the #SmartCity #IoT seed in your community by borrowing some production-ready projects from #Messina, Italy! There are plenty of ideas to choose from at http://SmartMe.io, http://smartme.unime.it/ & https://github.com/MDSLab. Our guest Antonio Puliafito explained how Smart Messina technology works and shared many tips for succeeding on your next Smart/Connected Community IoT Initiative.
Event recording is at https://youtu.be/-jLLfE8fRH8
Doubting it's possible to implement that in your community? Or just not sure you can spare 1.5 hours to watch this #Serverless #Toronto meetup? Then, watch this 5min CNET video from 2017 and get inspired (like we did :) https://www.cnet.com/videos/sicilys-smart-cities-show-its-getting-easier-to-get-smart/
And if you'll have any questions for Antonio and his team, post them to the #smart-city channel of http://slack.serverlesstoronto.org/, and the University of Messina researchers will get back to you!
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...Daniel Zivkovic
Take a peek into the future of IT - beyond Serverless Software Development, when Serverless becomes a way to run Internal IT.
When ServerlessToronto.org invited Joe Emison - AWS Serverless Hero, we expected to see how he "knocked down the wall" between AWS & Google Clouds (to query Amazon DynamoDB from Google BigQuery) using the Fivetran ELT tool, but we learned so much more... and you will too: https://youtu.be/GK5Ivm6EOlI
This is my Architecture to prevent Cloud Bill ShockDaniel Zivkovic
“Fail Fast and Learn Fast” with Cloud is a bad idea because Cloud overall is a double-edged sword: used correctly, it can be of great use, but it can be lethal if misused. In this meetup, Sudeep Chauhan, founder of ToMilkieWay.com, shared his “near business death” experience after a GCP experiment ended with a $72,000 bill shock.
Infinite Recursions are a common problem, so this talk is useful to developers from any public Cloud. Sudeep explained the mistakes he made, and the lessons he learned - so the rest of us can avoid similar near-Bankruptcy incidents. Thank you, Sudeep!
P.S. Watch the recording at http://youtube.ServerlessToronto.org and, for more forward-looking #Software #Development topics, join the http://ServerlessToronto.org User Group
LINKS FROM THE MEETUP & CHAT
https://www.askyourdeveloper.com/
https://svpg.com/empowered-ordinary-people-extraordinary-products/
https://www.youtube.com/playlist?list=PLd31CCJlr9FrZazLqRg1Lxq7xw9b6VNP6
https://www.meetup.com/Serverless-Toronto/events/276752609/
https://www.meetup.com/Serverless-Toronto/events/277272390/
https://www.snowflake.com/trending/data-cloud-storage
https://aisoftwarellc.weebly.com/books.html
https://tomilkieway.com/
https://blog.tomilkieway.com/72k-1/
https://blog.tomilkieway.com/72k-2/
https://sudcha.com/guide-to-cloud/
https://announce.today
https://pointaddress.com
https://maia.rest/point
https://wikimapia.org
https://cloudopty.com/
Gregor Hohpe "No one wants a server - a fresh look at Cloud strategy": https://www.youtube.com/watch?v=ACT2tXhFCDk
Adrian Cockcroft compares Vendor Lock-in to Dating: https://www.slideshare.net/AmazonWebServices/digital-transformation-arc219-reinvent-2017/85
Survey to plan #ServerlessTO Community future: https://forms.gle/BUiHVT3ZCp1dcuoH7
Our learning sponsor: https://www.manning.com/
Lunch & Learn BigQuery & Firebase from other Google Cloud customersDaniel Zivkovic
1) Migrating your on-prem #Enterprise #Data #Warehouse into the #Cloud? Here is what you need to learn (and unlearn) when designing a modern Cloud #DataWarehouse in #BigQuery!
2) Launching a #Startup? See how to supercharge your idea with #Firebase!
Watch the recording at https://youtu.be/zezhXNqD0rs and, for more forward-looking talks on #Cloud #Architectures & #DataEngineering, join the http://ServerlessToronto.org User Group.
Azure for AWS & GCP Pros: Which Azure services to use?Daniel Zivkovic
Learn how to choose which #Azure services to use so that you can start "Jumping Clouds" with confidence :) Watch the recording at https://youtu.be/34U1hUJmCUc and, for more forward-looking #Software #Development topics, join the http://ServerlessToronto.org User Group
LINKS FROM THE MEETUP & CHAT
https://www.askyourdeveloper.com/
http://youtube.serverlesstoronto.org
https://youtu.be/Ivcndg9pTpk?t=1390
https://www.meetup.com/Serverless-Toronto/events/276721419/
https://www.meetup.com/Serverless-Toronto/events/275256767/
https://www.meetup.com/Serverless-Toronto/events/276752609/
https://developerweeklypodcast.com/
https://channel9.msdn.com/Shows/Azure-Friday
https://www.pluralsight.com/paths/microsoft-azure-compute-for-developers
https://azureoverview.com/
https://build5nines.com/
https://azure.microsoft.com/en-us/updates/
https://azure.microsoft.com/en-us/blog/
https://docs.microsoft.com/en-us/azure/architecture/
https://www.mssqltips.com/sqlservertip/5144/sql-server-temporal-tables-vs-change-data-capture-vs-change-tracking--part-3/
https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/
https://www.manning.com/books/azure-data-engineering
https://www.manning.com/books/azure-storage-streaming-and-batch-analytics
https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csharp
https://cloudevents.io/
https://docs.microsoft.com/en-us/azure/architecture/patterns/
https://www.linkedin.com/pulse/you-asking-your-team-design-perfect-solution-daniel-zivkovic/
https://youtu.be/GBTdnfD6s5Q
https://www.linkedin.com/company/serverless-toronto/
Serverless Evolution during 3 years of Serverless TorontoDaniel Zivkovic
Four presentations for the 3rd Birthday of our User Group! After a short overview of the Serverless Mindset (regardless of your tech stack), see:
1. how #Serverless has changed Software Development Process (Gareth McCumskey of Serverless.com) and a demo of Serverless Desktop (https://github.com/serverless/desktop)
2. How small teams achieve BIG things with Firebase and #GCP Serverless Services (Kudzanai Murefu of Strma.io)
3. See folks competing to get involved with "COVID-19 Vaccination Passport", a project with a greater moral purpose in today's "upside-down world" (David Janes of Consensas.com)
4. A reflection on the Serverless evolution and optimism for the future of Serverless (and Startups) as the line between its ecosystem and other Cloud-native Technologies keeps blurring (Mike Apted of #AWS #Startups).
BONUS
1. Recording https://youtu.be/mdxT929JJoE
2. Invitation https://www.meetup.com/Serverless-Toronto/events/273716629/
3. For more forward-looking #Software #Development topics, join the #ServerlessTO User Group
LINKS FROM THE MEETUP
https://www.askyourdeveloper.com/
https://www.meetup.com/en-AU/lean-product/
https://www.linkedin.com/in/marcbrouillard/
https://www.youtube.com/watch?t=1390&v=Ivcndg9pTpk
https://youtu.be/8Rzv68K8ZOY
https://www.youtube.com/watch?t=2304&v=SPsaqiegOP4
https://www.manning.com/
https://www.serverless.com/author/garethmccumskey/
https://www.linkedin.com/in/kudzanai-murefu-7b128886/
https://www.linkedin.com/in/davidjanes/
https://www.linkedin.com/in/mikeapted/
https://serverless.com/slack
https://github.com/serverless/desktop
https://strma.io
https://cccc4.ca/
https://passport.consensas.com/
https://github.com/Consensas/information-passport/tree/main/docs
https://dpjanes.medium.com/
https://en.wikipedia.org/wiki/Antoine_de_Saint-Exup%C3%A9ry
https://youtu.be/1SqfJo47kMA
https://youtu.be/tz89XTBby-M
https://aws.amazon.com/activate/founders/
https://aws.amazon.com/builders-library/
https://www.amazon.science/publications
https://www.linkedin.com/in/rupakg
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPDaniel Zivkovic
Enterprises traditionally think of App Platforms as PCF (Pivotal Cloud Foundry) or Red Hat OpenShift. In reality, public Clouds have evolved into Application Platforms - especially when using Managed Services & Serverless.
• If you are an IT Executive under increased pressure to cut costs, see how better Technology Stack choices (not layoffs or pay cuts) can reduce IT costs and increase business agility, while avoiding vendor lock-in.
• If you are a Developer lost in the sea of the Cloud Computing choices, watch Ray Tsang (Java Champion from GCP) live-code, and you will walk away Cloud-Native :)
See how to stop cannibalization of IT by deploying your good ol' Java Spring Boot Apps directly to Google Cloud Platform - no Servers/PCF/OpenShift/Kubernetes to manage, nor to limit your creativity: https://youtu.be/2B0wWagE0dc
P.S. For more forward-looking Software Development topics, join ServerlessToronto.org Meetups, and if you have any questions about the Architectural Patterns discussed, reach out to me for a chat.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though, on the surface, ‘java.lang.OutOfMemoryError’ appears to be one single error, under the hood there are 9 distinct types of OutOfMemoryError. Each type has different causes, diagnosis approaches, and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Paketo Buildpacks: the best way to build OCI images? DevopsDa...Anthony Dahanne
Buildpacks have been around for more than 10 years! At first, they were used to detect and build an application before deploying it to certain PaaS platforms. More recently, their latest generation, Cloud Native Buildpacks (a CNCF incubating project), lets us build Docker (OCI) images. Are they a good alternative to the Dockerfile? What are the Paketo buildpacks? Which communities support them, and how?
Come find out in this ignite session.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden, India, and Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc. I didn't get rich from it, but they reached 63K downloads (possibly powering tens of thousands of websites).
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. This is where custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntelBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
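For readers who want a feel for what solving the Helmholtz equation involves without installing OpenFOAM, here is a small, self-contained finite-difference sketch in pure Python (unrelated to the helmholtzFoam source code): it solves u'' + k²u = f on (0, 1) with zero boundary values via the Thomas algorithm and checks the result against a manufactured solution.

```python
import math

def solve_helmholtz_1d(k: float, f, n: int = 200):
    """Solve u'' + k^2 u = f on (0,1), u(0)=u(1)=0, by central differences."""
    h = 1.0 / (n + 1)
    x = [(i + 1) * h for i in range(n)]  # interior grid points
    # Tridiagonal system: (1/h^2) u_{i-1} + (k^2 - 2/h^2) u_i + (1/h^2) u_{i+1} = f_i
    a = [1.0 / h**2] * n        # sub-diagonal
    b = [k**2 - 2.0 / h**2] * n # main diagonal
    c = [1.0 / h**2] * n        # super-diagonal
    d = [f(xi) for xi in x]
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n):
        m = a[i] / b[i - 1]
        b[i] -= m * c[i - 1]
        d[i] -= m * d[i - 1]
    u = [0.0] * n
    u[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        u[i] = (d[i] - c[i] * u[i + 1]) / b[i]
    return x, u

# Manufactured solution u(x) = sin(pi x)  =>  f = (k^2 - pi^2) sin(pi x)
k = 1.0
x, u = solve_helmholtz_1d(k, lambda t: (k**2 - math.pi**2) * math.sin(math.pi * t))
err = max(abs(ui - math.sin(math.pi * xi)) for xi, ui in zip(x, u))
print(f"max error: {err:.2e}")  # second-order accurate: error well below 1e-3
```

The same discretize-then-solve structure underlies helmholtzFoam, just in three dimensions and with OpenFOAM's linear solvers in place of the hand-rolled tridiagonal solve.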
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Intro to PySpark: Python Data Analysis at scale in the Cloud
1. Welcome to ServerlessToronto.org
“Home of Less IT Mess”
Introduce Yourself ☺
- Why are you here?
- Looking for work?
- Offering work?
Our feature presentation “Intro to PySpark” starts at 6:20pm…
2. Serverless is not just about the Tech:
Serverless is New Agile & Mindset
- Serverless Dev (gluing other people’s APIs and managed services)
- We're obsessed with creating business value (meaningful MVPs, products) by helping Startups & empowering Business users!
- We build bridges between the Serverless Community (“Dev leg”) and Front-end & Voice-First folks (“UX leg”), and empower UX developers
- Achieve agility NOT by “sprinting” faster (like in Scrum), but by working smarter (using bigger building blocks and less Ops)
3. Upcoming #ServerlessTO Online Meetups
1. Accelerating with a Cloud Contact Center – Patrick Kolencherry, Sr. Product Marketing Manager, and Karla Nussbaumer, Head of Technical Marketing at Twilio **JULY 9 @ 6pm**
2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE?
19. Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
Introduce the Python + Spark interop
Get you excited about using PySpark
36,000 ft overview: Managed Spark in the Cloud
6/49
29. Data manipulation uses the same vocabulary as SQL
(
    my_table
    .select("id", "first_name", "last_name", "age")
    .where(col("age") > 21)
    .groupby("age")
    .count()
)
13/49
30. Data manipulation uses the same vocabulary as SQL
Slides 30-33 highlight each clause of the same chain in turn, mapping it to its SQL counterpart:
- .select("id", "first_name", "last_name", "age"): select
- .where(col("age") > 21): where
- .groupby("age"): group by
- .count(): count
34. I mean, you can legitimately use
SQL
spark.sql("""
select count(*) from (
select id, first_name, last_name, age
from my_table
where age > 21
)
group by age""")
14/49
35. Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
15/49
36. Data manipulation and machine learning with a fluent API
Slides 36-43 highlight each step of the chain in turn:
- spark.read.text("./data/Ch02/1342-0.txt"): Read a text file.
- .select(F.split(F.col("value"), " ").alias("line")): Select the column value, with each element split (space as a separator). Alias to line.
- .select(F.explode(F.col("line")).alias("word")): Explode each element of line into its own record. Alias to word.
- .select(F.lower(F.col("word")).alias("word")): Lower-case each word.
- .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")): Extract only the first group of lower-case letters from each word.
- .where(F.col("word") != ""): Keep only the records where the word is not the empty string.
- .groupby(F.col("word")): Group by word.
- .count(): Count the number of records in each group.
15/49
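The chain above can be mirrored in plain Python on a tiny in-memory sample, which makes each step easy to check by hand (a sketch only; the sample lines are made up, and `re.match("[a-z']*", ...)` plays the role of `F.regexp_extract`):

```python
import re
from collections import Counter

# Made-up sample standing in for the text file.
lines = [
    "It is a truth universally acknowledged,",
    "that a single man in possession",
]

words = []
for line in lines:
    for token in line.split(" "):  # split each line on spaces
        # Take the first run of lower-case letters (and apostrophes).
        word = re.match("[a-z']*", token.lower()).group(0)
        if word != "":             # keep non-empty words only
            words.append(word)

counts = Counter(words)            # group by word, count each group
```

On the real data, (Py)Spark distributes exactly these steps across the cluster.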
44. Scala is not the only player in town
16/49
48. Summoning PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(
"spark.jars.packages",
("com.google.cloud.spark:"
"spark-bigquery-with-dependencies_2.12:0.16.1")
).getOrCreate()
A SparkSession is your entry point to distributed data manipulation
19/49
49. Summoning PySpark
We create our SparkSession with an optional library to access BigQuery as a data source.
50. Reading data
from functools import reduce
from pyspark.sql import DataFrame

def read_df_from_bq(year):
    return (
        spark.read.format("bigquery")
        .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
        .option("credentialsFile", "bq-key.json")
        .load()
    )

gsod = reduce(
    DataFrame.union,
    [read_df_from_bq(year) for year in range(2010, 2020)],
)
20/49
51. Reading data
We create a helper function, read_df_from_bq, to read our data from BigQuery.
52. Reading data
A DataFrame is a regular Python object.
20/49
57. Any data frame transformation is recorded rather than executed until we need the data.
Then, when we trigger an action, (Py)Spark optimizes the query plan, selects the best
physical plan, and applies the transformations to the data.
24/49
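Plain Python generators follow the same lazy pattern and can make the idea concrete (an analogy only, not Spark's actual machinery):

```python
log = []

def traced(x):
    # Record when an element is actually processed.
    log.append(x)
    return x * 2

# Building the pipeline is the "transformation": nothing runs yet.
pipeline = (traced(x) for x in range(3))
assert log == []

# Consuming it is the "action": only now does the work happen.
result = list(pipeline)
assert result == [0, 2, 4]
assert log == [0, 1, 2]
```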
67. Something a little more complex
import pyspark.sql.functions as F
stations = (
spark.read.format("bigquery")
.option("table", "bigquery-public-data.noaa_gsod.stations")
.option("credentialsFile", "bq-key.json")
.load()
)
# We want the "hottest countries" that have at least 60 measures
answer = (
gsod.join(stations, gsod["stn"] == stations["usaf"])
.where(F.col("country").isNotNull())
.groupBy("country")
.agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count"))
).where(F.col("count") > 12 * 5)
read, join, where, groupby, avg/count, where, orderby, show
26/49
72. Python or SQL?
gsod.createTempView("gsod")
stations.createTempView("stations")

spark.sql("""
select country, avg(temp) avg_temp, count(*) count
from gsod
inner join stations
on gsod.stn = stations.usaf
where country is not null
group by country
having count > (12 * 5)
order by avg_temp desc
""").show(5)
We can then query using SQL without leaving Python!
29/49
73. Python and SQL!
(
spark.sql(
"""
select country, avg(temp) avg_temp, count(*) count
from gsod
inner join stations
on gsod.stn = stations.usaf
group by country"""
)
.where("country is not null")
.where("count > (12 * 5)")
.orderBy("avg_temp", ascending=False)
.show(5)
)
30/49
83. Scalar UDF
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)
# +-----+-------------------+
# | temp| temp_c|
# +-----+-------------------+
# | 37.2| 2.8888888888888906|
# | 85.9| 29.944444444444443|
# | 53.5| 11.944444444444445|
# | 71.6| 21.999999999999996|
# |-27.6|-33.111111111111114|
# +-----+-------------------+
# only showing top 5 rows
A UDF can be used like any PySpark function.
35/49
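The f_to_c UDF itself is defined earlier in the deck and not shown in this excerpt; a plausible sketch is a vectorized Fahrenheit-to-Celsius conversion (the function name and the registration shown in the comment are assumptions):

```python
import pandas as pd

def fahrenheit_to_celsius(degrees: pd.Series) -> pd.Series:
    """(°F - 32) * 5 / 9, vectorized over a pandas Series."""
    return (degrees - 32) * 5 / 9

# In PySpark this could be registered as a vectorized UDF, e.g.:
#   import pyspark.sql.functions as F
#   import pyspark.sql.types as T
#   f_to_c = F.pandas_udf(fahrenheit_to_celsius, T.DoubleType())

celsius = fahrenheit_to_celsius(pd.Series([37.2, 85.9, -27.6]))
```

The values match the first and last rows of the slide's output (2.888…, -33.111…).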
84. Grouped Map UDF
import pandas as pd

def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
"""Returns a simple normalization of the temperature for a site.
If the temperature is constant for the whole window,
defaults to 0.5.
"""
temp = temp_by_day.temp
answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
if temp.min() == temp.max():
return answer.assign(temp_norm=0.5)
return answer.assign(
temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
)
36/49
85. Grouped Map UDF
A regular, fun, harmless function on (pandas) DataFrames
36/49
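Because scale_temperature is ordinary pandas code, it can be exercised on a small pandas DataFrame before handing it to Spark (the sample data is made up; the function body is repeated from the slide so the block is self-contained):

```python
import pandas as pd

def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Min-max normalization of temp, as on the slide."""
    temp = temp_by_day.temp
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    return answer.assign(
        temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
    )

# Made-up sample: three days for one station.
sample = pd.DataFrame({
    "stn": ["010010"] * 3, "year": ["2019"] * 3, "mo": ["01"] * 3,
    "da": ["01", "02", "03"], "temp": [10.0, 20.0, 30.0],
})
scaled = scale_temperature(sample)
```

In Spark, the same function would typically be applied per station with something like gsod.groupby("stn").applyInPandas(scale_temperature, schema=...), where the output schema string is supplied by the caller.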
92. You are not limited library-wise
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T
from sklearn.linear_model import LinearRegression

@F.pandas_udf(T.DoubleType())
def rate_of_change_temperature(
day: pd.Series,
temp: pd.Series
) -> float:
"""Returns the slope of the daily temperature
for a given period of time."""
return (
LinearRegression()
.fit(X=day.astype("int").values.reshape(-1, 1), y=temp)
.coef_[0]
)
39/49
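Stripping the decorator lets the regression logic be checked with plain pandas input; in Spark, the decorated UDF acts as a grouped aggregate and would be used inside .agg() after a groupby (the sample numbers are made up):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def rate_of_change_temperature(day: pd.Series, temp: pd.Series) -> float:
    """Slope of the daily temperature; same body as the UDF above."""
    return (
        LinearRegression()
        .fit(X=day.astype("int").values.reshape(-1, 1), y=temp)
        .coef_[0]
    )

# Made-up sample: temperature rising 2 degrees per day.
slope = rate_of_change_temperature(
    pd.Series([1, 2, 3, 4]), pd.Series([10.0, 12.0, 14.0, 16.0])
)

# In Spark (sketch, assuming the decorated UDF):
#   gsod.groupby("stn", "year", "mo").agg(
#       rate_of_change_temperature(gsod["da"], gsod["temp"]))
```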
106. "Serverless" Spark?
Pros:
- Cost-effective for sporadic runs
- Scales easily
- Simplified maintenance
Cons:
- Easy to become expensive
- Sometimes confusing pricing model
- Uneven documentation
47/49