Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
Lambda architectures, data warehouses, data lakes, on-premise Hadoop deployments, elastic Cloud architecture… We’ve had to deal with most of these at one point or another in our lives when working with data. At Databricks, we have built data pipelines, which leverage these architectures. We work with hundreds of customers who also build similar pipelines. We observed some common pain points along the way: the HiveMetaStore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, there’s not an easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walkway with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Anyscale
Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data. Project Hydrogen is a major Apache Spark initiative to bring the best AI and big data solutions together. It introduced barrier execution mode to Spark 2.4.0 release to help distributed model training, and it explores optimized data exchange to accelerate distributed model inference.
In this talk, we will explain why barrier execution mode is needed, how it works, and how to use it to integrate distributed DL training on Spark. We will demonstrate HorovodRunner, the first Spark+AI integration powered by Project Hydrogen. It is based on the Horovod framework developed by Uber and Databricks Runtime 5.0 for Machine Learning.
We will also share our experience and performance tips on how to combine Pandas UDF from Spark and AI frameworks to scale complex model inference workload.
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
We are in the midst of a Big Data Zeitgeist in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that reacts and interacts with data in real-time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how Structured Streaming Python APIs in Apache Spark 2.x enables writing them. We also will examine the programming model behind Structured Streaming and the APIs that support them. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames, and Datasets APIs.
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
Lambda architectures, data warehouses, data lakes, on-premise Hadoop deployments, elastic Cloud architecture… We’ve had to deal with most of these at one point or another in our lives when working with data. At Databricks, we have built data pipelines, which leverage these architectures. We work with hundreds of customers who also build similar pipelines. We observed some common pain points along the way: the HiveMetaStore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, there’s not an easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walkway with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Anyscale
Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data. Project Hydrogen is a major Apache Spark initiative to bring the best AI and big data solutions together. It introduced barrier execution mode to Spark 2.4.0 release to help distributed model training, and it explores optimized data exchange to accelerate distributed model inference.
In this talk, we will explain why barrier execution mode is needed, how it works, and how to use it to integrate distributed DL training on Spark. We will demonstrate HorovodRunner, the first Spark+AI integration powered by Project Hydrogen. It is based on the Horovod framework developed by Uber and Databricks Runtime 5.0 for Machine Learning.
We will also share our experience and performance tips on how to combine Pandas UDF from Spark and AI frameworks to scale complex model inference workload.
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
We are in the midst of a Big Data Zeitgeist in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that reacts and interacts with data in real-time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how Structured Streaming Python APIs in Apache Spark 2.x enables writing them. We also will examine the programming model behind Structured Streaming and the APIs that support them. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames, and Datasets APIs.
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Databricks
When you run an Apache Spark application on a large cluster, you want to make sure you’re getting the most from that cluster. Any CPU or memory left on the table represents either a waste of money or a lost opportunity to speed up your Spark jobs. What many people don’t realize is how sensitive Spark cluster utilization is to the resource manager. Resource managers decide how to allocate cluster resources among the many users and applications contending for them. In this deep dive session, we will discuss how Spark integrates with two common open source resource managers, YARN and Mesos, as well as a new commercial product called IBM Spectrum Conductor with Spark. You will learn how resource managers arbitrate resources in multi-user/multi-tenant Spark clusters, and how this affects application performance. You will come away with new techniques for tuning Spark resource management to optimize goals like speed and fairness. The session will include a demo of a new open source benchmark designed to help analyse Spark multi-user/multi-tenant performance. The benchmark uses Spark SQL and machine learning jobs to load the cluster, and can be used during a pre-production cycle to tune Spark and resource manager configurations.
Big Data visualization with Apache Spark and Zeppelinprajods
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin(incubator). Zeppelin is the open source tool for data discovery, exploration and visualization. It supports REPLs for shell, SparkSQL, Spark(scala), python and angular. This presentation was made on the Big Data Day, at the Great Indian Developer Summit, Bangalore, April 2015
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of soon to be released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
The RAPIDS Accelerator for Apache Spark is a plugin that enables the power of GPUs to be leveraged in Spark DataFrame and SQL queries, improving the performance of ETL pipelines. User-defined functions (UDFs) in the query appear as opaque transforms and can prevent the RAPIDS Accelerator from processing some query operations on the GPU.
This presentation discusses how users can leverage the RAPIDS Accelerator UDF Compiler to automatically translate some simple UDFs to equivalent Catalyst operations that are processed on the GPU. The presentation also covers how users can provide a GPU version of Scala, Java, or Hive UDFs for maximum control and performance. Sample UDFs for each case will be shown along with how the query plans are impacted when the UDFs are processed on the GPU.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.
Making Apache Spark Better with Delta LakeDatabricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and design choices that lead to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Clipper: A Low-Latency Online Prediction Serving SystemDatabricks
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy serving loads. However, most machine learning frameworks and systems only address model training and not deployment.
Clipper is a general-purpose model-serving system that addresses these challenges. Interposing between applications that consume predictions and the machine-learning models that produce predictions, Clipper simplifies the model deployment process by isolating models in their own containers and communicating with them over a lightweight RPC system. This architecture allows models to be deployed for serving in the same runtime environment as that used during training. Further, it provides simple mechanisms for scaling out models to meet increased throughput demands and performing fine-grained physical resource allocation for each model.
In this talk, I will provide an overview of the Clipper serving system and then discuss how to get started using Clipper to serve Spark and TensorFlow models in a production serving environment.
Advanced Natural Language Processing with Apache Spark NLPDatabricks
NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
A presentation cum workshop on Real time Analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging while other side Spark Streaming brings Spark's language-integrated API to stream processing, allows to write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, Zookeeper and Spark with a Web click streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is an Apache Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk, we discuss the philosophy behind Deep Learning Pipelines, as well as the main tools it provides, how they fit into the deep learning ecosystem, and how they demonstrate Spark's role in deep learning.
Building Robust, Adaptive Streaming Apps with Spark StreamingDatabricks
As the adoption of Spark Streaming increases rapidly, the community has been asking for greater robustness and scalability from Spark Streaming applications in a wider range of operating environments. To fulfill these demands, we have steadily added a number of features in Spark Streaming. We have added backpressure mechanisms which allows Spark Streaming to dynamically adapt to changes in incoming data rates, and maintain stability of the application. In addition, we are extending Spark’s Dynamic Allocation to Spark Streaming, so that streaming applications can elastically scale based on processing requirements. In my talk, I am going to explore these mechanisms and explain how developers can write robust, scalable and adaptive streaming applications using them. Presented by Tathagata "TD" Das from Databricks.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting problems that our users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.
R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. SparkR’s evolving interface to Apache Spark offers a wide range of APIs and capabilities to Data Scientists and Statisticians. With the release of Spark 2.0, and subsequent releases, the R API officially supports executing user code on distributed data. This is done primarily through a family of apply() functions.
In this Data Science Central webinar, we will explore the following:
●Provide an overview of this new functionality in SparkR.
●Show how to use this API with some changes to regular code with dapply().
●Focus on how to correctly use this API to parallelize existing R packages.
●Consider performance and examine correctness when using the apply family of functions in SparkR.
Speaker: Hossein Falaki, Software Engineer -- Databricks Inc.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Databricks
When you run an Apache Spark application on a large cluster, you want to make sure you’re getting the most from that cluster. Any CPU or memory left on the table represents either a waste of money or a lost opportunity to speed up your Spark jobs. What many people don’t realize is how sensitive Spark cluster utilization is to the resource manager. Resource managers decide how to allocate cluster resources among the many users and applications contending for them. In this deep dive session, we will discuss how Spark integrates with two common open source resource managers, YARN and Mesos, as well as a new commercial product called IBM Spectrum Conductor with Spark. You will learn how resource managers arbitrate resources in multi-user/multi-tenant Spark clusters, and how this affects application performance. You will come away with new techniques for tuning Spark resource management to optimize goals like speed and fairness. The session will include a demo of a new open source benchmark designed to help analyse Spark multi-user/multi-tenant performance. The benchmark uses Spark SQL and machine learning jobs to load the cluster, and can be used during a pre-production cycle to tune Spark and resource manager configurations.
Big Data visualization with Apache Spark and Zeppelinprajods
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin(incubator). Zeppelin is the open source tool for data discovery, exploration and visualization. It supports REPLs for shell, SparkSQL, Spark(scala), python and angular. This presentation was made on the Big Data Day, at the Great Indian Developer Summit, Bangalore, April 2015
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of soon to be released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
The RAPIDS Accelerator for Apache Spark is a plugin that enables the power of GPUs to be leveraged in Spark DataFrame and SQL queries, improving the performance of ETL pipelines. User-defined functions (UDFs) in the query appear as opaque transforms and can prevent the RAPIDS Accelerator from processing some query operations on the GPU.
This presentation discusses how users can leverage the RAPIDS Accelerator UDF Compiler to automatically translate some simple UDFs to equivalent Catalyst operations that are processed on the GPU. The presentation also covers how users can provide a GPU version of Scala, Java, or Hive UDFs for maximum control and performance. Sample UDFs for each case will be shown along with how the query plans are impacted when the UDFs are processed on the GPU.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.
Making Apache Spark Better with Delta LakeDatabricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and design choices that lead to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Clipper: A Low-Latency Online Prediction Serving SystemDatabricks
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy serving loads. However, most machine learning frameworks and systems only address model training and not deployment.
Clipper is a general-purpose model-serving system that addresses these challenges. Interposing between applications that consume predictions and the machine-learning models that produce predictions, Clipper simplifies the model deployment process by isolating models in their own containers and communicating with them over a lightweight RPC system. This architecture allows models to be deployed for serving in the same runtime environment as that used during training. Further, it provides simple mechanisms for scaling out models to meet increased throughput demands and performing fine-grained physical resource allocation for each model.
In this talk, I will provide an overview of the Clipper serving system and then discuss how to get started using Clipper to serve Spark and TensorFlow models in a production serving environment.
Advanced Natural Language Processing with Apache Spark NLPDatabricks
NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
A presentation cum workshop on Real time Analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging while other side Spark Streaming brings Spark's language-integrated API to stream processing, allows to write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, Zookeeper and Spark with a Web click streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is an Apache Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk, we discuss the philosophy behind Deep Learning Pipelines, as well as the main tools it provides, how they fit into the deep learning ecosystem, and how they demonstrate Spark's role in deep learning.
Building Robust, Adaptive Streaming Apps with Spark StreamingDatabricks
As the adoption of Spark Streaming increases rapidly, the community has been asking for greater robustness and scalability from Spark Streaming applications in a wider range of operating environments. To fulfill these demands, we have steadily added a number of features in Spark Streaming. We have added backpressure mechanisms which allows Spark Streaming to dynamically adapt to changes in incoming data rates, and maintain stability of the application. In addition, we are extending Spark’s Dynamic Allocation to Spark Streaming, so that streaming applications can elastically scale based on processing requirements. In my talk, I am going to explore these mechanisms and explain how developers can write robust, scalable and adaptive streaming applications using them. Presented by Tathagata "TD" Das from Databricks.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting problems that our users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.
R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. SparkR’s evolving interface to Apache Spark offers a wide range of APIs and capabilities to Data Scientists and Statisticians. With the release of Spark 2.0, and subsequent releases, the R API officially supports executing user code on distributed data. This is done primarily through a family of apply() functions.
In this Data Science Central webinar, we will explore the following:
●Provide an overview of this new functionality in SparkR.
●Show how to use this API with some changes to regular code with dapply().
●Focus on how to correctly use this API to parallelize existing R packages.
●Consider performance and examine correctness when using the apply family of functions in SparkR.
Speaker: Hossein Falaki, Software Engineer -- Databricks Inc.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Apache Spark presentation at HasGeek FifthElelephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiData Con LA
Abstract:- Of all the developers delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set as best practices, outline its performance and optimization benefits, and underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them.
Real-Time Analytics with Apache Cassandra and Apache Spark, par Guido Schmutz, Technology Manager Trivadis. Conférence donnée dans le cadre du Swiss Data Forum à Lausanne le 24 novembre
Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz
Time series data is everywhere: IoT, sensor data, financial transactions. The industry has moved to databases like Cassandra to handle the high velocity and high volume of data that is now common place. However data is pointless without being able to process it in near real time. That's where Spark combined with Cassandra comes in! What was one just your storage system (Cassandra) can be transformed into an analytics system and it's really surprising how easy it is!
An Engine to process big data in faster(than MR), easy and extremely scalable way. An Open Source, parallel, in-memory processing, cluster computing framework. Solution for loading, processing and end to end analyzing large scale data. Iterative and Interactive : Scala, Java, Python, R and with Command line interface.
Unified Big Data Processing with Apache SparkC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
http://www.agildata.com/agildata-hosts-big-data-meetup-featuring-apache-spark/
Slides for talks given at the Denver Java Users Group, Boulder Java Users Group, Denver/Boulder Big Data Users Group.
Dan and Andy will spend an evening rolling up our sleeves with you to try out some real-world use cases for Apache Spark.
We’ll cover Spark’s RDD API, the DataFrame API, as well as the brand new Dataset API.
Similar to Jump Start on Apache Spark 2.2 with Databricks (20)
Modern machine learning (ML) workloads, such as deep learning and large-scale model training, are compute-intensive and require distributed execution. Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales Python applications and ML workloads from a laptop to a cluster, with an emphasis on the unique performance challenges of ML/AI systems. It is now used in many production deployments.
This talk will cover Ray’s overview, architecture, core concepts, and primitives, such as remote Tasks and Actors; briefly discuss Ray native libraries (Ray Tune, Ray Train, Ray Serve, Ray Datasets, RLlib); and Ray’s growing ecosystem.
Through a demo using XGBoost for classification, we will demonstrate how you can scale training, hyperparameter tuning, and inference—from a single node to a cluster, with tangible performance difference when using Ray.
The takeaways from this talk are :
Learn Ray architecture, core concepts, and Ray primitives and patterns
Why Distributed computing will be the norm not an exception
How to scale your ML workloads with Ray libraries:
Training on a single node vs. Ray cluster, using XGBoost with/without Ray
Hyperparameter search and tuning, using XGBoost with Ray Tune
Inferencing at scale, using XGBoost with/without Ray
MLflow is aiming to stabilize its API in version 1.0 this spring and add a number of other new features. In this talk, we'll share some of the features we have in mind for the rest of the year. These include ongoing work such as a database store for the tracking server and Docker project packaging, as well as new improvements in multi-step workflows, model management, and production monitoring.
Apache Spark simplifies AI, but why not use AI to simplify Spark performance and operations management? An AI-driven approach can drastically reduce the time Spark application developers and operations teams spend troubleshooting problems.
This talk will discuss algorithms that run real-time streaming pipelines as well as build ML models in batch to enable Spark users to automatically solve problems like: (i) fixing a failed Spark application, (ii) auto tuning SLA-bound Spark streaming pipelines, (iii) identifying the best broadcast joins and caching for SparkSQL queries and tables, (iv) picking cost-effective machine types and container sizes to run Spark workloads on the AWS, Azure, and Google cloud; and more.
At Databricks, we manage Spark clusters for customers to run various production workloads. In this talk, we share our experiences in building a real-time monitoring system for thousands of Spark nodes, including the lessons we learned and the value we’ve seen from our efforts so far.
The was part of the talk presented at #monitorSF Meetup held at Databricks HQ in SF.
Monitoring Large-Scale Apache Spark Clusters at DatabricksAnyscale
At Databricks, we manage Apache Spark clusters for customers to run various production workloads. In this talk, we share our experiences in building a real-time monitoring system for thousands of Spark nodes, including the lessons we learned and the value we’ve seen from our efforts so far.
The was part of the talk presented at #monitorSF Meetup held at Databricks HQ in SF.
Last year, in Apache Spark 2.0, we introduced Structured Steaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing application. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Dataset and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data, and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0 we've been hard at work building first class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse or arriving in real-time from Kafka/Kinesis.
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do I deploy these model to a production environment? How do I embed what I have learned into customer facing data applications?
In this webinar, we will discuss best practices from Databricks on
how our customers productionize machine learning models
do a deep dive with actual customer case studies,
show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
A Deep Dive into Structured Streaming in Apache Spark Anyscale
This presentation was given at Apache Spark Meetup in Milano by Databricks software engineer and Apache Spark contributor Burak Yavuz. It covers how to write end-to-end, fault-tolerant continuous application using Structured Streaming APIs available in Apache Spark 2.x
Continuous Application with Structured Streaming 2.0Anyscale
Introduction to Continuous Application with Apache Spark 2.0 Structured Streaming. This presentation is a culmination and curation from talks and meetups presented by Databricks engineers.
The notebooks on Structured Streaming demonstrates aspects of the Structured Streaming APIs
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Crescat
Crescat is industry-trusted event management software, built by event professionals for event professionals. Founded in 2017, we have three key products tailored for the live event industry.
Crescat Event for concert promoters and event agencies. Crescat Venue for music venues, conference centers, wedding venues, concert halls and more. And Crescat Festival for festivals, conferences and complex events.
With a wide range of popular features such as event scheduling, shift management, volunteer and crew coordination, artist booking and much more, Crescat is designed for customisation and ease-of-use.
Over 125,000 events have been planned in Crescat and with hundreds of customers of all shapes and sizes, from boutique event agencies through to international concert promoters, Crescat is rigged for success. What's more, we highly value feedback from our users and we are constantly improving our software with updates, new features and improvements.
If you plan events, run a venue or produce festivals and you're looking for ways to make your life easier, then we have a solution for you. Try our software for free or schedule a no-obligation demo with one of our product specialists today at crescat.io
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Graspan: A Big Data System for Big Code AnalysisAftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
In the ever-evolving landscape of technology, enterprise software development is undergoing a significant transformation. Traditional coding methods are being challenged by innovative no-code solutions, which promise to streamline and democratize the software development process.
This shift is particularly impactful for enterprises, which require robust, scalable, and efficient software to manage their operations. In this article, we will explore the various facets of enterprise software development with no-code solutions, examining their benefits, challenges, and the future potential they hold.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
E-commerce Application Development Company.pdfHornet Dynamics
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Jump Start on Apache Spark 2.2 with Databricks
1. Jump Start on Apache®
Spark™ 2.2 with Databricks
Jules S. Damji
Apache Spark Community Evangelist
Spark Saturday DC , Nov 11, 2017
Big Data Madison @ Madison College
5. Morning Afternoon
Agenda for the day
• Introduction to DataFrames, Datasets
and Spark SQL
• Workshop Notebook 2
• Break
• Introduction to StructuredStreaming
Concepts
• Workshop Notebook 3
• Go Home…Watch the Game!
• Get to know Databricks
• Overview of Spark
Fundamentals & Architecture
• What’s New in Spark 2.x
• Break
• Unified APIs:SparkSessions,
SQL, DataFrames, Datasets…
• Workshop Notebook [0,1]
• Lunch
6. Get to know Databricks
1. Get http://databricks.com/try-databricks
2. https://github.com/dmatrix/spark-saturday
3. [OR] ImportNotebook: http://dbricks.co/ss_wkshp0
8. Big Data Systems of Yesterday…
MapReduce/Hadoop
Generalbatch
processing
Drill
Storm
Pregel Giraph
Dremel Mahout
Storm Impala
Drill . . .
Specialized systems
for newworkloads
Hard to combine in pipelines
13. TEAM
About Databricks
Started Spark project (now Apache Spark) at UC Berkeleyin 2009
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
14. Accelerate innovation by
unifying data science,
engineering and business.
Unified Analytics
Platform
UNIFIED
INFRASTRUCTURE
UNIFIED
EXPERIENCE
ACROSS TEAMS
UNIFIED
ANALYTIC
WORKFLOWS
39. A Brief History
39
2012
Started
@
UC Berkeley
2010 2013
Databricks
started
& donated
to ASF
2014
Spark 1.0 & libraries
(SQL, ML,GraphX)
2015
DataFrames/Datasets
Tungsten
Catalyst Optimizer
ML Pipelines
2016-17
Apache Spark 2.0,2.1,2.2
Structured Streaming
Cost Based Optimizer
Deep Learning Pipelines Easier
Smarter
Faster
40. Apache Spark 2.X
• Steps to Bigger & Better Things….
Builds on all we learned in past 2 years
41. Major Themes in Apache Spark 2.x
TungstenPhase 2
speedupsof 5-10x
& Catalyst Optimizer
Faster
StructuredStreaming
real-time engine
on SQL / DataFrames
Smarter
Unifying Datasets
and DataFrames &
SparkSessions
Easier
42. Unified API Foundation for
the Future: SparkSessions,
DataFrame, Dataset, MLlib,
Structured Streaming
43. SparkSession – A Unified entry point to
Spark
• Conduit to Spark
– Creates Datasets/DataFrames
– Reads/writes data
– Works with metadata
– Sets/gets Spark
Configuration
– Driver uses for Cluster
resource management
47. Long Term
• RDD as the low-level API in Spark
• For control and certain type-safety in Java/Scala
• Datasets & DataFrames give richer semantics &
optimizations
• For semi-structured data and DSL like operations
• New libraries will increasingly use these as interchange
format
• Examples: Structured Streaming, MLlib, GraphFrames,
and Deep Learning Pipelines
50. Towards SQL 2003
• Today, Spark can run all 99 TPC-DS queries!
- New standard compliant parser (with good error messages!)
- Subqueries (correlated & uncorrelated)
- Approximate aggregate stats
- https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-
2-0.html
52. Other notable API improvements
• DataFrame-based ML pipeline API becoming the main MLlib API
• ML model & pipeline persistence with almost complete coverage
• In all programminglanguages: Scala, Java, Python, R
• Improved R support
• (Parallelizable) User-defined functions in R
• Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-
Means
• Structured Streaming Features & Production Readiness
• https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html
53. Workshop: Notebook on RDD & SparkSession
• ImportNotebook into your Spark 2.2 Cluster
– http://dbricks.co/ss_wkshp0
– http://dbricks.co/ss_wkshp1
– http://docs.databricks.com
– http://spark.apache.org/docs/latest/api/scala/index.html#or
g.apache.spark.sql.SparkSession
• Familiarizeyour self with Databricks Notebook environment
• Work through each cell
• CNTR + <return>/ Shift+ Return
• Try challenges
• Break…
55. The not so secret truth…
SQL
is not about SQL
is about more thanSQL
56. 10
Is About Creating and Running Spark Programs
Faster:
• Write less code
• Read less data
• Do less work
• optimizerdoes the hard work
Spark SQL: The wholestory
58. 58
Using Catalyst in Spark SQL
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
RDDs
Selected
Physical Plan
Analysis
Logical
Optimization
Physical
Planning
CostModel
Physical
Plans
Code
Generation
Catalog
Analysis: analyzinga logicalplan to resolve references
Logical Optimization: logicalplan optimization
Physical Planning: Physical planning
Code Generation:Compileparts of the query to Java bytecode
SQL AST
DataFrame
Datasets
59. LOGICAL OPTIMIZATIONS PHYSICAL OPTIMIZATIONS
Catalyst Optimizations
• Catalyst compiles operations into
physical plan for execution and
generates JVM byte code
• Intelligently choose between
broadcast joins and shuffle joins to
reduce network traffic
• Lower level optimizations:
eliminate expensive object
allocations and reduce virtual
functions calls
• Push filter predicate down to data
source, so irrelevant data can be
skipped
• Parquet: skip entire blocks, turn
comparisons into cheaper integer
comparisons via dictionary coding
• RDMS: reduce amount of data traffic
by pushing down predicates
65. Background: What is in an RDD?
•Dependencies
• Partitions (with optional localityinfo)
• Compute function: Partition =>Iterator[T]
Opaque Computation
& Opaque Data
66. Structured APIs In Spark
66
SQL DataFrames Datasets
Syntax
Errors
Analysis
Errors
Runtime Compile
Time
Runtime
Compile
Time
Compile
Time
Runtime
Analysis errors are reported before a distributed job starts
68. DataFrame API code.
// convert RDD -> DF with column names
val df = parsedRDD.toDF("project", "page", "numRequests")
//filter, groupBy, sum, and then agg()
df.filter($"project" === "en").
groupBy($"page").
agg(sum($"numRequests").as("count")).
limit(100).
show(100)
project page numRequests
en 23 45
en 24 200
69. Take DataFrame à SQL Table à Query
df. createOrReplaceTempView(("edits")
val results = spark.sql("""SELECT page, sum(numRequests)
AS count FROM edits WHERE project = 'en' GROUP BY page
LIMIT 100""")
results.show(100)
project page numRequests
en 23 45
en 24 200
70. Easy to write code... Believe it!
from pyspark.sql.functions import avg
dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])
dataDF = dataRDD.toDF(["name", "age"])
# Using RDD code to compute aggregate average
(dataRDD.map(lambda (x,y): (x, (y,1))) .reduceByKey(lambda x,y: (x[0] +y[0], x[1]
+y[1])) .map(lambda (x, (y, z)): (x, y / z)))
# Using DataFrame
dataDF.groupBy("name").agg(avg("age"))
name age
Jim 20
Ann 31
Jim 30
71. Why structure APIs?
data.map { case (dept, age) => dept -> (age, 1) }
.reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2)}
.map { case (dept, (age, c)) => dept -> age / c }
select dept, avg(age) from data group by 1
SQL
DataFrame
RDD
data.groupBy("dept").avg("age")
72. Type-safe:operate
on domain objects
with compiled
lambda functions
8
Dataset API in Spark 2.x
v a l d f = s p a r k .r e ad.j s on( "pe opl e.js on ")
/ / Convert data to domain o b j e c ts .
case c l a s s Person(name: S tr i n g , age: I n t )
v a l d s : Dataset[Person] = d f.a s [P e r s on ]
v a l fi l te r D S = d s . f i l t e r ( p = > p . a g e > 30)
87. Complexities in stream processing
COMPLEX DATA
Diverse data formats
(json, avro, binary, …)
Data can be dirty,
late, out-of-order
COMPLEX SYSTEMS
Diverse storage systems
(Kafka, S3, Kinesis, RDBMS, …)
System failures
COMPLEX WORKLOADS
Combining streaming with
interactive queries
Machine learning
88. Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems
92. DataFrames,
Datasets, SQL
input = spark.readStream
.format("kafka")
.option("subscribe", "topic")
.load()
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")
Logical
Plan
Read from
Kafka
Project
device, signal
Filter
signal > 15
Writeto
Parquet
Apache Spark automatically streamifies!
Spark SQL converts batch-like query to a series of incremental
execution plans operating on new batches of data
Series of Incremental
Execution Plans
Kafka
Source
Optimized
Operator
codegen, off-
heap, etc.
Parquet
Sink
Optimized
Physical Plan
process
newdata
t = 1 t = 2 t = 3
process
newdata
process
newdata
94. Anatomy of a Streaming Query: Step 1
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.
Source
• Specify one or more locations
to read data from
• Built in support for
Files/Kafka/Socket,
pluggable.
95. Anatomy of a Streaming Query: Step 2
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
Transformation
• Using DataFrames,Datasets and/or
SQL.
• Internal processingalways exactly-
once.
96. Anatomy of a Streaming Query: Step 3
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Sink
• Accepts the output of each
batch.
• When supported sinks are
transactional and exactly
once (Files).
• Use foreach to execute
arbitrary code.
97. Anatomy of a Streaming Query: Output Modes
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode("update")
.option("checkpointLocation", "…")
.start()
Output mode – What's output
• Complete – Output the whole answer
every time
• Update – Output changed rows
• Append– Output new rowsonly
Trigger – When to output
• Specifiedas a time, eventually
supportsdata size
• No trigger means as fast as possible
98. Anatomy of a Streaming Query: Checkpoint
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode("update")
.option("checkpointLocation", "…")
.start()
Checkpoint
• Tracks the progress of a
query in persistent storage
• Can be used to restart the
query if there is a failure.
99. Fault-tolerance with Checkpointing
Checkpointing – tracks progress
(offsets) of consuming data from
the source and intermediate state.
Offsets and metadata saved as JSON
Can resume after changing your
streaming transformations
end-to-end
exactly-once
guarantees
process
newdata
t = 1 t = 2 t = 3
process
newdata
process
newdata
write
ahead
log
101. Traditional ETL
Raw, dirty, un/semi-structured is data dumped as files
Periodic jobs run every few hours to convert raw data
to structured data ready for further analytics
101
file
dump
seconds hours
table
10101010
102. Traditional ETL
Hours of delay before taking decisions on latest data
Unacceptable when time is of essence
[intrusion detection, anomaly detection, etc.]
file
dump
seconds hours
table
10101010
103. Streaming ETL w/ Structured Streaming
Structured Streaming enables raw data to be available
as structured data as soon as possible
103
seconds
table
10101010
104. Streaming ETL w/ Structured Streaming
Example
Json data being received in Kafka
Parse nested json and flatten it
Store in structured Parquet table
Get end-to-end failure guarantees
val rawData = spark.readStream
.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
val parsedData = rawData
.selectExpr("cast (value as string) as json"))
.select(from_json("json", schema).as("data"))
.select("data.*")
val query = parsedData.writeStream
.option("checkpointLocation", "/checkpoint")
.partitionBy("date")
.format("parquet")
.start("/parquetTable")
105. Reading from Kafka
Specify options to configure
How?
kafka.boostrap.servers => broker1,broker2
What?
subscribe => topic1,topic2,topic3 // fixed list of topics
subscribePattern => topic* // dynamic list of topics
assign => {"topicA":[0,1] } // specific partitions
Where?
startingOffsets => latest(default) / earliest / {"topicA":{"0":23,"1":345} }
val rawData = spark.readStream
.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
106. Reading from Kafka
val rawDataDF = spark.readStream
.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
rawData dataframe has
the following columns
key value topic partition offset timestamp
[binary] [binary] "topicA" 0 345 1486087873
[binary] [binary] "topicB" 3 2890 1486086721
107. Transforming Data
Cast binary value to string
Name it column json
val parsedDataDF = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.select("data.*")
108. Transforming Data
Cast binary value to string
Name it column json
Parse json string and expand into
nested columns, name it data
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.select("data.*")
json
{ "timestamp": 1486087873, "device": "devA", …}
{ "timestamp": 1486082418, "device": "devX", …}
data (nested)
timestamp device …
1486087873 devA …
1486086721 devX …
from_json("json")
as "data"
109. Transforming Data
Cast binary value to string
Name it column json
Parse json string and expand into
nested columns, name it data
Flatten the nested columns
val parsedDataDF = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.select("data.*")
data (nested)
timestamp device …
1486087873 devA …
1486086721 devX …
timestamp device …
1486087873 devA …
1486086721 devX …
select("data.*")
(not nested)
110. Transforming Data
Cast binary value to string
Name it column json
Parse json string and expand into
nested columns, name it data
Flatten the nested columns
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.select("data.*")
powerful built-in APIs to
performcomplex data
transformations
from_json, to_json, explode,...
100s offunctions
(see our blogpost & tutorial)
111. Writing to
Save parsed data as Parquet
table in the given path
Partition files by date so that
future queries on time slices of
data is fast
e.g. query on last 48 hours of data
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.partitionBy("date")
.format("parquet")
.start("/parquetTable") //pathname
112. Checkpointing
Enable checkpointing by
setting the checkpoint
location to save offset logs
start actually starts a
continuous running
StreamingQuery in the
Spark cluster
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.format("parquet")
.partitionBy("date")
.start("/parquetTable/")
113. Streaming Query
query is a handle to the continuously
running StreamingQuery
Used to monitor and manage the
execution
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.format("parquet")
.partitionBy("date")
.start("/parquetTable")/")
process
newdata
t = 1 t = 2 t = 3
process
newdata
process
newdata
StreamingQuery
114. More Kafka Support [Spark 2.2]
Write out to Kafka
DataFrame must have binary fields
named key and value
Direct, interactive and batch
queries on Kafka
Makes Kafka even more powerful
as a storage platform!
result.writeStream
.format("kafka")
.option("topic", "output")
.start()
val df = spark
.read // not readStream
.format("kafka")
.option("subscribe", "topic")
.load()
df.createOrReplaceTempView("topicData")
spark.sql("select value from topicData")
117. Event Time
Many use cases require aggregate statistics by event time
E.g. what's the #errors in each system in the 1 hour windows?
Many challenges
Extractingevent time from data, handling late, out-of-order data
DStream APIs were insufficient for event-time stuff
118. Event time Aggregations
Windowing is just another type of grouping in Struct.
Streaming
number of records every hour
Support UDAFs!
parsedData
.groupBy(window("timestamp","1 hour"))
.count()
parsedData
.groupBy(
"device",
window("timestamp","10 mins"))
.avg("signal")
avg signal strength of each
device every 10 mins
119. Stateful Processing for Aggregations
Aggregates have to be saved as
distributed state between triggers
Each trigger reads previous state and
writes updated state
State stored in memory,
backed by write ahead log in HDFS/S3
Fault-tolerant, exactly-once guarantee!
process
newdata
t = 1
sink
src
t = 2
process
newdata
sink
src
t = 3
process
newdata
sink
src
state state
write
ahead
log
state updates
are written to
log for checkpointing
state
120. Automatically handles Late Data
12:00 - 13:00 1 12:00 - 13:00 3
13:00 - 14:00 1
12:00 - 13:00 3
13:00 - 14:00 2
14:00 - 15:00 5
12:00 - 13:00 5
13:00 - 14:00 2
14:00 - 15:00 5
15:00 - 16:00 4
12:00 - 13:00 3
13:00 - 14:00 2
14:00 - 15:00 6
15:00 - 16:00 4
16:00 - 17:00 3
13:00 14:00 15:00 16:00 17:00Keeping state allows
late data to update
counts of old windows
red = state updated
with late data
But size of the state increasesindefinitely
if old windows are notdropped
121. Watermarking
max eventtime
event time
watermark
allowed
lateness
of 10 mins
parsedDataDF
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
late data
allowedto
aggregate
data too
late,
dropped
Useful only in stateful operations
(streaming aggs, dropDuplicates,mapGroupsWithState,...)
Ignored in non-stateful streaming
queries and batch queries
123. Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState
allows any user-defined
stateful function to a
user-defined state
Direct support for per-key
timeouts in event-time or
processing-time
Supports Scala and Java
123
ds.groupByKey(_.id)
.mapGroupsWithState
(timeoutConf)
(mappingWithStateFunc)
def mappingWithStateFunc(
key: K,
values: Iterator[V],
state: GroupState[S]): U = {
// update or remove state
// set timeouts
// return mapped value
}
125. Other interestingoperations
Streaming Deduplication
Watermarks to limit state
Stream-batch Joins
Stream-stream Joins
Can use mapGroupsWithState
Direct support coming soon
Apache Spark 2.3!
val batchDataDF = spark.read
.format("parquet")
.load("/additional-data")
//join with stream DataFrame
parsedDataDF.join(batchData, "device")
parsedDataDF.dropDuplicates("eventId")
127. More Info
Structured Streaming Programming Guide
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Anthology of Technical Assets on Structured Streaming
http://dbricks.co/ss_anthology
128. Resources
• Getting Started Guide with Apache Spark on Databricks
• docs.databricks.com
• Spark Programming Guide
• Structured Streaming Programming Guide
• Databricks Engineering Blogs
• sparkhub.databricks.com
• spark-packages.org