At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Cloud Experience: Data-driven Applications Made Simple and FastDatabricks
A complex real-time data workflow implementation is very challenging. This session will describe the architecture of a data platform that provides a single, secure, high-performance system that can be deployed in a hybrid cloud architectures. We will present how to support simultaneous, consistent and high-performance access through multiple industry open source and cloud compatible standards of streaming, table, TSDB, object, and file APIs. A new serverless technology is also used in the architecture to support a dynamic and flexible implementations. The presenter will also outline how the platform was integrated with the Spark eco-system, including AI and ML tools, to simplify the development process
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkDatabricks
Around the world, businesses are turning to AI to transform the way they operate and serve their customers. But before they can implement these technologies, companies must address the roadblock of moving from batch analytics to making real-time decisions by rapidly accessing and analyzing the relevant information amidst a sea of data. Yaron will explain how to make Spark handle multivariate real-time, historical and event data simultaneously to provide immediate and intelligent responses. He will present several time sensitive use-cases including fraud detection, prevention of outages and customer recommendations to demonstrate how to perform predictive analytics and real-time actions with Spark.
Speaker: Yaron Ekshtein
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...Databricks
"In the oil and gas industry, utilizing vast amounts of data has long been identified as an important indicator of operational performance. The measurement of key performance indicators is a routine practice in well construction, but a systematic way of statistically analyzing performance against a large data bank of offset wells is not a common practice. The performance of statistical analysis in real-time is even less common. With the adoption of distributed computing platforms, like Apache Spark, new analysis opportunities become available to leverage large-scale time-series data sets to optimize performance. Two case studies are presented in this talk: the rate of penetration (ROP) and the amount of vibration per run.
By collecting real-time, telemetry data and comparing it with historic sample datasets within the Databricks Unified Analytics Platform, the optimization team was able to quickly determine whether the performance being delivered matched or exceeded past performance with statistical certainty. This is extremely important while trying new techniques with data that is highly variable. By substituting anecdotal evidence with statistical analysis, decision making is more precise and better informed. In this talk we'll share how we accomplished this and the lessons learned along the way."
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
Configuration Driven Reporting On Large Dataset Using Apache SparkDatabricks
In financial world, petabytes of transactional data need to be stored, processed, distributed across global customers and partners in a secured, compliant and accurate way with high availability, resiliency and observability. In American Express, we need to generate hundreds of different kinds of reports and distribute to thousands of partners in different schedules based on billions of daily transactions. Our next generation reporting framework is a highly configurable enterprise framework that caters to different reporting needs with zero development. This reusable framework entails dynamic scheduling of partner-specific reports, transforming, aggregating and filtering the data into different dataframes using inbuilt as well as user-defined spark functions leveraging spark’s in memory and parallel processing capabilities. This also encompasses applying business rules and converting it into different formats by embedding template engines like FreeMarker and Mustache into the framework.
Observability for Data Pipelines With OpenLineageDatabricks
Data is increasingly becoming core to many products. Whether to provide recommendations for users, getting insights on how they use the product, or using machine learning to improve the experience. This creates a critical need for reliable data operations and understanding how data is flowing through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.
Collecting this lineage metadata as data pipelines are running provides an understanding of dependencies between many teams consuming and producing data and how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API standardizing this metadata across the ecosystem, reducing complexity and duplicate work in collecting lineage information. It enables many projects, consumers of lineage in the ecosystem whether they focus on operations, governance or security.
Marquez is an open source project part of the LF AI & Data foundation which instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making visible dependencies across organizations and technologies as they change over time.
Cloud Experience: Data-driven Applications Made Simple and FastDatabricks
A complex real-time data workflow implementation is very challenging. This session will describe the architecture of a data platform that provides a single, secure, high-performance system that can be deployed in a hybrid cloud architectures. We will present how to support simultaneous, consistent and high-performance access through multiple industry open source and cloud compatible standards of streaming, table, TSDB, object, and file APIs. A new serverless technology is also used in the architecture to support a dynamic and flexible implementations. The presenter will also outline how the platform was integrated with the Spark eco-system, including AI and ML tools, to simplify the development process
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkDatabricks
Around the world, businesses are turning to AI to transform the way they operate and serve their customers. But before they can implement these technologies, companies must address the roadblock of moving from batch analytics to making real-time decisions by rapidly accessing and analyzing the relevant information amidst a sea of data. Yaron will explain how to make Spark handle multivariate real-time, historical and event data simultaneously to provide immediate and intelligent responses. He will present several time sensitive use-cases including fraud detection, prevention of outages and customer recommendations to demonstrate how to perform predictive analytics and real-time actions with Spark.
Speaker: Yaron Ekshtein
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...Databricks
"In the oil and gas industry, utilizing vast amounts of data has long been identified as an important indicator of operational performance. The measurement of key performance indicators is a routine practice in well construction, but a systematic way of statistically analyzing performance against a large data bank of offset wells is not a common practice. The performance of statistical analysis in real-time is even less common. With the adoption of distributed computing platforms, like Apache Spark, new analysis opportunities become available to leverage large-scale time-series data sets to optimize performance. Two case studies are presented in this talk: the rate of penetration (ROP) and the amount of vibration per run.
By collecting real-time, telemetry data and comparing it with historic sample datasets within the Databricks Unified Analytics Platform, the optimization team was able to quickly determine whether the performance being delivered matched or exceeded past performance with statistical certainty. This is extremely important while trying new techniques with data that is highly variable. By substituting anecdotal evidence with statistical analysis, decision making is more precise and better informed. In this talk we'll share how we accomplished this and the lessons learned along the way."
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
Configuration Driven Reporting On Large Dataset Using Apache SparkDatabricks
In financial world, petabytes of transactional data need to be stored, processed, distributed across global customers and partners in a secured, compliant and accurate way with high availability, resiliency and observability. In American Express, we need to generate hundreds of different kinds of reports and distribute to thousands of partners in different schedules based on billions of daily transactions. Our next generation reporting framework is a highly configurable enterprise framework that caters to different reporting needs with zero development. This reusable framework entails dynamic scheduling of partner-specific reports, transforming, aggregating and filtering the data into different dataframes using inbuilt as well as user-defined spark functions leveraging spark’s in memory and parallel processing capabilities. This also encompasses applying business rules and converting it into different formats by embedding template engines like FreeMarker and Mustache into the framework.
Observability for Data Pipelines With OpenLineageDatabricks
Data is increasingly becoming core to many products. Whether to provide recommendations for users, getting insights on how they use the product, or using machine learning to improve the experience. This creates a critical need for reliable data operations and understanding how data is flowing through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.
Collecting this lineage metadata as data pipelines are running provides an understanding of dependencies between many teams consuming and producing data and how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API standardizing this metadata across the ecosystem, reducing complexity and duplicate work in collecting lineage information. It enables many projects, consumers of lineage in the ecosystem whether they focus on operations, governance or security.
Marquez is an open source project part of the LF AI & Data foundation which instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making visible dependencies across organizations and technologies as they change over time.
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...Databricks
<p>In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 to make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides Pandas-like API on top of Spark, helps data scientists gain insights from their data quicker.</p>
Self-Service Apache Spark Structured Streaming Applications and AnalyticsDatabricks
Organizations are increasingly building more and more Apache Spark Structured Streaming Applications for IoT analytics, real-time fraud detection, anomaly detection, analyzing streaming data from devices, turbines etc. However building the streaming applications and operationalizing them is challenging. There is a need for a self-serve platform on Spark Structured Streaming to enable many users to quickly build, deploy, run and monitor a variety of big data streaming use cases. At Sparkflows we built out a Self-Service Platform for building Structured Streaming Applications in minutes. Variety of users can log in with their Browser and build and deploy these applications seamlessly with drag and drop of 200+ Processors. They can also build charts on the streaming data and perform streaming analytics. In this talk we will dive deeper into our journey. We started with a workflow editor and workflow engine for building and running structured streaming jobs. We will explain how we built out the connectors to streaming sources for running in the designer mode, perform ML model scoring with real-time ingestion, streaming analytics, schema inference and propagation and displaying results in continuously moving charts. We will go over how we built self-serve streaming data preparation, lookup and analytics with SQL, Scala, Python etc. Finally, we will also discuss how we enabled deployment, operationalization and monitoring of the long running Structured Streaming jobs. We want to show how Spark can be used to enable very complex Self-Serve Big Data Streaming Applications and Analytics for Enterprises.
Speaker: Jayant Shekhar
Realtime streaming architecture in INFINARIOJozo Kovac
About our experience with realtime analyses on never-ending stream of user events. Discuss Lambda architecture, Kappa, Apache Kafka and our own approach.
Power Your Delta Lake with Streaming Transactional ChangesDatabricks
Organizations are adopting data digitization and data-driven decision making is at the heart of this transformation. Cloud Data Lakes and Datawarehouses provide great flexibility to proto-type and roll out applications continuously at much lower costs.
Transactional databases are optimized for processing huge volumes of transactions in real-time, whereas the cloud data lake needs to be optimized for analyzing huge volumes of data quickly. This brings about a challenge in creating a streamlined data flow process from capturing realtime transactions into a cloud datawarehouse to drive realtime insights in a scalable and cost effective manner.
In this session, we’ll show how organizations can easily overcome that challenge by adopting a robust platform with StreamSets and Delta Lake. StreamSets provides a no-code framework to automate ingestion of transactional data and data processing on Spark, while Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
Build Real-Time Applications with Databricks StreamingDatabricks
In this presentation, we will study a recent use case we implemented recently. In this use case we are working with a large, metropolitan fire department. Our company has already created a complete analytics architecture for the department based upon Azure Data Factory, Databricks, Delta Lake, Azure SQL and Azure SQL Server Analytics Services (SSAS). While this architecture works very well for the department, they would like to add a real-time channel to their reporting infrastructure.
This channel should serve up the following information: •The most up-to-date locations and status of equipment (fire trucks, ambulances, ladders etc.)
• The current locations and status of firefighters, EMT personnel and other relevant fire department employees
• The current list of active incidents within the city The above information should be visualized through an automatically updating dashboard. The central component of the dashboard will be map which automatically updates with the locations and incidents. This view should be as real-time as possible and will be used by the fire chiefs to assist with real-time decision-making on resource and equipment deployments.
In this presentation, we will leverage Databricks, Spark Structured Streaming, Delta Lake and the Azure platform to create this real-time delivery channel.
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
Redis accelerates Apache Spark execution by 45 times, when used as a shared distributed in-memory datastore for Spark in analyses like time series data range queries. With the redis module for machine learning, redis-ml, implementation of spark-ml models gains a new real time serving layer that offloads processing of models directly in Redis, allows multiple applications to reuse the same models and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs’ connector for Apache Spark that enhances production implementations of real-time big data processing.
Life occurs in real-time, and not surprisingly, more solutions are being built using streaming technologies. Event-based architectures are becoming the norm, and customers are expecting immediate access to their data. This new world offers many exciting opportunities, but also some new challenges. What do you do when your streaming data is not complete? What if it relies on another data source? Does the dependent data exist yet, and does it come from a 3rd party? How do we merge a complete picture of data when data is sourcing from multiple places at the same time? A new norm in the world of distributed services. Join us as we dive deep into the technical details around these scenarios and more. Expect to learn about stream-stream joins, enriching stream data using local or remote data, and ways to anticipate and correct errors within the stream. Leave with a better understanding of managing data dependencies within a Spark Structured Streaming application.
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
"Most data practitioners grapple with data quality issues and data pipeline complexities—it's the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark API’s. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance and how Delta can help. Through presentation, code examples and notebooks, we will explain pipeline challenges and the use of Delta to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class. WHAT
YOU’LL LEARN:
– Understand the key data reliability and performance data pipelines challenges
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment – How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta
PREREQUISITES:
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speakers: Steven Yu, Burak Yavuz
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day, and generates metrics in near real-time (within 30 seconds). All the components, used along with Apache Spark, are horizontally scalable using any auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned. For example, they still use Spark Streaming’s receiver-based method in certain use cases instead of Direct Streaming, and will share the application of both the methods, giving the knowledge back to the community.
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
Building Data Pipelines with Spark and StreamSetsPat Patterson
Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an Apache 2.0 licensed open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we’ll look at how SDC’s “intent-driven” approach keeps the data flowing, with a particular focus on clustered deployment with Spark and other exciting Spark integrations in the works.
Streaming Analytics for Financial EnterprisesDatabricks
Streaming Analytics (or Fast Data processing) is becoming an increasingly popular subject in the financial sector. There are two main reasons for this development. First, more and more data has to be analyze in real-time to prevent fraud; all transactions that are being processed by banks have to pass and ever-growing number of tests to make sure that the money is coming from and going to legitimate sources. Second, customers want to have friction-less mobile experiences while managing their money, such as immediate notifications and personal advise based on their online behavior and other users’ actions.
A typical streaming analytics solution follows a ‘pipes and filters’ pattern that consists of three main steps: detecting patterns on raw event data (Complex Event Processing), evaluating the outcomes with the aid of business rules and machine learning algorithms, and deciding on the next action. At the core of this architecture is the execution of predictive models that operate on enormous amounts of never-ending data streams.
In this talk, I’ll present an architecture for streaming analytics solutions that covers many use cases that follow this pattern: actionable insights, fraud detection, log parsing, traffic analysis, factory data, the IoT, and others. I’ll go through a few architecture challenges that will arise when dealing with streaming data, such as latency issues, event time vs server time, and exactly-once processing. The solution is build on the KISSS stack: Kafka, Ignite, and Spark Structured Streaming. The solution is open source and available on GitHub.
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Databricks
Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. Workday is a “pure SaaS” company, providing a suite of Financial and HCM (Human Capital Management) apps to about 2000 companies around the world, including more than 30% from Fortune-500 list. There are significant business and technical challenges to support millions of concurrent users and hundreds of millions daily transactions. Using memory-centric graph-based architecture allowed to overcome most of these problems.
As Workday grew, data transactions from existing and new customers generated vast amounts of valuable and highly sensitive data. The next big challenge was to provide in-app analytics platform, which for the multiple types of accumulated data, and also would allow using blend in external datasets. Workday users wanted it to be super-fast, but also intuitive and easy-to-use both for the financial and HR analysts and for regular, less technical users. Existing backend technologies were not a good fit, so we turned to Apache Spark.
In this presentation, we will share the lessons we learned when building highly scalable multi-tenant analytics service for transactional data. We will start with the big picture and business requirements. Then describe the architecture with batch and interactive modules for data preparation, publishing, and query engine, noting the relevant Spark technologies. Then we will dive into the internals of Prism’s Query Engine, focusing on Spark SQL, DataFrames and Catalyst compiler features used. We will describe the issues we encountered while compiling and executing complex pipelines and queries, and how we use caching, sampling, and query compilation techniques to support interactive user experience.
Finally, we will share the future challenges for 2018 and beyond.
SQL Analytics Powering Telemetry Analysis at ComcastDatabricks
Comcast is one of the leading providers of communications, entertainment, and cable products and services. At the heart of it is Comcast RDK providing the backbone of telemetry to the industry. RDK (Reference Design Kit) is pre-bundled opensource firmware for a complete home platform covering video, broadband and IoT devices. RDK team at Comcast analyzes petabytes of data, collected every 15 minutes from 70 million devices (video and broadband and IoT devices) installed in customer homes. They run ETL and aggregation pipelines and publish analytical dashboards on a daily basis to reduce customer calls and firmware rollout. The analysis is also used to calculate WIFI happiness index which is a critical KPI for Comcast customer experience.
In addition to this, RDK team also does release tracking by analyzing the RDK firmware quality. SQL Analytics allows customers to operate a lakehouse architecture that provides data warehousing performance at data lake economics for up to 4x better price/performance for SQL workloads than traditional cloud data warehouses.
We present the results of the “Test and Learn” with SQL Analytics and the delta engine that we worked in partnership with the Databricks team. We present a quick demo introducing the SQL native interface, the challenges we faced with migration, The results of the execution and our journey of productionizing this at scale.
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but also adds substantial performance overheads due to the fact that all data and intermediate state of compute task is stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...Databricks
<p>In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 to make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides Pandas-like API on top of Spark, helps data scientists gain insights from their data quicker.</p>
Self-Service Apache Spark Structured Streaming Applications and AnalyticsDatabricks
Organizations are increasingly building more and more Apache Spark Structured Streaming Applications for IoT analytics, real-time fraud detection, anomaly detection, analyzing streaming data from devices, turbines etc. However building the streaming applications and operationalizing them is challenging. There is a need for a self-serve platform on Spark Structured Streaming to enable many users to quickly build, deploy, run and monitor a variety of big data streaming use cases. At Sparkflows we built out a Self-Service Platform for building Structured Streaming Applications in minutes. Variety of users can log in with their Browser and build and deploy these applications seamlessly with drag and drop of 200+ Processors. They can also build charts on the streaming data and perform streaming analytics. In this talk we will dive deeper into our journey. We started with a workflow editor and workflow engine for building and running structured streaming jobs. We will explain how we built out the connectors to streaming sources for running in the designer mode, perform ML model scoring with real-time ingestion, streaming analytics, schema inference and propagation and displaying results in continuously moving charts. We will go over how we built self-serve streaming data preparation, lookup and analytics with SQL, Scala, Python etc. Finally, we will also discuss how we enabled deployment, operationalization and monitoring of the long running Structured Streaming jobs. We want to show how Spark can be used to enable very complex Self-Serve Big Data Streaming Applications and Analytics for Enterprises.
Speaker: Jayant Shekhar
Realtime streaming architecture in INFINARIOJozo Kovac
About our experience with realtime analyses on never-ending stream of user events. Discuss Lambda architecture, Kappa, Apache Kafka and our own approach.
Power Your Delta Lake with Streaming Transactional ChangesDatabricks
Organizations are adopting data digitization and data-driven decision making is at the heart of this transformation. Cloud Data Lakes and Datawarehouses provide great flexibility to proto-type and roll out applications continuously at much lower costs.
Transactional databases are optimized for processing huge volumes of transactions in real-time, whereas the cloud data lake needs to be optimized for analyzing huge volumes of data quickly. This brings about a challenge in creating a streamlined data flow process from capturing realtime transactions into a cloud datawarehouse to drive realtime insights in a scalable and cost effective manner.
In this session, we’ll show how organizations can easily overcome that challenge by adopting a robust platform with StreamSets and Delta Lake. StreamSets provides a no-code framework to automate ingestion of transactional data and data processing on Spark, while Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
Build Real-Time Applications with Databricks StreamingDatabricks
In this presentation, we will study a recent use case we implemented recently. In this use case we are working with a large, metropolitan fire department. Our company has already created a complete analytics architecture for the department based upon Azure Data Factory, Databricks, Delta Lake, Azure SQL and Azure SQL Server Analytics Services (SSAS). While this architecture works very well for the department, they would like to add a real-time channel to their reporting infrastructure.
This channel should serve up the following information: •The most up-to-date locations and status of equipment (fire trucks, ambulances, ladders etc.)
• The current locations and status of firefighters, EMT personnel and other relevant fire department employees
• The current list of active incidents within the city The above information should be visualized through an automatically updating dashboard. The central component of the dashboard will be map which automatically updates with the locations and incidents. This view should be as real-time as possible and will be used by the fire chiefs to assist with real-time decision-making on resource and equipment deployments.
In this presentation, we will leverage Databricks, Spark Structured Streaming, Delta Lake and the Azure platform to create this real-time delivery channel.
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
Redis accelerates Apache Spark execution by 45 times, when used as a shared distributed in-memory datastore for Spark in analyses like time series data range queries. With the redis module for machine learning, redis-ml, implementation of spark-ml models gains a new real time serving layer that offloads processing of models directly in Redis, allows multiple applications to reuse the same models and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs’ connector for Apache Spark that enhances production implementations of real-time big data processing.
Life occurs in real-time, and not surprisingly, more solutions are being built using streaming technologies. Event-based architectures are becoming the norm, and customers are expecting immediate access to their data. This new world offers many exciting opportunities, but also some new challenges. What do you do when your streaming data is not complete? What if it relies on another data source? Does the dependent data exist yet, and does it come from a 3rd party? How do we merge a complete picture of data when data is sourcing from multiple places at the same time? A new norm in the world of distributed services. Join us as we dive deep into the technical details around these scenarios and more. Expect to learn about stream-stream joins, enriching stream data using local or remote data, and ways to anticipate and correct errors within the stream. Leave with a better understanding of managing data dependencies within a Spark Structured Streaming application.
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
"Most data practitioners grapple with data quality issues and data pipeline complexities—it's the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark API’s. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance and how Delta can help. Through presentation, code examples and notebooks, we will explain pipeline challenges and the use of Delta to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class. WHAT
YOU’LL LEARN:
– Understand the key data reliability and performance data pipelines challenges
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment – How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta
PREREQUISITES:
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speakers: Steven Yu, Burak Yavuz
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation’s monitoring pipeline processes about 40 billion events every day, and generates metrics in near real-time (within 30 seconds). All the components, used along with Apache Spark, are horizontally scalable using any auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned. For example, they still use Spark Streaming’s receiver-based method in certain use cases instead of Direct Streaming, and will share the application of both the methods, giving the knowledge back to the community.
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
Building Data Pipelines with Spark and StreamSetsPat Patterson
Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an Apache 2.0 licensed open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we’ll look at how SDC’s “intent-driven” approach keeps the data flowing, with a particular focus on clustered deployment with Spark and other exciting Spark integrations in the works.
Streaming Analytics for Financial EnterprisesDatabricks
Streaming Analytics (or Fast Data processing) is becoming an increasingly popular subject in the financial sector. There are two main reasons for this development. First, more and more data has to be analyze in real-time to prevent fraud; all transactions that are being processed by banks have to pass and ever-growing number of tests to make sure that the money is coming from and going to legitimate sources. Second, customers want to have friction-less mobile experiences while managing their money, such as immediate notifications and personal advise based on their online behavior and other users’ actions.
A typical streaming analytics solution follows a ‘pipes and filters’ pattern that consists of three main steps: detecting patterns on raw event data (Complex Event Processing), evaluating the outcomes with the aid of business rules and machine learning algorithms, and deciding on the next action. At the core of this architecture is the execution of predictive models that operate on enormous amounts of never-ending data streams.
In this talk, I’ll present an architecture for streaming analytics solutions that covers many use cases that follow this pattern: actionable insights, fraud detection, log parsing, traffic analysis, factory data, the IoT, and others. I’ll go through a few architecture challenges that will arise when dealing with streaming data, such as latency issues, event time vs server time, and exactly-once processing. The solution is build on the KISSS stack: Kafka, Ignite, and Spark Structured Streaming. The solution is open source and available on GitHub.
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Databricks
Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. Workday is a “pure SaaS” company, providing a suite of Financial and HCM (Human Capital Management) apps to about 2000 companies around the world, including more than 30% from Fortune-500 list. There are significant business and technical challenges to support millions of concurrent users and hundreds of millions daily transactions. Using memory-centric graph-based architecture allowed to overcome most of these problems.
As Workday grew, data transactions from existing and new customers generated vast amounts of valuable and highly sensitive data. The next big challenge was to provide in-app analytics platform, which for the multiple types of accumulated data, and also would allow using blend in external datasets. Workday users wanted it to be super-fast, but also intuitive and easy-to-use both for the financial and HR analysts and for regular, less technical users. Existing backend technologies were not a good fit, so we turned to Apache Spark.
In this presentation, we will share the lessons we learned when building highly scalable multi-tenant analytics service for transactional data. We will start with the big picture and business requirements. Then describe the architecture with batch and interactive modules for data preparation, publishing, and query engine, noting the relevant Spark technologies. Then we will dive into the internals of Prism’s Query Engine, focusing on Spark SQL, DataFrames and Catalyst compiler features used. We will describe the issues we encountered while compiling and executing complex pipelines and queries, and how we use caching, sampling, and query compilation techniques to support interactive user experience.
Finally, we will share the future challenges for 2018 and beyond.
SQL Analytics Powering Telemetry Analysis at ComcastDatabricks
Comcast is one of the leading providers of communications, entertainment, and cable products and services. At the heart of it is Comcast RDK providing the backbone of telemetry to the industry. RDK (Reference Design Kit) is pre-bundled opensource firmware for a complete home platform covering video, broadband and IoT devices. RDK team at Comcast analyzes petabytes of data, collected every 15 minutes from 70 million devices (video and broadband and IoT devices) installed in customer homes. They run ETL and aggregation pipelines and publish analytical dashboards on a daily basis to reduce customer calls and firmware rollout. The analysis is also used to calculate WIFI happiness index which is a critical KPI for Comcast customer experience.
In addition to this, RDK team also does release tracking by analyzing the RDK firmware quality. SQL Analytics allows customers to operate a lakehouse architecture that provides data warehousing performance at data lake economics for up to 4x better price/performance for SQL workloads than traditional cloud data warehouses.
We present the results of the “Test and Learn” with SQL Analytics and the delta engine that we worked in partnership with the Databricks team. We present a quick demo introducing the SQL native interface, the challenges we faced with migration, The results of the execution and our journey of productionizing this at scale.
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but also adds substantial performance overheads due to the fact that all data and intermediate state of compute task is stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
So you know you want to write a streaming app but any non-trivial streaming app developer would have to think about these questions:
How do I manage offsets?
How do I manage state?
How do I make my spark streaming job resilient to failures? Can I avoid some failures?
How do I gracefully shutdown my streaming job?
How do I monitor and manage (e.g. re-try logic) streaming job?
How can I better manage the DAG in my streaming job?
When to use checkpointing and for what? When not to use checkpointing?
Do I need a WAL when using streaming data source? Why? When don’t I need one?
In this talk, we’ll share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
In this talk, we introduce the extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic-seamless resource allocation. First, we explain the methods of supporting window queries and query chains. As we know, last year, Grace Huang and Jerry Shao introduced the concept of “StreamSQL” that can process streaming data with SQL-like queries by adapting SparkSQL to Spark Streaming. However, we made advances in supporting complex event processing (CEP) based on their efforts. In detail, we implemented the sliding window concept to support a time-based streaming data processing at the SQL level. Here, to reduce the aggregation time of large windows, we generate an efficient query plan that computes the partial results by evaluating only the data entering or leaving the window and then gets the current result by merging the previous one and the partial ones. Next, to support query chains, we made the result of a query over streaming data be a table by adding the “insert into” query. That is, it allows us to apply stream queries to the results of other ones. Second, we explain the methods of allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, resources allocated to applications need to be adjusted for high resource utilization. However, the current Spark's resource allocation features are not suitable for streaming applications. That is, the resources allocated will not be freed when new data are arriving continuously to the streaming applications even though the quantity of the new ones is very small. In order to resolve the problem, we consider their resource utilization. If the utilization is low, we choose victim nodes to be killed. Then, we do not feed new data into the victims to prevent a useless recovery issuing when they are killed. Accordingly, we can scale-in/-out the resources seamlessly.
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
"Low latency analytics is becoming a very popular scenario. In this session we will discuss several architectural options for doing
analytics on moving data using Amazon Kinesis and EMR/Spark Streaming and share some best practices and real world examples."
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
A presentation cum workshop on Real time Analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging while other side Spark Streaming brings Spark's language-integrated API to stream processing, allows to write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, Zookeeper and Spark with a Web click streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. In Lambda architecture, the system involves three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries, and each comes with its own set of requirements.
In batch layer, it aims at perfect accuracy by being able to process the all available big dataset which is an immutable, append-only set of raw data using distributed processing system. Output will be typically stored in a read-only database with result completely replacing existing precomputed views. Apache Hadoop, Pig, and HIVE are
the de facto batch-processing system.
In speed layer, the data is processed in streaming fashion, and the real-time views are provided by the most recent data. As a result, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate as the views provided by batch layer's views created with full dataset, so they will be eventually replaced by the batch layer's views. Traditionally, Apache Storm is
used in this layer.
In serving layer, the result from batch layer and speed layer will be stored here, and it responds to queries in a low-latency and ad-hoc way.
One of the lambda architecture examples in machine learning context is building the fraud detection system. In speed layer, the incoming streaming data can be used for online learning to update the model learnt in batch layer to incorporate the recent events. After a while, the model can be rebuilt using the full dataset.
Why Spark for lambda architecture? Traditionally, different
technologies are used in batch layer and speed layer. If your batch system is implemented with Apache Pig, and your speed layer is implemented with Apache Storm, you have to write and maintain the same logics in SQL and in Java/Scala. This will very quickly becomes a maintenance nightmare. With Spark, we have an unified development framework for batch and speed layer at scale. In this talk, an end-to-end example implemented in Spark will be shown, and we will
discuss about the development, testing, maintenance, and deployment of lambda architecture system with Apache Spark.
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDb for low-latency queries - something you never thought possible with Spark. Scale it down, not just scale it up!
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark is an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine-learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction the Spark stack, explain how Spark has lighting fast results, and how it complements Apache Hadoop. By the end of the session, you’ll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
In this talk, we share our experience when we build up our data pipeline. We went from mongodb, and migrated to cassandra, and now we use kafka and spark to handle our data. We also talk about what problem encounter, why we select these solutions, and where we will go next.
Similar to Headaches and Breakthroughs in Building Continuous Applications (20)
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.
We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top a table; We load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
2. Landon Robinson & Jack Chapa
Spark Streaming
Headaches and Breakthroughs in Building
Continuous Applications
#UnifiedAnalytics #SparkAISummit
3. Who We Are
3#UnifiedAnalytics #SparkAISummit
Landon Robinson
Data Engineer
Jack Chapa
Data Engineer
Big Data Team @ SpotX
4. Because Spark Streaming...
• is very powerful
• can supercharge your infrastructure
• … and can be very complex!
Lots of headaches and breakthroughs!
But first… why are we here?
4#UnifiedAnalytics #SparkAISummit
5. ● Streaming Basics
● Testing
● Monitoring & Alerts
● Batch Intervals & Resources
Takeaways
Leave with a few actionable items that we wish we knew
when we started with Spark Streaming.
Focus Areas
● Helpful Configurations
● Backpressure
● Data Enrichment &
Transformations
7. We Process a Lot of Data
Data:
- 220 MM+ Total Files/Blocks
- 8 PB+ HDFS Space
- 20 TB+ new data daily
- 100MM+ records/minute
- 300+ Data Nodes
Apps:
- Thousands of daily Spark apps
- Hundreds of daily user queries
- Multiple 24/7 Streaming apps
8. Our uses include:
- Rapid ingestion of data into warehouse for querying
- Machine learning on near-live data streams
- Ability to react to and impact live situations
- Accelerated processing / updating of metadata
- Real-time visualization of data streams and processing
Spark Streaming is Key for Us
10. Spark Streaming Basics
Spark Streaming is an extension of Spark that
enables scalable, high-throughput, fault-tolerant
processing of live data streams.
• Stream == live data stream
– Topic == Kafka’s name for a stream
•
DStream == sequence of RDDs formed
from reading a data stream
• Batch == a self-contained job within your
Streaming app that processes a segment of
the stream.
12. Use Spark in Local Mode
You can start building Spark Streaming apps
in minutes, using Spark locally!
On your local machine
• No cluster needed!
• Great for rough testing
We Recommend:
IntelliJ Community Edition
• with SBT: For dependency management
13. Use Spark in Local Mode
In your build.sbt:
• src/test/scala => “provided”
• src/main/scala => “compiled”
The Scala Build Tool is your friend!
Simply:
• Import Spark libraries
• Invoke a Context and/or Session
• Set master to local[*] or local[n]
14. Example Unit Test using just a SparkContext
Invoke a local session:
• In your unit test classes
• Test logic on small datasets
Add to your deployment pipeline
for a nice pre-release gut check!
15. Unit Testing
Spark Streaming Apps can easily be unit tested
- Using .queueStream()
- Using a spark testing library
Libraries
- spark-testing-base
- sscheck
- spark-tests
Use Cases
- DStream actions
- Business Logic
- Integration
16. Example Library: spark-testing-base
- Easy to Use
- Helpful wrappers
- Integrates w/ scalatest
- Minimal code required
- Clock management
- Runs alongside other tests
GitHub: https://github.com/holdenk/spark-testing-base
18. Monitoring is Awesome
It can reveal:
• How your app is performing
• Problems + Bugs!
And provide opportunities to:
• See and address issues
• Observe behavior visually
But monitoring can be tough to implement!
19. Monitoring (a less than ideal approach)
You could do it all in the app...
Example: Looping over RDDs to:
• Count records
• Track Kafka offsets
• Processing time / delays
But it’s less than ideal...
• Calculating performance significantly
impacts performance… not great.
• All these metrics are calculated by
Spark!
20. Monitoring and Visualization (using Listeners)
Use Spark Listeners to access
metrics in the background!
Let Spark do the hard work:
• Batch duration, delays
• Record throughput
• Stream position recovery
Come to our talk: Spark Listeners:
A Crash Course in Fast, Easy
Monitoring!
• Room 3016 | Today @ 5:30 PM
22. Inside the Spark
Listener class, after a
batch completes...
You can access an
object generated by
Spark containing your
offsets processed.
Take those offsets and
back them up to a
DB...
Writing Offsets to MySQL
23. Your offsets are now
stored in a DB after
each batch completes.
Whenever your app
restarts, it reads those
offsets from the DB...
And starts processing
where it last left off!
Reading Offsets from MySQL
27. - Record timing info for fast
troubleshooting
- Escalate alarms to the
appropriate team
- Quickly resolve while
app continues running
Timing Logging (around actions)
28. React
How do I react to this monitoring?
● Heartbeats
● Scheduled Monitor Jobs
○ Version Updates
○ Ensure Running
○ Act on failure/problem
● Monitoring Alarms
● Look at them!
30. You want batches that process
faster than the interval, but
not so fast that resources are
idling and therefore wasted!
Setting Appropriate Batch Intervals
An appropriate batch interval is key to
an app that is quick and efficient.
Effectiveness of interval is
affected by:
• Resources alloc (cpu + ram)
• Quantity of work
• Quantity of data
Batch interval
31. Setting Appropriate Batch Intervals
Consider these questions:
How quickly do I need to process data?
• Can I slow it down to save resources?
What is my resource budget / allocation?
• Can I increase? Can I cut back?
• Bigger interval = more time to process
• … but also more data to process
• Smaller interval = the opposite
32. Tips for finding an optimal combination:
Start small!
a. Short batch interval (seconds)
b. Modest resources
Whichever you have in more flexible
supply (a or b), increasing accordingly.
Again: processing time < interval = good
Comfortably less, not significantly less.
Setting Appropriate Batch Intervals
33. Additional Resource Notes
- Scale down when possible
- Free up resources or save on cloud utilization spend
- Avoid preemption
- Use resource pools with prioritization
- With preemption disabled if you can
- Set appropriate # of partitions for Kafka topics
- Higher volume == higher partition count
- Higher partition count == greater parallelization
36. Backpressure
Use Case:
You have irregular spikes in message throughput from Kafka topics
• Backpressure dynamically alters rate data is received per batch from Kafka.
• Prevents overwhelming of app at startup and peak load.
Settings:
• spark.streaming.backpressure.enabled = true
• spark.streaming.kafka.maxRatePerPartition = 20000
– max rate (messages/second) at which each Kafka partition will be read
• PID Rate Estimator: can be used to tweak the rate based on batch performance
– spark.streaming.backpressure.pid.*
Source: https://www.linkedin.com/pulse/enable-back-pressure-make-your-spark-streaming-production-lan-jiang/
38. Transformations (Streaming + Static)
transform()
● Allows RDD-level access to data.
● Use case: joining with another RDD
updateStateByKey() / mapWithState()
● Apply function to each key - useful for keeping
track of state
● Use case: maintaining state between batches
(e.g. rolling join w/ two streams)
reduceByKey()
● Reduce a keyed RDD with appropriate
function.
● Use case: deduping, aggregations
39. Using the transform() method on DStream:
Apply an RDD-to-RDD function to every RDD of the DStream. Used for arbitrary RDD
operations on the DStream.
• Useful for applying arbitrary RDD operations on a DStream.
• Great for enriching streaming data with supplemental static data
Joining Streaming and Static Data
Source: https://hadoopsters.net/2017/11/26/how-to-join-static-data-with-streaming-data-dstream-in-spark/
transactions = … // streaming dataset (dstream)
transaction_details = … // static dataset (rdd)
val complete_transaction_data = transactions.transform(live_transaction =>
live_transaction.join(transaction_details))
40. Effective Static Joining
How do we handle static and persistent data?
Driver:
● Broadcast if small enough
● Read on driver every batch, then join
Worker:
● Connect on worker - lazy val connection
object
● Useful for persisting data
41. Streaming isn’t always easy… but here are some great takeaways!
• Testing: Use Spark Locally w/ Unit Tests
• Monitoring: Use Listeners & React
• Batch Intervals & Resources: Be thoughtful!
• Configuration: Lots of awesome ones!
• Transformations: Do more with your streaming data!
• Offset Recovery: Stop worrying and love the offset management!
Review