Intel IT empowers business units to easily make rapid, impactful business decisions. Ingesting a variety of internal and external data sources, however, poses real challenges. This slideset covers how Intel IT overcame those challenges with Hadoop and Gobblin. Learn more at http://www.intel.com/itcenter
3. Outline
Integrated Analytics Vision
Data Ingestion Challenges
Solution
What we would like to do
What we did
Challenges
Need Help
Summary
4. Integrated Analytics Vision & Mission
Our Vision: Customers are empowered to easily make rapid, impactful business decisions and uncover new revenue channels through connected data & analytics.
Our Mission: Provide clean, relatable, integrated data using a consistent approach to deliver business recommendations and insights through visual and interactive usage.
Diagram: Raw Data → Transformed and Connected Data → Advanced Analytics
5. As Is – Data Ingestion Architecture
Diagram: Internal source systems (RDBMS, flat/CSV files, DataMarts, EDW, logs) and external source systems (SFTP, vendor utilities) feed data through firewall and proxy channels to the gateway node of the IT BI Hadoop cluster. Each feed uses its own mechanism (Camel routes, vendor utilities, custom utilities, Python scripts) followed by "hadoop put" into Hadoop storage (HDFS/Hive). Downstream, the data is transformed and consumed through visualization tools and client tools (sales CRM, marketing campaign management, content tagging, webinars).
6. Data Ingestion Challenges
Ingesting a variety of internal/external data sources, such as the enterprise data warehouse, enterprise master data, spreadsheets, social media feeds, marketing data, retailer data, etc., resulted in a variety of challenges, including:
• Individual project teams implementing their own methods for ingesting data from various sources and building their own data pipelines
• Operational complexity in managing the individual pipelines
• No reusability, as each project team created redundant methods/codebases for ingesting data sources
• High development cost, as each team built its own data ingestion pipelines
• Inconsistency in the quality of project teams' data ingestion codebases, impacting data quality and reliability
• Job failures resulting from data format, quality, schema evolution, and availability issues
• Skill-set challenges
No standardized, reusable framework for data ingestion
7. Solution: Data Ingestion Architecture with Gobblin/Kite
Diagram: Internal and external source systems (RDBMS, flat/CSV files, logs, SFTP, vendor APIs, RESTful APIs, Kafka, retailer data, social media feeds, DataMarts, EDW, and many more) flow through firewall and proxy channels into a reusable data ingestion framework running on the gateway node of the IT BI Hadoop cluster. The framework is built around a Gobblin interface with pluggable adapters (file adapter, CSV adapter, RDBMS/JDBC connector), job config files, validation, alerting, and a UI, with Kite alongside, publishing into Hadoop storage (Hive/HDFS/HBase). Data is then consumed through visualization tools and client tools (sales CRM, marketing campaign management, content tagging, webinars).
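As context for the reusable framework above, a Gobblin ingestion pipeline is normally declared in a job configuration (.pull/.properties file) that wires together a source, converters, quality-check policies, a writer, and a publisher. The sketch below is illustrative only, not Intel's actual configuration: the property keys and the task-policy, writer, and publisher classes follow Gobblin's documented getting-started example and may vary by version, while the source, converter, and row-policy classes are hypothetical placeholders.

# Hypothetical Gobblin job configuration sketching the reusable ingestion pattern
job.name=MarketingCsvIngest
job.group=IntegratedAnalytics

# Source adapter that lists files/API objects and creates work units (hypothetical class)
source.class=com.example.ingest.CsvFileSource

# Optional record transformations applied in the task flow (hypothetical class)
converter.classes=com.example.ingest.CsvToAvroConverter

# Row- and task-level quality policies (see the Result enum later in this deck)
qualitychecker.row.policies=com.example.ingest.NotNullRowPolicy
qualitychecker.task.policies=gobblin.policies.count.RowCountPolicy

# Writer and publisher: land Avro files on HDFS, then publish them atomically
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher

In such a setup, each adapter (file, CSV, RDBMS/JDBC) can be reused across teams by swapping the source class and connection properties in the config, rather than writing a new pipeline.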
8. What we set out to do
Functionally evaluate Gobblin for ingesting and integrating data.
Prototype a non-OOB (not out-of-the-box) source to extract data out of an "online campaign automation provider".
Acceptance criteria:
• Bulk REST API
• Validate the correctness of data
• Data consistency from end to end
• Notification, status, and error logging
• Ability to log kickout records
• Training plan for implementation and adoption
9. What we did
Data scope
• 4 objects
  – accounts
  – contacts
• 9 activities
• 59 custom objects
Parallel load data
• Hive (not using compaction) *
• HDFS (BaseDataPublisher)
Functional UI ready
• Scheduling
• Job History
• Authoring job configurations
Functional backend ready
• Enterprise scheduler
• Gobblin Standalone
• Gobblin Map-Reduce *
Quality checking policies
• Row level
• Task level
Enterprise features
• Alerting
• Monitoring
• Profiling *
• Logging
* Needs more attention
10. Process Flow
Establish connection: authentication, endpoint indirection
Object determination: get object listing, get schema definition, slice schema
Create intent: create exports
Establish size boundaries: create syncs, poll syncs, slice batches
Download: parallel batches
Rebuild data: reassemble, schema inferencing, data conversion
Data publishing: Hive/Impala load, view definition, quality enforcement
Parallel download and reassembly of data blocks (a rough code sketch of this flow follows)
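In Gobblin terms, a flow like this maps onto a custom Source (which establishes the connection, determines objects, and slices exports into work units) and an Extractor (which downloads and reassembles one batch and streams records into the task flow). The skeleton below is a rough, hypothetical sketch of that mapping, not the actual implementation: the Source/Extractor interfaces and packages follow Gobblin's public API (exact signatures vary by version), and CampaignClient is an invented placeholder for the provider's bulk REST client.

// Hypothetical sketch of the custom source described above.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import gobblin.configuration.SourceState;
import gobblin.configuration.WorkUnitState;
import gobblin.source.Source;
import gobblin.source.extractor.DataRecordException;
import gobblin.source.extractor.Extractor;
import gobblin.source.workunit.WorkUnit;

public class CampaignBulkSource implements Source<String, String> {

  /** Invented placeholder for the provider's bulk REST API client. */
  interface CampaignClient {
    List<String> listObjects();                 // object determination
    List<String> createExport(String object);   // create intent + slice into batch ids
    List<String> downloadBatch(String batchId); // download one batch
    static CampaignClient connect(String endpoint) {
      throw new UnsupportedOperationException("placeholder: authenticate and resolve endpoint here");
    }
  }

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    // Establish connection, list objects, and slice exports into one work unit per batch.
    List<WorkUnit> workUnits = new ArrayList<>();
    CampaignClient client = CampaignClient.connect(state.getProp("campaign.endpoint"));
    for (String object : client.listObjects()) {
      for (String batchId : client.createExport(object)) {
        WorkUnit workUnit = WorkUnit.createEmpty();  // exact WorkUnit factory varies by Gobblin version
        workUnit.setProp("campaign.object", object);
        workUnit.setProp("campaign.batch.id", batchId);
        workUnits.add(workUnit);
      }
    }
    return workUnits;
  }

  @Override
  public Extractor<String, String> getExtractor(WorkUnitState state) throws IOException {
    CampaignClient client = CampaignClient.connect(state.getProp("campaign.endpoint"));
    List<String> records = client.downloadBatch(state.getProp("campaign.batch.id"));

    // Reassemble the batch and hand records to the task flow one at a time;
    // converters, quality checkers, writer, and publisher run downstream.
    return new Extractor<String, String>() {
      private int next = 0;

      @Override public String getSchema() { return "{\"type\":\"record\"}"; } // schema inferencing goes here
      @Override public String readRecord(String reuse) throws DataRecordException, IOException {
        return next < records.size() ? records.get(next++) : null;            // null signals end of data
      }
      @Override public long getExpectedRecordCount() { return records.size(); }
      @Override public long getHighWatermark() { return 0; }
      @Override public void close() throws IOException { }
    };
  }

  @Override
  public void shutdown(SourceState state) {
    // Clean up connections and any outstanding export intents.
  }
}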
11. Gobblin Challenges
• User interface – visual execution and evaluation
• Data routing – complex enterprise integration pattern routing is challenging to implement
public enum Result {
  PASSED, // The test passed
  FAILED  // The test failed
}
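The PASSED/FAILED values above match the result type used by Gobblin's quality-check policies (the row-level and task-level policies noted on the "What we did" slide). As a purely illustrative sketch, a row-level policy is roughly a class that inspects one record and returns one of those two values; the base-class package, constructor, and method signature shown here follow our reading of Gobblin's pre-Apache API and may differ in other versions.

// Illustrative row-level quality policy: reject records that arrive empty.
// Base class and signatures are assumptions; verify against your Gobblin version.
import gobblin.configuration.State;
import gobblin.qualitychecker.row.RowLevelPolicy;

public class NotNullIdPolicy extends RowLevelPolicy {

  public NotNullIdPolicy(State state, RowLevelPolicy.Type type) {
    super(state, type);
  }

  @Override
  public Result executePolicy(Object record) {
    // Kick out records with no content; these can be logged per the
    // "ability to log kickout records" acceptance criterion.
    if (record == null || record.toString().isEmpty()) {
      return Result.FAILED;
    }
    return Result.PASSED;
  }
}

A policy like this would be referenced from the job configuration (for example, via the qualitychecker.row.policies property sketched earlier).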
12. Need Gobblin Community Help
Address adoption challenges
Intake process for third-party contributions
– New source: "online campaign automation provider"
– Spark-based ingestion candidates (Parquet, Avro, JSON, JDBC, S3) and runtime
– Kite SDK
Partnership with key big data vendors – CDH, HDP, MapR – for internalizing Gobblin capability
– Deployment, management, metrics, and lineage integration
Implement queuing or pluggable schedulers that do not rely on PID and workdir state; better integration with enterprise schedulers
Make Hive publishers native, versus offline compactions
Publish documentation for the user community
13. Summary
Gobblin is a robust data integration framework that meets the scale, quality, and enterprise-readiness imperatives we expected.
However, some areas, such as usability, enterprise integration patterns, scheduling, profiling, lineage, deployment, and documentation, could still be improved.