Apache Spark is increasingly adopted as an alternative processing framework to MapReduce, thanks to its ability to speed up batch, interactive, and streaming analytics. Spark enables new analytics use cases such as machine learning and graph analysis through its rich, easy-to-use programming libraries, and it offers the flexibility to run analytics on data stored in Hadoop, in object stores, and in traditional databases. This makes Spark an ideal platform for accelerating cross-platform analytics on-premises and in the cloud. Building on the success of the Spark 1.x releases, Spark 2.x delivers major improvements in the areas of APIs, performance, and Structured Streaming. In this paper, we give a high-level view of the Apache Spark framework and then focus on what we consider the most important improvements in Apache Spark 2.x. We then present the results of a real-world benchmark effort, detail the Spark and environment configuration changes made in our lab, discuss the benchmark results, and provide a reference architecture example for those interested in taking Spark 2.x for their own test drive. The presentation stresses the value of refreshing Spark 1.x deployments with Spark 2.x: performance testing shows a 2.3x improvement on Spark SQL workloads similar to TPC Benchmark™ DS (TPC-DS).
MARK LOCHBIHLER, Principal Architect, Hortonworks, and VIPLAVA MADASU, Big Data Systems Engineer, Hewlett Packard Enterprise
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic - DataScienceConferenc1
This document provides an overview of the Databricks platform. It discusses how Databricks combines features of data warehouses and data lakes to create a "data lakehouse" that supports both business intelligence/reporting and data science/machine learning use cases. Key components of the Databricks platform include Apache Spark, Delta Lake, MLflow, notebooks, and Delta Live Tables. The platform aims to unify data engineering, data warehousing, streaming, and data science tasks on a single open-source platform.
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath... - Databricks
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood that make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming (an illustrative sketch follows the list below).
In particular, I’m going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
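For orientation only, here is a minimal Scala sketch of the explicit stateful pattern named above; it is not code from the talk. The Click/ClickCount types are hypothetical, and the built-in rate source stands in for a real event stream such as Kafka.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical event and state types for illustration only.
case class Click(userId: String, page: String)
case class ClickCount(userId: String, count: Long)

object StatefulSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("mapGroupsWithState-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in streaming source; in practice this would be parsed Kafka records.
    val clicks = spark.readStream
      .format("rate").option("rowsPerSecond", "10").load()
      .selectExpr("cast(value % 5 as string) as userId", "'home' as page")
      .as[Click]

    // Explicit per-key state: keep a running count of clicks per user.
    val counts = clicks
      .groupByKey(_.userId)
      .mapGroupsWithState[Long, ClickCount](GroupStateTimeout.NoTimeout) {
        (userId: String, events: Iterator[Click], state: GroupState[Long]) =>
          val newCount = state.getOption.getOrElse(0L) + events.size
          state.update(newCount) // persisted in the State Store between micro-batches
          ClickCount(userId, newCount)
      }

    counts.writeStream
      .outputMode("update") // mapGroupsWithState emits updated rows
      .format("console")
      .start()
      .awaitTermination()
  }
}
```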
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud - Noritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
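As a rough illustration of the kind of tuning discussed (not settings recommended by the presentation), the sketch below sets a few common S3A properties from a Spark job; the bucket paths and values are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object S3ATuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("s3a-tuning-sketch").getOrCreate()
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    // Common S3A knobs discussed in S3 tuning guides; values are illustrative only.
    hadoopConf.set("fs.s3a.connection.maximum", "100")   // parallel connections to S3
    hadoopConf.set("fs.s3a.multipart.size", "134217728") // 128 MB multipart upload parts
    hadoopConf.set("fs.s3a.fast.upload", "true")         // buffer uploads instead of temp files

    // Read and write directly against an S3 bucket (bucket/paths are hypothetical).
    val df = spark.read.parquet("s3a://my-bucket/input/")
    df.groupBy("some_column").count()
      .write.mode("overwrite").parquet("s3a://my-bucket/output/")

    spark.stop()
  }
}
```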
Building an Effective Data Warehouse Architecture - James Serra
Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.
This document discusses using Datadog for observability in Elixir applications. It covers instrumenting applications to collect metrics using the Statix library and tracing requests using the Spandex library. Specific libraries are mentioned for integrating metrics and tracing into Phoenix, Absinthe, Ecto and custom code. Issues with tracing like overhead and asynchronously spawned processes are also noted.
Apache Iceberg Presentation for the St. Louis Big Data IDEA - Adam Doyle
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an open table format that works with engines such as Hive and Spark.
Continuous Data Ingestion pipeline for the Enterprise - DataWorks Summit
A continuous data ingestion platform built on NiFi and Spark integrates a variety of data sources, including real-time events, data from external sources, and structured and unstructured data, with in-flight governance, providing a real-time pipeline that moves data from source to consumption in minutes. The next-generation data pipeline has helped eliminate legacy batch latency and improve data quality and governance through custom NiFi processors and embedded Spark code. To meet stringent regulatory requirements, the pipeline is being augmented with in-flight ETL and data quality (DQ) checks, enabling a continuous workflow that promotes raw/unclassified data to enriched/classified data available for consumption by users and production processes.
The Parquet Format and Performance Optimization Opportunities - Databricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
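The sketch below illustrates, under assumed paths and column names, how partitioning, sorting within partitions, and filter pushdown come together when writing and reading Parquet from Spark; it is an illustration of the ideas above, not material from the talk.

```scala
import org.apache.spark.sql.SparkSession

object ParquetLayoutSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("parquet-layout-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical events dataset; paths and column names are illustrative.
    val events = Seq(
      ("2024-01-01", "click", 10L),
      ("2024-01-01", "view", 3L),
      ("2024-01-02", "click", 7L)
    ).toDF("event_date", "event_type", "value")

    // Partition on a low-cardinality column and sort within partitions so that
    // min/max statistics and dictionary pages become effective for filtering.
    events
      .repartition($"event_date")
      .sortWithinPartitions("event_type")
      .write
      .mode("overwrite")
      .option("compression", "snappy")
      .partitionBy("event_date")
      .parquet("/tmp/events_parquet")

    // Predicate pushdown: partition pruning on event_date, row-group/page skipping on event_type.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true") // default, shown for clarity
    spark.read.parquet("/tmp/events_parquet")
      .filter($"event_date" === "2024-01-02" && $"event_type" === "click")
      .show()

    spark.stop()
  }
}
```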
This is a brief technology introduction to Oracle Stream Analytics, and how to use the platform to develop streaming data pipelines that support a wide variety of industry use cases
Iceberg provides capabilities beyond traditional partitioning of data in Spark/Hive. It allows updating or deleting individual rows without rewriting whole partitions, for example through merge-on-read (MOR). It also supports ACID transactions through snapshot versions, faster queries through statistics and sorting, and flexible schema changes. Iceberg manages metadata that traditional formats like Parquet do not, enabling these new capabilities. It is useful for workloads that require updating or filtering data at a granular record level, managing data history through versions, or making frequent schema changes.
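A minimal sketch of these capabilities through Spark SQL follows, assuming Spark 3.x with the Iceberg runtime on the classpath; the catalog name, warehouse path, and table schema are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object IcebergSketch {
  def main(args: Array[String]): Unit = {
    // Assumes the iceberg-spark-runtime package is available; names are placeholders.
    val spark = SparkSession.builder
      .appName("iceberg-sketch")
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hadoop")
      .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, status STRING) USING iceberg")
    spark.sql("INSERT INTO demo.db.events VALUES (1, 'new'), (2, 'new')")

    // Row-level update without rewriting whole partitions.
    spark.sql("UPDATE demo.db.events SET status = 'done' WHERE id = 1")

    // Schema evolution is a metadata-only operation.
    spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (source STRING)")

    // Snapshot history is kept in Iceberg metadata tables.
    spark.sql("SELECT * FROM demo.db.events.snapshots").show()

    spark.stop()
  }
}
```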
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... - Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will gain a deeper understanding of Spark SQL and learn how to tune Spark SQL performance.
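As a small illustration of mixing sources in Spark SQL (with placeholder paths and connection details, not examples from the talk), the sketch below joins a Parquet dataset with a JDBC table and prints the query plan for tuning.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object SparkSqlSourcesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("spark-sql-sources-sketch").getOrCreate()

    // File-based source (path is hypothetical).
    val orders = spark.read.parquet("/data/orders")

    // Relational source over JDBC (connection details are placeholders).
    val customers = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/shop")
      .option("dbtable", "public.customers")
      .option("user", "reader")
      .option("password", "secret")
      .load()

    // A small dimension table can be broadcast to avoid a shuffle in the join.
    val joined = orders.join(broadcast(customers), "customer_id")

    // Inspect the logical and physical plans produced by Catalyst.
    joined.groupBy("country").count().explain(true)

    spark.stop()
  }
}
```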
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise - DataWorks Summit
On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architecture option for building your next-generation ETL data pipeline in near real time. But what does it look like to deploy and operationalize this in an enterprise production environment?
The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story? This session will cover the Royal Bank of Canada’s (RBC) journey of moving away from traditional ETL batch processing with Teradata towards using the Hadoop ecosystem for ingesting data. One of the first systems to leverage this new approach was the Event Standardization Service (ESS). This service provides a centralized “client event” ingestion point for the bank’s internal systems through either a web service or a daily batch text file feed. ESS allows downstream reporting applications and end users to query these centralized events.
We discuss the drivers and expected benefits of changing the existing event processing. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly when trying to adopt these technologies into the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier.
Speakers
Darryl Sutton, T4G, Principal Consultant
Kenneth Poon, RBC, Director, Data Engineering
The document discusses a presentation about modern data platforms on AWS. It provides a brief history of major big data releases from 2004 to present. It then discusses how data platforms need to scale exponentially to handle growing amounts of data and users. The remainder of the document discusses various AWS databases, analytics tools, data lakes, data movement services, and how they can be used to build flexible, scalable data platforms.
Democratizing Data Quality Through a Centralized Platform - Databricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark (an illustrative sketch follows this list)
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
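The sketch below only illustrates the style of check described above, using plain DataFrame operations; Zillow's internal platform and libraries are not public, so the dataset, columns, and expectations are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DataQualitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dq-sketch").getOrCreate()

    // Hypothetical dataset and expectations, for illustration only.
    val listings = spark.read.parquet("/data/listings")

    val total     = listings.count()
    val nullIds   = listings.filter(col("listing_id").isNull).count()
    val badPrices = listings.filter(col("price") <= 0).count()

    // Express expectations as simple pass/fail metrics that could be published
    // alongside the dataset for producers and consumers to inspect.
    val results = Seq(
      ("listing_id_not_null", nullIds == 0, nullIds),
      ("price_positive", badPrices == 0, badPrices)
    )
    results.foreach { case (check, passed, failures) =>
      println(s"$check passed=$passed failing_rows=$failures (of $total)")
    }

    // Fail the pipeline early so bad data never reaches downstream consumers.
    require(results.forall(_._2), "Data quality expectations not met")

    spark.stop()
  }
}
```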
07. Analytics & Reporting Requirements Template - Alan D. Duncan
This document template defines an outline structure for the clear and unambiguous definition of analytics & reporting outputs (including standard reports, ad hoc queries, Business Intelligence, analytical models etc).
Agile Data Engineering: Introduction to Data Vault 2.0 (2018) - Kent Graziano
(updated slides used for North Texas DAMA meetup Oct 2018) As we move more and more towards the need for everyone to do Agile Data Warehousing, we need a data modeling method that can be agile with us. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for over 15 years and is now growing in popularity. The purpose of this presentation is to provide attendees with an introduction to the components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics:
• What the basic components of a DV model are
• How to build, and design structures incrementally, without constant refactoring
This document outlines a playbook for implementing a data governance program. It begins with an introduction to data governance, discussing why data matters for organizations and defining key concepts. It then provides guidance on understanding business drivers to ensure the program aligns with strategic objectives. The playbook describes assessing the current state, developing a roadmap, defining the scope of key data, establishing governance models, policies and standards, and processes. It aims to help clients establish an effective enterprise-wide data governance program.
Neo4j – The Fastest Path to Scalable Real-Time Analytics - Neo4j
The document discusses how graph databases like Neo4j can enable real-time analytics at massive scale by leveraging relationships in data. It notes that data is growing exponentially but traditional databases can't efficiently analyze relationships. Neo4j natively stores and queries relationships to allow analytics 1000x faster. The document advocates that graphs will form the foundation of modern data and analytics by enhancing machine learning models and enabling outcomes like building intelligent applications faster, gaining deeper insights, and scaling limitlessly without compromising data.
Cost Efficiency Strategies for Managed Apache Spark Service - Databricks
This document discusses cost efficiency strategies for managed Apache Spark services. It begins by outlining the motivations for focusing on costs and introduces tools for cost analysis and optimization. It then describes the Azure Databricks platform and how it can be used to run Spark workloads more cost efficiently compared to infrastructure as a service options. The document details various Databricks pricing plans and units. Finally, it provides several cost optimization strategies, such as pre-purchasing plans, selecting efficient runtimes and frameworks, avoiding unnecessary storage, setting spending limits, and enabling auto-scaling.
Making Data Timelier and More Reliable with Lakehouse Technology - Matei Zaharia
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
Estimating the Total Costs of Your Cloud Analytics Platform - DATAVERSITY
Organizations today need a broad set of enterprise data cloud services with key data functionality to modernize applications and utilize machine learning. They need a platform designed to address multi-faceted needs by offering multi-function Data Management and analytics to solve the enterprise’s most pressing data and analytic challenges in a streamlined fashion. They need a worry-free experience with the architecture and its components.
A complete machine learning infrastructure cost for the first modern use case at a midsize to large enterprise will be anywhere from $2M to $14M. Get this data point as you take the next steps on your journey.
Learn how to accurately scope analytics migrations that come in on time and on budget. See the recording and download this deck: https://senturus.com/resources/prepare-bi-migration/
Senturus offers a full spectrum of services for business analytics. Our Knowledge Center has hundreds of free live and recorded webinars, blog posts, demos and unbiased product reviews available on our website at: https://senturus.com/resources/
AWS re:Invent 2016: Strategic Planning for Long-Term Data Archiving with Amaz... - Amazon Web Services
Without careful planning, data management can quickly turn complex with a runaway cost structure. Enterprise customers are turning to the cloud to solve long-term data archive needs such as reliability, compliance, and agility while optimizing the overall cost. Come to this session and hear how AWS customers are using Amazon Glacier to simplify their archiving strategy. Learn how customers architect their cloud archiving applications and share integration to streamline their organization's data management and establish successful IT best practices.
A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Along with the Hive Metastore, these table formats try to solve problems that have long stood in traditional data lakes, with declared features like ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
Building End-to-End Delta Pipelines on GCP - Databricks
Delta has been powering many production pipelines at scale in the data and AI space since its introduction a few years ago.
Built on open standards, Delta provides data reliability and enhances storage and query performance to support big data use cases (both batch and streaming), fast interactive queries for BI, and machine learning. Delta has matured over the past couple of years on both AWS and Azure and has become the de facto standard for organizations building their data and AI pipelines.
In today’s talk, we will explore building end-to-end pipelines on the Google Cloud Platform (GCP). Through presentation, code examples and notebooks, we will build the Delta pipeline from ingest to consumption using our Delta Bronze-Silver-Gold architecture pattern and show examples of consuming the Delta files using the BigQuery connector.
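A condensed sketch of such a Bronze-Silver-Gold flow is shown below; it assumes the Delta Lake and spark-bigquery connector packages plus GCS access are available, and the bucket, dataset, and table names are placeholders rather than the notebooks used in the talk.

```scala
import org.apache.spark.sql.SparkSession

object DeltaGcpSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("delta-gcp-sketch").getOrCreate()

    // Bronze: raw ingest landed as-is.
    val raw = spark.read.json("gs://my-bucket/raw/orders/")
    raw.write.format("delta").mode("append").save("gs://my-bucket/bronze/orders")

    // Silver: cleaned and conformed records.
    val bronze = spark.read.format("delta").load("gs://my-bucket/bronze/orders")
    val silver = bronze.filter("order_id IS NOT NULL").dropDuplicates("order_id")
    silver.write.format("delta").mode("overwrite").save("gs://my-bucket/silver/orders")

    // Gold: business-level aggregate, also pushed to BigQuery for consumption.
    val gold = silver.groupBy("country").count()
    gold.write.format("delta").mode("overwrite").save("gs://my-bucket/gold/orders_by_country")
    gold.write.format("bigquery")
      .option("table", "my_project.analytics.orders_by_country")
      .option("temporaryGcsBucket", "my-bucket-tmp")
      .mode("overwrite")
      .save()

    spark.stop()
  }
}
```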
This document provides an overview and best practices for big data architectures. It discusses big data challenges and principles like building decoupled systems and using the right tool for the job. It outlines different data types (hot, warm, cold) and storage technologies like stream processing, databases, search, and file storage. It also covers processing frameworks and reference architectures using patterns like materialized views and immutable logs. Finally, it provides a customer story about a telecommunications company transforming to real-time analytics.
Parquet performance tuning: the missing guide - Ryan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
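As a rough sketch of the knobs mentioned above (values are placeholders, not the talk's recommendations), these Parquet writer properties can be set on the Hadoop configuration before writing sorted data:

```scala
import org.apache.spark.sql.SparkSession

object ParquetTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("parquet-tuning-sketch").getOrCreate()
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    // Parquet writer properties read from the Hadoop configuration; values are illustrative.
    hadoopConf.set("parquet.block.size", (64 * 1024 * 1024).toString)          // smaller row groups
    hadoopConf.set("parquet.dictionary.page.size", (4 * 1024 * 1024).toString) // larger dictionary limit, less fallback
    hadoopConf.set("parquet.enable.dictionary", "true")

    // Sorting the filtered column keeps values clustered so min/max statistics
    // and dictionary filtering can skip row groups and pages on read.
    spark.read.parquet("/data/events")
      .sort("event_type")
      .write.mode("overwrite")
      .parquet("/data/events_tuned")

    spark.stop()
  }
}
```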
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... - Databricks
This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming and database services. The talk is aimed at developers, DBAs, service managers and members of the Spark community who are using and/or investigating “Big Data” solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.
Luca Canali presented on using flame graphs to investigate performance improvements in Spark 2.0 over Spark 1.6 for a CPU-intensive workload. Flame graphs of the Spark 1.6 and 2.0 executions showed Spark 2.0 spending less time in core Spark functions and more time in whole stage code generation functions, indicating improved optimizations. Additional tools like Linux perf confirmed Spark 2.0 utilized CPU and memory throughput better. The presentation demonstrated how flame graphs and other profiling tools can help pinpoint performance bottlenecks and understand the impact of changes like Spark 2.0's code generation optimizations.
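A toy version of that comparison might look like the following; the row count and configuration toggling are illustrative, not Luca Canali's actual test harness.

```scala
import org.apache.spark.sql.SparkSession

object CodegenComparisonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("codegen-sketch").getOrCreate()

    // A CPU-bound aggregation similar in spirit to the workload described above;
    // the row count is a placeholder you would scale to your hardware.
    def run(): Unit = {
      val t0 = System.nanoTime()
      spark.range(0L, 500L * 1000 * 1000)
        .selectExpr("sum(id * 2) as total")
        .collect()
      println(s"elapsed: ${(System.nanoTime() - t0) / 1e9} s")
    }

    // Spark 2.x enables whole-stage code generation by default; toggling it off
    // makes the effect visible in both runtime and in flame graphs.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    run()
    spark.conf.set("spark.sql.codegen.wholeStage", "true")
    run()

    // Whole-stage codegen operators show up with a leading '*' in the physical plan.
    spark.range(0L, 1000L).selectExpr("sum(id * 2)").explain()

    spark.stop()
  }
}
```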
DUG'20: 02 - Accelerating Apache Spark with DAOS on Aurora - Andrey Kudryavtsev
This document summarizes accelerating Apache Spark with DAOS (Distributed Asynchronous Object Storage) on Aurora. It describes using DAOS as a Hadoop filesystem for Spark input/output storage and as a shuffle data store. It shows how the DAOS Hadoop filesystem delivers similar throughput to DAOS DFS. It also introduces a DAOS object-based Spark shuffle manager that improves shuffle read throughput by up to 1.5x compared to using the local filesystem, especially for smaller shuffle blocks. Future plans include optimizing the DAOS Spark shuffle manager using async APIs and simplifying user configuration.
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... - inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video: https://youtu.be/iLTYkTandEA
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R for distributed tasks including SQL, streaming, and machine learning. Spark improves on MapReduce by keeping data in-memory, allowing iterative algorithms to run faster than disk-based approaches. Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure, acting as a fault-tolerant collection of elements that can be operated on in parallel.
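A minimal RDD example in Scala, illustrating the ideas above with made-up data:

```scala
import org.apache.spark.sql.SparkSession

object RddBasicsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdd-basics-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // An RDD is a fault-tolerant, partitioned collection operated on in parallel.
    val numbers = sc.parallelize(1 to 1000, numSlices = 8)

    // Transformations are lazy; the reduce action triggers the computation.
    val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
    println(s"sum of squares: $sumOfSquares")

    // Iterative algorithms benefit from caching intermediate data in memory.
    val evens = numbers.filter(_ % 2 == 0).cache()
    println(s"even count: ${evens.count()}, even sum: ${evens.sum()}")

    spark.stop()
  }
}
```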
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... - Agile Testing Alliance
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing, by Sampat Kumar from Harman. The presentation was given at #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.
Scaling Spark Workloads on YARN - Boulder/Denver July 2015 - Mac Moore
Hortonworks Presentation at The Boulder/Denver BigData Meetup on July 22nd, 2015. Topic: Scaling Spark Workloads on YARN. Spark as a workload in a multi-tenant Hadoop infrastructure, scaling, cloud deployment, tuning.
This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points:
- SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes.
- Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL.
- The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and BI integration help meet requirements for timely processing and quick responses (a configuration sketch follows below).
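A hypothetical configuration sketch for the dynamic-allocation piece is shown below; the executor bounds and table name are placeholders, not SKT's production settings.

```scala
import org.apache.spark.sql.SparkSession

object DynamicAllocationSketch {
  def main(args: Array[String]): Unit = {
    // These properties are normally set in spark-defaults.conf or on spark-submit;
    // the values below are placeholders for illustration.
    val spark = SparkSession.builder
      .appName("dynamic-allocation-sketch")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true") // external shuffle service required on YARN
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "200")
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .getOrCreate()

    // Ad-hoc SQL over a shared warehouse table (table name is hypothetical);
    // executors are acquired while the query runs and released when idle.
    spark.sql("SELECT cell_id, count(*) AS events FROM network_events GROUP BY cell_id").show()

    spark.stop()
  }
}
```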
In order to provide prompt results and efficiently handle data-intensive workloads, big data applications execute their jobs on compute slots across large clusters. For optimal performance, these applications should also be as close as possible to the data they use. Data-aware scheduling is the way to achieve that optimization and can conveniently be set up using Kubernetes. We'll present two different use cases: first, we'll show how big data applications like Hadoop and Spark can use their native HDFS protocol for data-aware scheduling; second, we'll demonstrate an efficient way to write a data-aware scheduler for Kubernetes that satisfies not just your application's requirements but also keeps your admins happy. As a bonus, it also allows us to run data-aware scheduling on applications other than big data.
Event: Kubernetes Meetup Rhein-Neckar, 18.10.2017
Speaker: Johannes M. Scheuermann
More tech talks: https://www.inovex.de/de/content-pool/vortraege/
Tech articles on our blog: https://www.inovex.de/blog/
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves its lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
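In the same spirit, a basic Spark program in Scala might look like this (the input path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("word-count-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Classic word count over a text file.
    val counts = sc.textFile("/tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _) // a shuffle stage in the DAG

    counts.take(10).foreach { case (word, n) => println(s"$word\t$n") }
    spark.stop()
  }
}
```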
The Apache Spark config behind the industry's first 100TB Spark SQL benchmark - Lenovo Data Center
Some configurations deserve their own SlideShare entry: this is one of them. When the industry's first 100TB Spark SQL benchmark was achieved, the media took notice. For good reason.
Intel, Mellanox, Lenovo and IBM came together to investigate a topology that leveraged advances in CPU, memory, storage and networking to assess the readiness of Spark SQL to harness new capabilities -- and speeds.
xPatterns is a big data analytics platform as a service that enables rapid development of enterprise-grade analytical applications. It provides tools, API sets and a management console for building an ELT pipeline with data monitoring and quality gates, a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation, tools for exporting data to NoSQL and SolrCloud clusters for real-time access through low-latency/high-throughput APIs, as well as dashboard and visualization APIs/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications built on top of xPatterns for our largest customer, which runs xPatterns in production on top of a data warehouse of several hundred TB of medical, pharmacy and lab records, amounting to tens of billions of records. We will showcase the xPatterns components, in the form of APIs and tools, employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS stack of Spark, Shark, Mesos and Tachyon, with lessons learned and demos.
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai - Databricks
At Linkedin, we have thousands of Hadoop and Spark users ranging from amateurs to experts who run a variety of jobs on our huge 2000-plus node clusters. In just a few years, the number of Hadoop and Spark jobs have grown from hundreds to thousands. With this ever increasing number of users and jobs, it becomes very crucial to have an efficient way to find answers to frequently asked questions like:
1) Why is my job running slow?
2) Why did my job get killed?
3) Can you send me an alert when my job is about to fail or miss SLA?
4) Do we have enough resources on the Hadoop cluster?
Having this information available helps with quicker debugging, alerting on anomalies, performing root cause analysis (RCA), identifying workload patterns, and capacity planning. To address this problem, we at LinkedIn have built a Unified Grid Metrics Platform that captures and stores current and historical job metrics. In our experience debugging and tuning jobs and interacting with our users, we have learned a lot of lessons and have been integrating ideas and solutions into this system. For example, we have learned that capturing and storing the complete set of metrics and its history, though fascinating, is actually rarely useful, just like the verbose logs in Spark. We have come up with derived metrics and a curated list of metrics which we track very closely at LinkedIn.
In this talk, we will discuss the architecture of how we built this platform for both Hadoop and Spark along with the huge challenges in collecting all the standard, derived and custom user metrics in real-time. We will see how this system allows users to build reporting dashboards, perform trend analysis, dimension analysis and view correlated metrics together.
This introductory workshop is aimed at data analysts and data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts and you'll work through labs for Spark SQL and DataFrames in Databricks Community Edition (an illustrative sketch follows the topic list below).
Toward the end, you'll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect and how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
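For a flavor of what the labs cover, here is a minimal sketch contrasting the DataFrame API with the equivalent Spark SQL; the tiny in-memory dataset is a stand-in for the workshop's lab data.

```scala
import org.apache.spark.sql.SparkSession

object SqlAndDataFramesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sql-dataframes-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // A tiny in-memory dataset standing in for the lab data.
    val people = Seq(("Alice", "NYC", 34), ("Bob", "SF", 29), ("Carol", "NYC", 41))
      .toDF("name", "city", "age")

    // DataFrame API ...
    people.groupBy("city").avg("age").show()

    // ... and the equivalent Spark SQL over a temporary view.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT city, avg(age) AS avg_age FROM people GROUP BY city").show()

    spark.stop()
  }
}
```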
Spark is a cluster computing framework designed to be fast, general-purpose, and able to handle a wide range of workloads including batch processing, iterative algorithms, interactive queries, and streaming. It is faster than Hadoop for interactive queries and complex applications by running computations in-memory when possible. Spark also simplifies combining different processing types through a single engine. It offers APIs in Java, Python, Scala and SQL and integrates closely with other big data tools like Hadoop. Spark is commonly used for interactive queries on large datasets, streaming data processing, and machine learning tasks.
Apache Spark presentation at HasGeek FifthElephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
This document outlines the agenda and content for a presentation on xPatterns, a tool that provides APIs and tools for ingesting, transforming, querying and exporting large datasets on Apache Spark, Shark, Tachyon and Mesos. The presentation demonstrates how xPatterns has evolved its infrastructure to leverage these big data technologies for improved performance, including distributed data ingestion, transformation APIs, an interactive Shark query server, and exporting data to NoSQL databases. It also provides examples of how xPatterns has been used to build applications on large healthcare datasets.
Similar to Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results (20)
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced, which users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache Ratis - DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), a guarantee HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi - DataWorks Summit
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... - DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and because its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation is the fact that HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified cause and resolution action from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix - DataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi - DataWorks Summit
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase: Troubleshooting and Supportability Improvements - DataWorks Summit
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine - DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... - DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
Extending Twitter's Data Platform to Google Cloud - DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries to help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we dive into in depth in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi - DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger - DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... - DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images describing the possible ways that a retail store of the near future could operate. Identifying various storefront situations by having a deep learning system attached to a camera stream. Such things as; identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark - DataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
What is an RPA CoE? Session 1 – CoE Vision - DianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
Essentials of Automations: Exploring Attributes & Automation Parameters - Safe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
Northern Engraving | Nameplate Manufacturing Process - 2024 - Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels - Northern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc... - DanBrown980551
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf - leebarnesutopia
So… you want to become a Test Automation Engineer (or hire and develop one)? While there's quite a bit of information available about important technical and tool skills to master, there's not enough discussion around the path to becoming an effective Test Automation Engineer who knows how to add VALUE. In my experience this has led to a proliferation of engineers who are proficient with tools and building frameworks but have skill and knowledge gaps, especially in software testing, that reduce the value they deliver with test automation.
In this talk, Lee will share his lessons learned from over 30 years of working with, and mentoring, hundreds of Test Automation Engineers. Whether you’re looking to get started in test automation or just want to improve your trade, this talk will give you a solid foundation and roadmap for ensuring your test automation efforts continuously add value. This talk is equally valuable for both aspiring Test Automation Engineers and those managing them! All attendees will take away a set of key foundational knowledge and a high-level learning path for leveling up test automation skills and ensuring they add value to their organizations.
"NATO Hackathon Winner: AI-Powered Drug Search", Taras KlobaFwdays
This is a session that details how PostgreSQL's features and Azure AI Services can be effectively used to significantly enhance the search functionality in any application.
In this session, we'll share insights on how we used PostgreSQL to facilitate precise searches across multiple fields in our mobile application. The techniques include using LIKE and ILIKE operators and integrating a trigram-based search to handle potential misspellings, thereby increasing the search accuracy.
We'll also discuss how the azure_ai extension on PostgreSQL databases in Azure and Azure AI Services were utilized to create vectors from user input, a feature beneficial when users wish to find specific items based on text prompts. While our application's case study involves a drug search, the techniques and principles shared in this session can be adapted to improve search functionality in a wide range of applications. Join us to learn how PostgreSQL and Azure AI can be harnessed to enhance your application's search capability.
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from one minute of downtime are $5 to $10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx - Sunil Jagani
Discover how AI is transforming the workplace and learn strategies for reskilling and upskilling employees to stay ahead. This comprehensive guide covers the impact of AI on jobs, essential skills for the future, and successful case studies from industry leaders. Embrace AI-driven changes, foster continuous learning, and build a future-ready workforce.
Read More - https://bit.ly/3VKly70
In our second session, we shall learn all about the main features and fundamentals of UiPath Studio that enable us to use the building blocks for any automation project.
📕 Detailed agenda:
Variables and Datatypes
Workflow Layouts
Arguments
Control Flows and Loops
Conditional Statements
💻 Extra training through UiPath Academy:
Variables, Constants, and Arguments in Studio
Control Flow in Studio
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
1. Uncovering an Apache Spark 2 Benchmark -
Configuration, Tuning and Test Results
• Mark Lochbihler, Hortonworks - Principal Architect
• Viplava Madasu, HPE - Big Data Systems Engineer
San Jose, California
JUNE 17–21, 2018
Tuesday, June 19
4:00 PM - 4:40 PM
Executive Ballroom
210C/D/G/H
2. Today’s Agenda
• What’s New with Spark 2.x – Mark
• Spark Architecture
• Spark on YARN
• What’s New
• Spark 2.x Benchmark - Viplava
• What was Benchmarked
• Configuration and Tuning
• Infrastructure Used
• Results
• Questions / More Info – Mark and Viplava
3. Apache Spark
Apache Spark is a fast general-purpose engine for large-scale data
processing. Spark was developed in response to limitations in Hadoop’s
two-stage disk-based MapReduce processing framework.
Orchestration:
Spark’s standalone cluster manager, Apache Mesos,
or Hadoop YARN
4. Spark on Hadoop YARN
YARN has the concept of labels for groupings of Hadoop Worker nodes.
Spark on YARN is an optimal way to schedule and run Spark jobs on a Hadoop cluster alongside a variety of
other data-processing frameworks, leveraging existing clusters using queue placement policies, and enabling
security by running on Kerberos-enabled clusters.
Diagram: Client Mode vs. Cluster Mode on YARN, showing the Client, Executors, Application Master, and Spark Driver in each. In client mode the Spark Driver runs in the client process; in cluster mode it runs inside the YARN Application Master.
5. Spark 2.x vs Spark 1.x
Apache Spark 2.x is a major release update of Spark 1.x and includes
significant updates in the following areas:
• API usability
• SQL 2003 support
• Performance improvements
• Structured streaming
• R UDF support
• Operational improvements
6. Spark 2.x – New and Updated APIs
Including:
• Unifying DataFrame and Dataset APIs providing type safety for
DataFrames
• New SparkSession API with a new entry point that replaces the old
SQLContext and HiveContext for DataFrame and Dataset APIs
• New streamlined configuration API for SparkSession
• New improved Aggregator API for typed aggregation in Datasets
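As a hedged illustration of the new SparkSession entry point listed above (our own sketch, not taken from the deck; the application name, path, and configuration values are illustrative), a minimal Scala example might look like this:

import org.apache.spark.sql.SparkSession

// One unified entry point replaces SQLContext and HiveContext in Spark 2.x.
val spark = SparkSession.builder()
  .appName("spark2-entrypoint-example")
  .enableHiveSupport()                              // optional: Hive catalog support, formerly via HiveContext
  .config("spark.sql.shuffle.partitions", "400")    // the streamlined configuration API
  .getOrCreate()

// DataFrame and Dataset APIs hang off the same session.
val df = spark.read.json("/tmp/events.json")        // hypothetical path
df.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()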
7. Spark 2.x – Improved SQL Functionality
• ANSI SQL 2003 support
• Enables running all 99 TPC-DS queries
• A native SQL parser that supports both ANSI-SQL as well as Hive QL
• Native DDL command implementations
• Subquery support
• Native CSV data source
• Off-heap memory management for both caching and runtime
execution
• Hive-style bucketing support
8. Spark 2.x – Performance Improvements
• By implementing a new technique called “whole stage code
generation”, Spark 2.x improves the performance 2-10 times for
common operators in SQL and DataFrames.
• Other performance improvements include:
• Improved Parquet scan throughput through vectorization
• Improved ORC performance
• Many improvements in the Catalyst query optimizer for common workloads
• Improved window function performance via native implementations
for all window functions.
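One way to see whole-stage code generation at work is to inspect the physical plan of a query; in Spark 2.x, operators folded into a generated stage are marked in the explain output. A small sketch of our own (not from the slides):

// Operators covered by whole-stage code generation appear under a WholeStageCodegen
// node (shown with a leading '*' in Spark 2.x explain output).
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("codegen-example").getOrCreate()

val df = spark.range(0, 1000000L)
  .selectExpr("id", "id % 10 AS bucket")
  .groupBy("bucket")
  .count()

df.explain()   // prints the physical plan; code-generated operators are starred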
9. Spark 2.x – Spark Machine Learning API
• Spark 2.x replaces the RDD-based APIs in the spark.mllib package (put in
maintenance mode) with the DataFrame-based API in the spark.ml
package.
• New features in the Spark 2.x Machine Learning API include:
• ML persistence to support saving and loading ML models and Pipelines
• New MLlib APIs in R for generalized linear models
• Naive Bayes
• K-Means Clustering
• Survival regression
• New MLlib APIs in Python for
• LDA, Gaussian Mixture Model, Generalized Linear Regression, etc.
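As a quick, hedged sketch of the DataFrame-based spark.ml API and the ML persistence feature mentioned above (the input DataFrame `training`, its column names, and the save path are purely illustrative assumptions):

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Assemble feature columns and fit a simple pipeline on a hypothetical DataFrame `training`
// with numeric columns f1, f2 and a label column.
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(assembler, lr))

val model = pipeline.fit(training)

// ML persistence (new in Spark 2.x): save and reload the fitted pipeline.
model.write.overwrite().save("/tmp/lr-pipeline-model")
val reloaded = PipelineModel.load("/tmp/lr-pipeline-model")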
10. Spark 2.x – Spark Streaming
• Spark 2.x introduced a new high-level streaming API, called
Structured Streaming, built on top of Spark SQL and the Catalyst
optimizer.
• Structured Streaming enables users to program against streaming
sources and sinks using the same DataFrame/Dataset API as in static
data sources, leveraging the Catalyst optimizer to automatically
incrementalize the query plans.
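A minimal Structured Streaming sketch of our own (assuming a socket source on localhost:9999, which is just a convenient test input) showing that the streaming query uses the same DataFrame/Dataset API as a batch query:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-streaming-example").getOrCreate()
import spark.implicits._

// Read a stream of lines, aggregate word counts, and write the running result to the console.
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()

val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = counts.writeStream
  .outputMode("complete")      // Catalyst incrementalizes the aggregation under the hood
  .format("console")
  .start()

query.awaitTermination()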
11. Hortonworks Data Platform 2.6.5 – Just Released
HDP 2.6.5 / 3.0 includes Apache Spark 2.3
ORC/Parquet Feature Parity
– Spark extends its vectorized read capability to ORC data sources.
– Structured streaming officially supports ORC data source with API and documentation
Python Pandas UDFs, with good performance and ease of use for Pandas users; this feature supports
financial analysis use cases.
Structured Streaming now supports stream-stream joins.
Structured Streaming can now reach millisecond latency (alpha): the new continuous processing mode
provides the best performance by minimizing latency without idle waiting.
12. Evaluation of Spark SQL with Spark 2.x versus Spark 1.6
• Benchmark Performed
• Hive testbench, which is similar to TPC-DS benchmark
• Tuning for the benchmark
13. Why Cluster tuning matters
• Spark/Hadoop default configurations are not optimal for most enterprise
applications
• Large number of configuration parameters
• Tuning cluster will benefit all the applications
• Can further tune job level configuration
• More important if using disaggregated compute/storage layers as in HPE
Reference Architecture
• Useful for cloud too
14. Factors to consider for Spark performance tuning
• Hardware
• CPU, Memory, Storage systems, Local disks, Network
• Hadoop configuration
• HDFS
• YARN
• Spark configuration
• Executor cores, Executor memory, Shuffle partitions, Compression etc.
15. General Hardware Guidelines
• Sizing hardware for Spark depends on the use case, but Spark benefits from
• More CPU cores
• More memory
• Flash storage for temporary storage
• Faster network fabric
• CPU Cores
• Spark scales well to tens of CPU cores per machine
• Most Spark applications are CPU bound, so at least 8-16 cores per machine.
• Memory
• Spark can make use of hundreds of gigabytes of memory per machine
• Allocate only at most 75% of the memory for Spark; leave the rest for the operating
system and buffer cache.
• Storage tab of Spark’s monitoring UI will help.
• Max 200GB per executor.
16. General Hardware Guidelines …
• Network
• For Group-By, Reduce-By, and SQL join operations, network performance
becomes important due to the Shuffles involved
• 10 Gigabit network is the recommended choice
• Local Disks
• Spark uses local disks to store data that doesn’t fit in RAM, as well as to preserve
intermediate output between stages
• SSDs are recommended
• Mount disks with noatime option to reduce unnecessary writes
18. Useful HDFS configuration settings
• Increase the dfs.blocksize value to allow more data to be processed by
each map task
• Also reduces NameNode memory consumption
• dfs.blocksize 256/512MB
• Increase the dfs.namenode.handler.count value to better manage
multiple HDFS operations from multiple clients
• dfs.namenode.handler.count 100
• To eliminate timeout exceptions (java.io.IOException: Unable to close file because the last block does not have enough number of replicas), …
19. Useful YARN configuration settings
• YARN is the popular cluster manager for Spark on Hadoop, so it is
important that YARN and Spark configurations are tuned in tandem.
• Settings of Spark executor memory and executor cores result in
allocation requests to YARN with the same values and YARN should be
configured to accommodate the desired Spark settings
• Amount of physical memory that can be allocated for containers per
node
• yarn.nodemanager.resource.memory-mb 384 GiB
• Amount of vcores available on a compute node that can be allocated for
containers
• yarn.nodemanager.resource.cpu-vcores 48
20. YARN tuning …
• Number of YARN containers depends on the nature of the workload
• Assuming total of 384 GiB on each node, a workload that needs 24 GiB containers
will result in 16 total containers
• Assuming 12 worker nodes, number of 24 GiB containers = 16 * 12 – 1 = 191
• One container per YARN application master
• General guideline is to configure containers in a way that maximizes the
utilization of the memory and vcores on each node in the cluster
21. YARN tuning …
• Location of YARN intermediate files on the compute nodes
• yarn.nodemanager.local-dirs /data1/hadoop/yarn/local, /data2/hadoop/yarn/local,
/data3/hadoop/yarn/local, /data4/hadoop/yarn/local
• Setting of spark.local.dir is ignored for YARN cluster mode
• The node-locality-delay specifies how many scheduling intervals to let
pass attempting to find a node local slot to run on prior to searching for a
rack local slot
• Important for small jobs that do not have a large number of tasks as it will better
utilize the compute nodes
• yarn.scheduler.capacity.node-locality-delay 1
22. Tuning Spark – Executor cores
• Unlike Hadoop MapReduce where each map or reduce task is always started in a new
process, Spark can efficiently use process threads (cores) to distribute task processing
• Results in a need to tune Spark executors with respect to the amount of memory
and number of cores each executor can use
• Has to work within the configuration boundaries of YARN
• Number of cores per executor can be controlled by
• the configuration setting spark.executor.cores
• the --executor-cores option of the spark-submit command
• The default is 1 for Spark on YARN
23. Tuning Spark – Executor cores
• Simplest but inefficient approach would be to configure one executor per core and divide the memory
equally among the number of executors
• Since each partition cannot be computed on more than one executor, partition sizes are limited by the small
per-executor memory, which causes memory problems or spilling to disk during shuffles
• If the executors have only one core, then at most one task can run in each executor, which throws
away the benefits of broadcast variables, which have to be sent to each executor once.
• Each executor has some memory overhead (minimum of 384MB) – so, if we have many small
executors, results in lot of memory overhead
• Giving many cores to each executor also has issues
• GC issues - since a larger JVM heap will delay the time until a GC event is triggered resulting in
larger GC pauses
• Results in poor HDFS throughput because of handling many concurrent threads
• spark.executor.cores – experiment and set this based on your workloads. We found 9 was
the right setting for this configuration and bench test in our lab.
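To make the trade-off concrete, here is a hedged sizing sketch of our own using the per-node YARN figures from the earlier slides (48 vcores, 384 GiB); the exact numbers for any other cluster will differ, and the 90% heap factor is an illustrative assumption:

// Per-node resources advertised to YARN (from the earlier YARN slides).
val vcoresPerNode = 48
val yarnMemoryPerNodeGiB = 384

// With spark.executor.cores = 9 (the value that worked best in this lab),
// roughly 5 executors fit per node (48 / 9 = 5, remainder left for overhead and the AM).
val executorCores = 9
val executorsPerNode = vcoresPerNode / executorCores                 // = 5

// Split the YARN memory across those executors, then keep ~10% aside for
// spark.yarn.executor.memoryOverhead.
val memoryPerExecutorGiB = yarnMemoryPerNodeGiB / executorsPerNode   // = 76 GiB container budget
val executorMemoryGiB = (memoryPerExecutorGiB * 0.90).toInt          // about 68 GiB for spark.executor.memory

println(s"--executor-cores $executorCores --executor-memory ${executorMemoryGiB}g")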
24. Tuning Spark – Memory
• Memory for each Spark job is application specific
• Configure Executor memory in proportion to the number of partitions and cores per
executor
• Divide the total amount of memory on each node by the number of executors on the node
• Should be less than the maximum YARN container size - so YARN maximum container size may
need to be adjusted accordingly
• Configuration setting spark.executor.memory or the --executor-memory option of the spark-
submit command
• JVM runs into issues with very large heaps (above 80GB).
• Spark Driver memory
• If the driver collects too much data, the job may run into OOM errors.
• Increase the driver memory using spark.driver.memory; spark.driver.maxResultSize caps the total size of
results collected to the driver and may also need to be raised.
25. Spark 2.x – Memory Model
• Each executor has memory overhead for things like VM
overheads, interned strings, other native overheads
• spark.yarn.executor.memoryOverhead
• Default value is spark.executor.memory * 0.10, with minimum of
384MB.
• Prior to Spark 1.6, separate tuning was needed for
storage (RDD) memory and execution/shuffle memory
via spark.storage.memoryFraction and
spark.shuffle.memoryFraction
• Spark 1.6 introduced a new “UnifiedMemoryManager”
• When no Storage memory is used, Execution can acquire all the
available memory and vice versa
• As a result, applications that do not use caching can use the
entire space for execution, obviating unnecessary disk spills.
• Applications that do use caching can reserve a minimum storage
space where their data blocks are immune to being evicted
• spark.memory.storageFraction tunable, but good out-of-the-box
performance
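As a hedged illustration of how the unified memory model carves up an executor (using the Spark 2.x defaults as we understand them: spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5, a fixed 300 MB reservation, and a hypothetical 64 GiB heap; none of these figures come from the slide):

// Rough breakdown for a hypothetical executor with spark.executor.memory = 64g.
val executorMemoryMB = 64 * 1024
val reservedMB       = 300                                           // fixed reservation
val memoryFraction   = 0.6                                           // spark.memory.fraction default in Spark 2.x
val storageFraction  = 0.5                                           // spark.memory.storageFraction default

val unifiedPoolMB  = ((executorMemoryMB - reservedMB) * memoryFraction).toLong
val storageFloorMB = (unifiedPoolMB * storageFraction).toLong        // storage region protected from eviction
val overheadMB     = math.max((executorMemoryMB * 0.10).toLong, 384) // spark.yarn.executor.memoryOverhead default

println(s"unified execution+storage pool: $unifiedPoolMB MB")
println(s"storage region floor:           $storageFloorMB MB")
println(s"off-heap YARN overhead:         $overheadMB MB (requested on top of the heap)")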
26. Tuning Spark – Shuffle partitions
• Spark SQL, by default, sets the number of reduce side
partitions to 200 when doing a shuffle for wide
transformations, e.g., groupByKey, reduceByKey,
sortByKey etc.
• Not optimal for many cases as it will use only 200 cores for
processing tasks after the shuffle
• For large datasets, this might result in shuffle block overflow
resulting in job failures
• The number of shuffle partitions should be at least
equal to the number of total executor cores or a
multiple of it in case of large data sets.
• spark.sql.shuffle.partitions setting
• Also, a prime number of partitions can help in terms of hash effectiveness.
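A small sketch of sizing spark.sql.shuffle.partitions from the total executor cores (the node, executor, and core counts below are illustrative only, reusing the earlier lab figures):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("shuffle-partitions-example")
  // e.g. 12 worker nodes * 5 executors * 9 cores = 540 total cores;
  // pick a multiple of that (or a nearby prime such as 1087 for very large datasets).
  .config("spark.sql.shuffle.partitions", "1080")
  .getOrCreate()

// The setting can also be changed at runtime for a specific workload:
spark.conf.set("spark.sql.shuffle.partitions", "540")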
27. Tuning Spark – Compression
• Using compression in Spark can improve performance in a meaningful
way as compression results in less disk I/O and network I/O
• Even though compressing the data results in some CPU cycles being
used, the performance improvements with compression outweigh the
CPU overhead when a large amount of data is involved
• Also compression results in reduced storage requirements for storing
data on disk, e.g., intermediate shuffle files
28. Tuning Spark – Compression
• spark.io.compression.codec setting to decide the codec
• three codecs provided: lz4, lzf, and snappy
• default codec is lz4
• Four main places where Spark makes use of compression
• Compress map output files during a shuffle operation using
spark.shuffle.compress setting (Default true)
• Compress data spilled during shuffles using spark.shuffle.spill.compress setting
(Default true)
• Compress broadcast variables before sending them using
spark.broadcast.compress setting (Default true)
• Compress serialized RDD partitions using spark.rdd.compress setting
(Default false)
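A hedged configuration sketch that simply makes the settings above explicit; switching the codec to snappy here is purely an example, not a recommendation (the deck's own results compare codecs later):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("compression-example")
  .config("spark.io.compression.codec", "snappy")     // lz4 is the default; lzf and snappy are also provided
  .config("spark.shuffle.compress", "true")           // compress map output files (default true)
  .config("spark.shuffle.spill.compress", "true")     // compress data spilled during shuffles (default true)
  .config("spark.broadcast.compress", "true")         // compress broadcast variables (default true)
  .config("spark.rdd.compress", "false")              // compress serialized RDD partitions (default false)
  .getOrCreate()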
29. Tuning Spark – Serialization type
• Serialization plays an important role in the performance of any distributed application
• Spark memory usage is greatly affected by storage level and serialization format
• By default, Spark serializes objects using Java Serializer which can work with any class that implements
java.io.Serializable interface
• For custom data types, Kryo Serialization is more compact and efficient than Java Serialization
• but user classes need to be explicitly registered with the Kryo Serializer
• spark.serializer org.apache.spark.serializer.KryoSerializer
• Spark SQL automatically uses Kryo serialization for DataFrames internally in Spark 2.x
• For customer applications that still use RDDs, Kryo Serialization should result in a significant
performance boost
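A brief sketch of enabling Kryo serialization for RDD-based code and registering custom classes; the Event case class is a hypothetical user type, not something from the benchmark:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Event(id: Long, payload: String)   // hypothetical user class

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event])) // user classes must be registered explicitly

val spark = SparkSession.builder().config(conf).getOrCreate()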
30. Tuning Spark – Other configuration settings
• When using ORC/parquet format for the data, Spark SQL can push the filter
down to ORC/parquet, thus avoiding large data transfer.
• spark.sql.orc.filterPushdown (Default false)
• spark.sql.parquet.filterPushdown (Default true)
• For large data sets, you may encounter various network timeouts. Can tune
different timeout values
• spark.core.connection.ack.wait.timeout
• spark.storage.blockManagerSlaveTimeoutMs
• spark.shuffle.io.connectionTimeout
• spark.rpc.askTimeout
• spark.rpc.lookupTimeout
• “Umbrella” setting for all these timeouts, spark.network.timeout (Default is
120 seconds). For 10TB dataset, this value should be something like 600
seconds.
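A final hedged sketch pulling these settings together for a large (multi-terabyte) run; the 600-second value mirrors the guidance above, while the table path and filter value are illustrative assumptions:

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("large-dataset-example")
  .config("spark.sql.orc.filterPushdown", "true")      // default false in Spark 2.x, worth enabling for ORC data
  .config("spark.sql.parquet.filterPushdown", "true")  // default true
  .config("spark.network.timeout", "600s")             // umbrella timeout covering the settings listed above
  .getOrCreate()

// Predicates on ORC/Parquet tables can then be pushed down to the file scan:
val sales = spark.read.orc("/data/tpcds/store_sales")  // hypothetical path
sales.filter("ss_sold_date_sk = 2452013").count()      // illustrative filter value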
31. HPE’s Elastic Platform for Big Data Analytics (EPA)
Modular building blocks of compute and storage optimized for modern workloads
Compute blocks: DL360, Apollo 2000, Apollo 6500 w/ NVIDIA GPU, Synergy
Storage blocks: Apollo 4200 (hot), DL380 (cold), Apollo 4510 (object), purpose-built
Network block: FlexFabric 5950/5940
32. HPE EPA - Single-Rack Reference Architecture for Spark 2.x
33. HPE EPA - Single-Rack Reference Architecture for Spark 2.x
34. HPE EPA - Multi-Rack configuration
Base Rack
• (1) DL360 Control Block – (1)
Management Node, (2) Head Nodes
• (8) Apollo 2000 Compute Blocks –
(32) XL170r Worker Nodes
• (10) Apollo 4200 Storage Blocks –
(10) Apollo 4200 Data Nodes
• (1) Network Block
• (1) Rack Block
Aggregation Rack
• (8) Apollo 2000 Compute Blocks – (32)
XL170r Worker Nodes
• (10) Apollo 4200 Storage Blocks – (10)
Apollo 4200 Data Nodes
• (1) Network Block
• (1) Aggregation Switch Block - (2) HPE
5950 32QSFP28
• (1) Rack Block
Expansion Rack
• (8) Apollo 2000 Compute Blocks –
(32) XL170r Worker Nodes
• (10) Apollo 4200 Storage Blocks –
(10) Apollo 4200 Data Nodes
• (1) Network Block
• (1) Rack Block
Expansion Rack
• (8) Apollo 2000 Compute Blocks –
(32) XL170r Worker Nodes
• (10) Apollo 4200 Storage Blocks –
(10) Apollo 4200 Data Nodes
• (1) Network Block
• (1) Rack Block
35. Spark 2.x - Effect of cores per executor on query performance
36. Spark 2.x – Effect of shuffle partitions on query performance
37. Spark 2.x – Effect of compression codec on query performance
38. Evaluation of Spark SQL with Spark 2.x versus Spark 1.6
• Hive testbench (similar to TPC-DS) with 1000 SF (1TB size) and 10000 SF
(10TB size) used for testing
• Hive testbench used to generate the data
• ORC format used for storing the data
• ANSI SQL compatibility
• Spark 2.x could run all Hive testbench queries whereas Spark 1.6 could run only
50 queries
• Spark SQL robustness
• With 10TB dataset size, Spark 2.x could finish all of the queries whereas Spark 1.6
could finish only about 40 queries
39. Spark 2.x performance improvements over Spark 1.6
- with 10000 SF (10TB)
40. Spark 2.x performance improvements over Spark 1.6
- with 10000 SF (10TB)
41. Spark 2.x - Scaling performance by adding Compute Nodes
without data rebalancing
43. Questions ?
Mark Lochbihler, Hortonworks - Principal Architect
mlochbihler@hortonworks.com
Viplava Madasu, HPE - Big Data Systems Engineer
viplava.madasu@hpe.com