Apache Spark is a general engine for processing data on a large scale, and employing it in a distributed environment to process large data sets is undeniably beneficial.
But what about a fast feedback loop while developing such an application with Apache Spark? Testing it on a cluster is essential, but that is not what most developers accustomed to a TDD workflow would like to do.
In the talk, Łukasz will share some tips on how to write unit and integration tests, and how Docker can be applied to test a Spark application on a local machine.
Examples will be presented within the ScalaTest framework, and they should be easy to grasp for people who know Scala or other JVM languages.
Optimising Geospatial Queries with Dynamic File Pruning (Databricks)
One of the most significant benefits provided by Databricks Delta is the ability to use z-ordering and dynamic file pruning to significantly reduce the amount of data that is retrieved from blob storage and therefore drastically improve query times, sometimes by an order of magnitude.
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake (Databricks)
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays those changes promptly to external storage such as Delta or Kudu for real-time OLAP. Implementing a robust CDC streaming pipeline raises many concerns, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and how to build it for a variety of databases with minimal code.
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Operating and Supporting Delta Lake in Production (Databricks)
The document discusses strategies for optimizing and managing metadata in Delta Lake. It provides an overview of optimize, auto-optimize, and optimize write strategies and how to choose the appropriate strategy based on factors like workload, data size, and cluster resources. It also discusses Delta Lake transaction logs, configurations like log retention duration, and tips for working with Delta Lake metadata.
The document compares three table formats for large scale data storage and analytics: ACID ORC, Apache Iceberg, and Delta Lake. ACID ORC provides ACID transactions for Hive tables stored as ORC files but has slow metadata operations. Iceberg supports multiple file formats, robust schema changes, and time travel capabilities but lacks commercial support. Delta Lake has great Spark integration, SQL merge syntax, and time travel with optimized compaction, but only supports Parquet files and multicluster writes on HDFS.
Containerized Stream Engine to Build Modern Delta Lake (Databricks)
Everything is changing by the day: your business, your analytics platform, and your data. Deriving real-time insights from this huge volume of data is key to survival, and a robust solution lets you operate at the speed of change.
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha... (Databricks)
The document discusses code optimization techniques in Spark SQL's Catalyst optimizer. It describes how function outlining can improve performance of generated Java code by splitting large routines into smaller ones. The document outlines a Spark SQL query optimization case study where outlining a 300+ line routine from Catalyst code generation improved query performance by up to 19% on a Power8 cluster. Overall, the document examines how function outlining and other code generation optimizations in Catalyst can help the Java JIT compiler better optimize Spark SQL queries.
Koalas: Making an Easy Transition from Pandas to Apache Spark (Databricks)
Koalas is an open-source project that aims at bridging the gap between big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python. pandas is the standard tool for data science and is typically the first step to explore and manipulate a data set, but it does not scale well to big data.
Deep Dive into the New Features of Apache Spark 3.1 (Databricks)
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.1 extends its scope with more than 1500 resolved JIRAs. We will talk about the exciting new developments in Apache Spark 3.1 as well as some other major initiatives coming in the future, and we want to share many of the more important changes with the community through examples and demos.
The following features are covered: SQL features for ANSI SQL compliance, new streaming features, Python usability improvements, and the performance enhancements and new tuning tricks in the query compiler.
Enhancements that will make your SQL database roar, SP1 edition, SQL Bits 2017 (Bob Ward)
This document provides information about various SQL Server features and editions. It includes a list of features available in each edition like row-level security, dynamic data masking, and in-memory OLTP. It also includes memory limits, MAXDOP settings, and pushdown capabilities for different editions. The document discusses lightweight query profiling improvements in SQL Server 2016 SP1 and provides details on predicate pushdown indicators in showplans.
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal (Databricks)
Ingesting data from a variety of sources like MySQL, Oracle, Kafka, Salesforce, BigQuery, S3, SaaS applications, OSS, etc., with billions of records into a data lake (for reporting, ad-hoc analytics, and ML jobs) with reliability, consistency, schema evolution support, and within the expected SLA has always been a challenging job. Ingestion also comes in different flavors, such as full ingestion and incremental ingestion with or without compaction/de-duplication and transformations, each with its own complexity of state management and performance. Not to mention dependency management, where hundreds or thousands of downstream jobs depend on this ingested data, so on-time data availability is of utmost importance. Most data teams end up creating ad-hoc ingestion pipelines written in different languages and technologies, which adds operational overhead, and the knowledge is mostly limited to a few people.
In this session, I will talk about how we leveraged Spark's DataFrame abstraction to create a generic ingestion platform capable of ingesting data from varied sources with reliability, consistency, automatic schema evolution, and transformation support. I will also discuss how we developed Spark-based data sanity checks as one of the core components of this platform to ensure 100% correctness of ingested data and auto-recovery when inconsistencies are found. This talk will also cover how Hive table creation and schema modification were part of this platform, providing read-time consistency without locking while Spark ingestion jobs were writing to the same Hive tables, and how we maintained different versions of ingested data to allow rollbacks if required and to let users go back in time and read a snapshot of the ingested data at that moment.
After this talk, one should be able to understand the challenges involved in ingesting data reliably from different sources and how one can leverage Spark's DataFrame abstraction to solve this in a unified way.
Brk3043 Azure SQL DB: intelligent cloud database for app developers - Wash DC (Bob Ward)
Make building and maintaining applications easier and more productive. With built-in intelligence that learns app patterns and adapts to maximize performance, reliability, and data protection, SQL Database is a cloud database built for developers. The session covers our most advanced features to-date including Threat Detection, auto-tuned performance and actionable recommendations across performance and security aspects. Case studies and live demos help you understand how choosing SQL Database will make a difference for your app and your company.
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook (Databricks)
Machine Learning feature engineering is one of the most critical workloads on Spark at Facebook and serves as a means of improving the quality of each of the prediction models we have in production. Over the last year, we’ve added several features in Spark core/SQL to add first class support for Feature Injection and Feature Reaping in Spark. Feature Injection is an important prerequisite to (offline) ML training where the base features are injected/aligned with new/experimental features, with the goal to improve model performance over time. From a query engine’s perspective, this can be thought of as a LEFT OUTER join between the base training table and the feature table which, if implemented naively, could get extremely expensive. As part of this work, we added native support for writing indexed/aligned tables in Spark, wherein IF the data in the base table and the injected feature can be aligned during writes, the join itself can be performed inexpensively.
This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar... (Spark Summit)
This document discusses Spark Streaming and how it can push throughput limits in a reactive way. It describes how Spark Streaming works by breaking streams into micro-batches and processing them through Spark. It also discusses how Spark Streaming can be made more reactive by incorporating principles from Reactive Streams, including composable back pressure. The document concludes by discussing challenges like data locality and providing resources for further information.
Composable Data Processing with Apache Spark (Databricks)
As the usage of Apache Spark continues to ramp up within the industry, a major challenge has been scaling our development. Too often we find that developers are re-implementing a similar set of cross-cutting concerns, sprinkled with some variance of use-case specific business logic as a concrete Spark App.
From HDFS to S3: Migrate Pinterest Apache Spark Clusters (Databricks)
The document discusses Pinterest migrating their Apache Spark clusters from HDFS to S3 storage. Some key points:
1) Migrating to S3 provided significantly better performance due to the higher IOPS of modern EC2 instances compared to their older HDFS nodes. Jobs saw 25-35% improvements on average.
2) S3 is eventually consistent while HDFS is strongly consistent, so they implemented the S3Committer to handle output consistency issues during job failures.
3) Metadata operations like file moves were very slow in S3, so they optimized jobs to reduce unnecessary moves using techniques like multipart uploads to S3.
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series (Amazon Web Services)
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and tune query and database performance.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
Solving low latency query over big data with Spark SQL (Julien Pierre)
This document provides an overview of client data, capabilities, and architecture for a data analytics platform. It discusses data size and query latency, processing and storage using Cosmos, SparkSQL and HDFS, a Mesos cluster architecture with Zookeeper, and interactive analytics using Zeppelin and Avocado notebooks. The platform aims to provide a unified environment for data ingestion, transformation, storage, processing and analytics to enable intelligent data products and experiences.
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance and, you’ll hear from a specific customer and their use case to take advantage of fast performance on enormous datasets leveraging economies of scale on the AWS platform.
Speakers:
Ian Meyers, AWS Solutions Architect
Toby Moore, Chief Technology Officer, Space Ape
Mapping Data Flows Perf Tuning April 2021 (Mark Kromer)
This document discusses optimizing performance for data flows in Azure Data Factory. It provides sample timing results for various scenarios and recommends settings to improve performance. Some best practices include using memory optimized Azure integration runtimes, maintaining current partitioning, scaling virtual cores, and optimizing transformations and sources/sinks. The document also covers monitoring flows to identify bottlenecks and global settings that affect performance.
The document discusses the challenges of migrating a production pipeline from a legacy Big Data platform to Spark. It presents an approach using CyFlow, a framework built on Spark that allows component reuse and defines dependencies through a directed acyclic graph (DAG). Key challenges addressed include maintaining semantics during code conversion, meeting real-time constraints, and reducing costs. Metrics for validation include Jaccard similarity and precision/recall. Performance is tuned by aggregating state, modifying partitions, caching data, and unpersisting unneeded dataframes.
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ... (Amazon Web Services)
This document discusses Coursera's use of AWS services like Amazon Redshift, EMR, and Data Pipeline to consolidate their data from various sources, make the data easier for analysts and users to access, and increase the reliability of their data infrastructure. It describes how Coursera programmatically defined ETL pipelines using these services to extract, transform, and load data between sources like MySQL, Cassandra, S3, and Redshift. It also discusses how they built reporting and visualization tools to provide self-service access to the data and ensure high data quality and availability.
RealityMine collects digital user behavior data to help companies with marketing, product development, and analyzing user patterns. They are migrating from an on-premise SQL Server data warehouse to Amazon Redshift to handle doubling data volumes. Redshift provides better performance and scalability at lower cost compared to other options. It requires extracting raw data from SQL Server without encoding issues, loading to S3, and transforming in Redshift using a star schema with careful consideration of distribution and sort keys for query performance. Ongoing database maintenance and backups are also different in Redshift.
This document discusses NoSQL databases and Azure Cosmos DB. It notes that Cosmos DB supports key-value, column, document and graph data models. It guarantees high availability and throughput while offering customizable pricing based on throughput. Cosmos DB uses the Atom-Record-Sequence data model and provides SQL and table APIs to access and query data. The document provides an example of how 12 relational tables could be collapsed into 3 document collections in Cosmos DB.
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster, and WholeStageCodeGen improves performance through Java JIT-compiled code. However, the Java JIT usually does not make good use of the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as an LLVM-based SQL engine, Gandiva. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
Knoldus organized a Meetup on 1 April 2015. In this Meetup, we introduced Spark with Scala. Apache Spark is a fast and general engine for large-scale data processing. Spark is used at a wide range of organizations to process large datasets.
This document summarizes Apache Spark batch APIs, provides real-world examples of Spark jobs, addresses shortcomings of the Spark APIs, and outlines how to run and configure Spark jobs on AWS EMR. The document introduces the RDD, SQL, DataFrame and Dataset APIs in Spark and compares them. It then gives examples of enriching and shredding data with Spark. It discusses type-safe APIs to address issues in the default Spark APIs. Finally, it outlines the configuration needed to run optimized Spark jobs on EMR, including memory, parallelism and allocation settings.
This is a quick introduction to Scalding and monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, on the other hand, promise parallelism and quality, and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.
This document summarizes a presentation about unit testing Spark applications. The presentation discusses why it is important to run Spark locally and as unit tests instead of just on a cluster for faster feedback and easier debugging. It provides examples of how to run Spark locally in an IDE and as ScalaTest unit tests, including how to create test RDDs and DataFrames and supply test data. It also discusses testing concepts for streaming applications, MLlib, GraphX, and integration testing with technologies like HBase and Kafka.
Apache Spark is a fast and general cluster computing system that improves efficiency through in-memory computing and usability through rich APIs. Spark SQL provides a way to work with structured data and transform RDDs using SQL. It can read data from sources like Parquet and JSON files, Hive, and write query results to Parquet for efficient querying. Spark SQL also allows machine learning pipelines to be built by connecting SQL queries to MLlib algorithms.
Spark SQL Deep Dive @ Melbourne Spark Meetup (Databricks)
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
Spark is an open-source cluster computing framework. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark supports batch, streaming, and interactive computations in a unified framework. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across a cluster for parallel processing. RDDs support transformations like map and filter that return new RDDs and actions that return values to the driver program.
Apache Spark, the Next Generation Cluster Computing (Gerger)
This document provides a 3 sentence summary of the key points:
Apache Spark is an open source cluster computing framework that is faster than Hadoop MapReduce by running computations in memory through RDDs, DataFrames and Datasets. It provides high-level APIs for batch, streaming and interactive queries along with libraries for machine learning. Spark's performance is improved through techniques like Catalyst query optimization, Tungsten in-memory columnar formats, and whole stage code generation.
Spark is a fast and general engine for large-scale data processing. It runs on Hadoop clusters through YARN and Mesos, and can also run standalone. Spark is up to 100x faster than Hadoop for certain applications because it keeps data in memory rather than disk, and it supports iterative algorithms through its Resilient Distributed Dataset (RDD) abstraction. The presenter provides a demo of Spark's word count algorithm in Scala, Java, and Python to illustrate how easy it is to use Spark across languages.
Using Spark 1.2 with Java 8 and Cassandra (Denis Dus)
A brief introduction to the Spark data processing ideology and a comparison of Java 7 and Java 8 usage with Spark, with examples of loading and processing data with the Spark Cassandra Loader.
These slides were presented by Hossein Falaki of Databricks to the Atlanta Apache Spark User Group on Thursday, March 9, 2017: https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/238120227/
This document provides an overview of Apache Spark, including:
- The problems of big data that Spark addresses like large volumes of data from various sources.
- A comparison of Spark to existing techniques like Hadoop, noting Spark allows for better developer productivity and performance.
- An overview of the Spark ecosystem and how Spark can integrate with an existing enterprise.
- Details about Spark's programming model including its RDD abstraction and use of transformations and actions.
- A discussion of Spark's execution model involving stages and tasks.
Our product uses third generation Big Data technologies and Spark Structured Streaming to enable comprehensive Digital Transformation. It provides a unified streaming API that allows for continuous processing, interactive queries, joins with static data, continuous aggregations, stateful operations, and low latency. The presentation introduces Spark Structured Streaming's basic concepts including loading from stream sources like Kafka, writing to sinks, triggers, SQL integration, and mixing streaming with batch processing. It also covers continuous aggregations with windows, stateful operations with checkpointing, reading from and writing to Kafka, and benchmarks compared to other streaming frameworks.
A Tale of Two APIs: Using Spark Streaming In Production (Lightbend)
Fast Data architectures are the answer to the increasing need for the enterprise to process and analyze continuous streams of data to accelerate decision making and become reactive to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address Fast Data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we are going to introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
This document provides an overview of the Scala programming language. Some key points:
- Scala runs on the Java Virtual Machine and was created by Martin Odersky at EPFL.
- It has been around since 2003 and the current stable release is 2.7.7. Release 2.8 beta 1 is due out soon.
- Scala combines object-oriented and functional programming. It has features like pattern matching, actors, XML literals, and more that differ from Java. Everything in Scala is an object.
From Query Plan to Query Performance: Supercharging your Apache Spark Queries... (Databricks)
The SQL tab in the Spark UI provides a lot of information for analysing your spark queries, ranging from the query plan, to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented, and have trouble using it to their benefit. In this talk we want to give a gentle introduction to how to read this SQL tab. We will first go over all the common spark operations, such as scans, projects, filter, aggregations and joins; and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.
Spark Streaming Programming Techniques You Should Know with Gerard Maas (Spark Summit)
At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can directly operate on RDDs, transform them to Datasets to make use of that abstraction or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, we will show practical uses of the forgotten ‘ConstantInputDStream’ and will explain how to combine Spark Streaming with probabilistic data structures to optimize the use of memory in order to improve the resource usage of long-running streaming jobs. Attendees of this session will come out with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.
This document provides an overview of Apache Spark, an open-source cluster computing framework. It discusses Spark's history and community growth. Key aspects covered include Resilient Distributed Datasets (RDDs) which allow transformations like map and filter, fault tolerance through lineage tracking, and caching data in memory or disk. Example applications demonstrated include log mining, machine learning algorithms, and Spark's libraries for SQL, streaming, and machine learning.
2. Overview
• Why run an application outside of a cluster?
• Spark in nutshell
• Unit and integration tests
• Tools
• Spark Streaming integration tests
• Best practices and pitfalls
17. Example – word count
WordCount maps (extracts) words from an input source and reduces
(summarizes) the results, returning a count of each word.
18. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
19. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
20. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
wordsRDD
.flatMap((line: String) => line.split(" "))
.map((word: String) => (word, 1))
.reduceByKey((occurence1: Int, occurence2: Int) => {
occurence1 + occurence2
})
21. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
wordsRDD
.flatMap((line: String) => line.split(" "))
.map((word: String) => (word, 1))
.reduceByKey((occurence1: Int, occurence2: Int) => {
occurence1 + occurence2
}).saveAsTextFile("/tmp/output")
22. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
wordsRDD
.flatMap((line: String) => line.split(" "))
.map((word: String) => (word, 1))
.reduceByKey((occurence1: Int, occurence2: Int) => {
occurence1 + occurence2
}).saveAsTextFile("/tmp/output")
23. object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("Quality Excites")
    val sc = new SparkContext(conf)
    val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
    val wordsRDD: RDD[String] = sc.parallelize(words)
    wordsRDD
      .flatMap(WordCount.extractWords)
      .map((word: String) => (word, 1))
      .reduceByKey((occurrence1: Int, occurrence2: Int) => {
        occurrence1 + occurrence2
      }).saveAsTextFile("/tmp/output")
  }
}
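The extracted helper itself never appears on a slide; what follows is a minimal sketch, assuming WordCount is a plain Scala object (name and signature inferred from the call sites):

object WordCount {
  // A pure function: unit-testable without any SparkContext.
  def extractWords(line: String): Array[String] =
    line.split(" ")
}

Because extractWords is just a String => Array[String] function, the unit tests below can exercise it without starting Spark at all.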
25. Example unit test
class S00_UnitTest extends FunSpec with Matchers {
it("should split a sentence into words") {
val line = "Ala ma kota"
val words: Array[String] = WordCount.extractWords(line = line)
val expected = Array("Ala", "ma", "kota")
words should be (expected)
}
}
28. Example unit test
class S00_UnitTest extends BasicScalaTest {
it("should split a sentence into words") {
val line = "Ala ma kota"
val words: Array[String] = WordCount.extractWords(line = line)
val expected = Array("Ala", "ma", "kota")
words should be (expected)
}
}
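BasicScalaTest is used but never defined in the deck; presumably it is a small base class bundling the ScalaTest traits every unit test mixes in. A sketch under that assumption:

import org.scalatest.{FunSpec, Matchers}

// Assumed shape: a shared base class so each test declares one parent.
abstract class BasicScalaTest extends FunSpec with Matchers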
29. Things to note
• Extract anonymous functions so they are testable
• What can be unit tested?
• Executor and driver code not related to Spark
• UDFs (see the sketch below)
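For the UDF case, one way to keep the logic unit-testable is to hold it in a plain function and wrap it in a UDF only at the DataFrame boundary. A hedged sketch (the names are illustrative, not from the deck):

import org.apache.spark.sql.functions.udf

object Udfs {
  // Plain function holding the logic: testable with no Spark at all.
  val normalizeWord: String => String = word => word.trim.toLowerCase

  // Thin wrapper for DataFrame code; nothing Spark-specific to test here.
  val normalizeWordUdf = udf(normalizeWord)
}

A unit test then simply asserts that normalizeWord("  Ala ") returns "ala", without creating a session.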
31. Production code vs test code
Production code
• distributed mode
Test code
• local mode
32. Production code vs test code
Production code
• distributed mode
• RDD from storage
Test code
• local mode
• RDD from resources/memory
33. Production code vs test code
Production code
• distributed mode
• RDD from storage
• Evaluate transformations on RDD
or DStream API.
Test code
• local mode
• RDD from resources/memory
• Evaluate transformations on RDD
or DStream API.
34. Production code vs test code
Production code
• distributed mode
• RDD from storage
• Evaluate transformations on RDD
or DStream API.
• Store outcomes
Test code
• local mode
• RDD from resources/memory
• Evaluate transformations on RDD
or DStream API.
• Assert outcomes
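In practice this split means the transformation itself takes and returns RDDs with no I/O inside, while the production driver wires in storage and a test wires in in-memory data. A minimal sketch of the idea (method names are illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Pipeline {
  // The logic under test: a pure RDD-to-RDD transformation, no I/O.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

  // Production entry point: RDD from storage in, outcomes stored.
  def runProduction(sc: SparkContext, in: String, out: String): Unit =
    countWords(sc.textFile(in)).saveAsTextFile(out)
}

A test instead calls countWords(sc.parallelize(...)) and asserts on the collected result.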
36. What to test in integration tests?
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
wordsRDD
.flatMap((line: String) => line.split(" "))
.map((word: String) => (word, 1))
.reduceByKey((occurrence1: Int, occurrence2: Int) => {
occurrence1 + occurrence2
}).saveAsTextFile("/tmp/output")
40. class S01_IntegrationTest extends SparkSessionBase {
  it("should count word occurrences in all lines") {
    Given("RDD of sentences")
    val linesRdd: RDD[String] = ss.sparkContext.parallelize(
      List("Ala ma kota", "Bolek i Lolek", "Ala ma psa"))
    When("extract and count words")
    val wordsCountRdd: RDD[(String, Int)] = WordCount.extractAndCountWords(linesRdd)
    val actual: Map[String, Int] = wordsCountRdd.collectAsMap().toMap
    Then("words should be counted")
    val expected = Map(
      "Ala" -> 2,
      "ma" -> 2,
      "kota" -> 1,
      "Bolek" -> 1,
      "i" -> 1,
      "Lolek" -> 1,
      "psa" -> 1)
    actual should be(expected)
  }
}
42. class SparkSessionBase extends FunSpec with BeforeAndAfterAll with Matchers with GivenWhenThen {
  var ss: SparkSession = _
  override def beforeAll() {
    val conf = new SparkConf()
      .setMaster("local[4]")
    ss = SparkSession.builder()
      .appName("TestApp" + System.currentTimeMillis())
      .config(conf)
      .getOrCreate()
  }
  override def afterAll() {
    ss.stop()
    ss = null
  }
}
43. class S01_IntegrationTest extends SparkSessionBase {
  it("should count word occurrences in all lines") {
    Given("RDD of sentences")
    val linesRdd: RDD[String] = ss.sparkContext.parallelize(
      List("Ala ma kota", "Bolek i Lolek", "Ala ma psa"))
    When("extract and count words")
    val wordsCountRdd: RDD[(String, Int)] = WordCount.extractAndCountWords(linesRdd)
    val actual: Map[String, Int] = wordsCountRdd.collectAsMap().toMap
    Then("words should be counted")
    val expected = Map(
      "Ala" -> 2,
      "ma" -> 2,
      "kota" -> 1,
      "Bolek" -> 1,
      "i" -> 1,
      "Lolek" -> 1,
      "psa" -> 1)
    actual should equal(expected)
  }
}
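extractAndCountWords is, presumably, the whole chain from the earlier slides moved into the helper; extending the sketch of WordCount from above under that assumption:

import org.apache.spark.rdd.RDD

object WordCount {
  def extractWords(line: String): Array[String] = line.split(" ")

  // Assumed implementation: the flatMap/map/reduceByKey chain,
  // extracted so the integration tests can call it directly.
  def extractAndCountWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(extractWords)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}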
45. it("should count word occurrences in all lines") {
  Given("few lines of sentences")
  val schema = StructType(List(
    StructField("line", StringType, true)))
  val linesDf: DataFrame = ss.read.schema(schema).json(getResourcePath("/text.json"))
  When("extract and count words")
  val wordsCountDf: DataFrame = WordCount.extractFilterAndCountWords(linesDf)
  val wordCount: Array[Row] = wordsCountDf.collect()
  Then("filtered words should be counted")
  val actualWordCount = wordCount
    .map((row: Row) => (row.getAs[String]("word"), row.getAs[Long]("count")))
    .toMap
  val expectedWordCount = Map("Ala" -> 2, "Bolek" -> 1)
  actualWordCount should be(expectedWordCount)
}
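The DataFrame variant under test is not shown either; below is a sketch that would satisfy the assertion above, assuming the "filter" keeps exactly the words the test titles mention ("Ala" and "Bolek"):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, split}

object WordCount {
  // Assumed implementation: split each line into words, keep only the
  // words of interest, and count occurrences per word.
  def extractFilterAndCountWords(lines: DataFrame): DataFrame =
    lines
      .select(explode(split(col("line"), " ")).as("word"))
      .filter(col("word").isin("Ala", "Bolek"))
      .groupBy("word")
      .count()
}

The groupBy("word").count() step yields the "word" and "count" columns that the test reads back with getAs.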
48. it("should return total count of Ala and Bolek words in all lines of text") {
Given("few sentences")
implicit val lineEncoder = product[Line]
val lines = List(
Line(text = "Ala ma kota"),
Line(text = "Bolek i Lolek"),
Line(text = "Ala ma psa"))
val linesDs: Dataset[Line] = ss.createDataset(lines)
When("extract and count words")
val wordsCountDs: Dataset[WordCount] = WordsCount
.extractFilterAndCountWordsDataset(linesDs)
val actualWordCount: Array[WordCount] = wordsCountDs.collect()
Then("filtered words should be counted")
val expectedWordCount = Array(WordCount("Ala", 2), WordCount("Bolek", 1))
actualWordCount should contain theSameElementsAs expectedWordCount
}
49. it("should return total count of Ala and Bolek words in all lines of text") {
  import ss.implicits._
  Given("few sentences")
  implicit val lineEncoder = product[Line]
  val linesDs: Dataset[Line] = List(
    Line(text = "Ala ma kota"),
    Line(text = "Bolek i Lolek"),
    Line(text = "Ala ma psa")).toDS()
  When("extract and count words")
  val wordsCountDs: Dataset[WordCount] = WordsCount
    .extractFilterAndCountWordsDataset(linesDs)
  val actualWordCount: Array[WordCount] = wordsCountDs.collect()
  Then("filtered words should be counted")
  val expectedWordCount = Array(WordCount("Ala", 2), WordCount("Bolek", 1))
  actualWordCount should contain theSameElementsAs expectedWordCount
}
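The typed variant and its case classes are likewise only used, never defined; a sketch consistent with both tests above (the filter is again assumed, and the helper keeps the deck's name WordsCount, presumably to avoid clashing with the WordCount case class):

import org.apache.spark.sql.Dataset

// Case classes assumed by the tests; note they must live at the top
// level, not inside a test class body (see the pitfalls slide below).
case class Line(text: String)
case class WordCount(word: String, count: Long)

object WordsCount {
  def extractFilterAndCountWordsDataset(lines: Dataset[Line]): Dataset[WordCount] = {
    import lines.sparkSession.implicits._
    lines
      .flatMap(_.text.split(" "))
      .filter(word => word == "Ala" || word == "Bolek")
      .groupByKey(identity)
      .count()
      .map { case (word, count) => WordCount(word, count) }
  }
}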
50. Things to note
• What can be tested in integration tests?
• Single transformation on Spark abstractions
• Chain of transformations
• Integration with external services e.g. Kafka, HDFS, YARN
• Embedded instances
• Docker environment
• Prefer Datasets over RDDs or DataFrames
52. spark-fast-tests
class S04_IntegrationDatasetFastTest extends SparkSessionBase with DatasetComparer {
  it("should return total count of Ala and Bolek words in all lines of text") {
    Given("few lines of sentences")
    implicit val lineEncoder = product[Line]
    implicit val wordEncoder = product[WordCount]
    val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
    val linesDs: Dataset[Line] = ss.createDataset(lines)
    When("extract and count words")
    val wordsCountDs: Dataset[WordCount] = WordsCount
      .extractFilterAndCountWordsDataset(linesDs)
    Then("filtered words should be counted")
    val expectedDs = ss.createDataset(Seq(WordCount("Ala", 2), WordCount("Bolek", 1)))
    assertSmallDatasetEquality(wordsCountDs, expectedDs, orderedComparison = false)
  }
}
54. Spark Testing Base
class S06_01_IntegrationDatasetSparkTestingBaseTest extends FunSpec with DatasetSuiteBase with GivenWhenThen {
  it("counting word occurrences on a few lines of text should return the count of Ala and Bolek words in this text") {
    Given("few lines of sentences")
    implicit val lineEncoder = product[Line]
    implicit val wordEncoder = product[WordCount]
    val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
    val linesDs: Dataset[Line] = spark.createDataset(lines)
    When("extract and count words")
    val wordsCountDs: Dataset[WordCount] = WordsCount.extractFilterAndCountWordsDataset(linesDs)
    Then("filtered words should be counted")
    val expectedDs: Dataset[WordCount] = spark.createDataset(Seq(WordCount("Bolek", 1), WordCount("Ala", 2)))
    assertDatasetEquals(expected = expectedDs, result = wordsCountDs)
  }
}
55. Spark Testing Base – not so nice failure messages
• Different length:
1 did not equal 2 Length not Equal
ScalaTestFailureLocation: com.holdenkarau.spark.testing.TestSuite$class at
• Different order of elements:
Tuple2;((0,(WordCount(Ala,2),WordCount(Bolek,1))), (1,(WordCount(Bolek,1),WordCount(Ala,2)))) was not empty
• Different values:
Tuple2;((0,(WordCount(Bole,1),WordCount(Bolek,1)))) was not empty
61. Streaming – Spark Testing Base
class S06_02_StreamingTest_SparkTestingBase extends FunSuite with StreamingSuiteBase {
test("count words") {
val input = List(List("a b"))
val expected = List(List(("a", 1), ("b", 1)))
testOperation[String, (String, Int)](input, count _, expected, ordered = false)
}
// This is the sample operation we are testing
def count(lines: DStream[String]): DStream[(String, Int)] = {
lines.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
}
}
62. How to design easily testable Spark code?
• Extract functions so they are reusable and testable
• A single transformation should do one thing
• Compose transformations using the "transform" function (see the sketch below)
• Prefer column-based functions over UDFs; in order of preference:
• column-based functions
• Dataset operators
• UDFs
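To illustrate the composition advice, here is a hedged sketch (illustrative names) that chains small, single-purpose transformations with Dataset's transform, using column-based functions instead of UDFs:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, lower, split, trim}

object Transformations {
  // Each step does one thing and can be tested in isolation.
  def extractWords(df: DataFrame): DataFrame =
    df.select(explode(split(col("line"), " ")).as("word"))

  // A column-based function rather than a UDF, so Catalyst can optimise it.
  def normalizeWords(df: DataFrame): DataFrame =
    df.withColumn("word", lower(trim(col("word"))))

  def countWords(df: DataFrame): DataFrame =
    df.groupBy("word").count()
}

// The composition then reads as a linear pipeline:
// linesDf
//   .transform(Transformations.extractWords)
//   .transform(Transformations.normalizeWords)
//   .transform(Transformations.countWords)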
69. Pitfalls to look out for
• You cannot refer to one RDD inside another RDD
• Spark processes a batch of data, not a single message or domain entity
• Case classes defined in a test class body throw a SerializationException
• Spark reads JSON according to the http://jsonlines.org/ specification (see the example below)
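Concretely, the JSON Lines point means an input file must contain one complete JSON object per line, not a single pretty-printed document. The text.json resource used in the DataFrame test would presumably look like this:

{"line": "Ala ma kota"}
{"line": "Bolek i Lolek"}
{"line": "Ala ma psa"}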
Driver program - runs the user's main function and executes various parallel operations on a cluster
RDDs - collections of elements partitioned across the nodes of the cluster that can be operated on in parallel
Worker - manages resources on a cluster node
Executor - JVM process which stores data and executes tasks
Tasks - units of work that execute RDD operations