Radical Speed for SQL Queries on Databricks: Photon Under the Hood (Databricks)
Join this session to hear the Photon product and engineering team talk about the latest developments in the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
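To make the idea concrete, the sketch below (plain Python with NumPy, purely illustrative and not Photon's actual C++ code) contrasts row-at-a-time interpretation with processing a whole column batch at once, which is what lets a vectorized engine exploit data-level parallelism and cache locality:

    import numpy as np

    # Row-at-a-time: per-value interpretation overhead on every row.
    def filter_rows(rows, threshold):
        out = []
        for r in rows:
            if r["amount"] > threshold:
                out.append(r)
        return out

    # Vectorized: one operation over a whole column batch; the runtime
    # can use SIMD instructions and keep the data cache-resident.
    def filter_batch(amounts, threshold):
        return amounts[amounts > threshold]

    amounts = np.random.rand(1_000_000)
    print(len(filter_batch(amounts, 0.5)))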
HyperLogLog in Hive - How to count sheep efficiently? (bzamecnik)
This document discusses using HyperLogLog (HLL) in Hive to efficiently estimate the number of unique elements or cardinality in big datasets. It describes how HLL provides fast approximate counting using probabilistic data structures. It covers implementing HLL as user-defined functions in Hive, comparing different open source implementations, and examples of using HLL to estimate unique visitors per day and in a rolling window.
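As a rough illustration of the probabilistic idea (a toy estimator, not the Hive UDF implementations the deck compares), the sketch below hashes each value, uses the first bits of the hash to pick a register, and keeps the maximum leading-zero rank seen per register; a bias-corrected harmonic mean of the registers yields the cardinality estimate:

    import hashlib

    P = 12                       # 2^12 = 4096 registers (~1.6% error)
    M = 1 << P
    registers = [0] * M

    def add(value):
        h = int(hashlib.sha1(value.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - P)                       # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
        registers[idx] = max(registers[idx], rank)

    def estimate():
        alpha = 0.7213 / (1 + 1.079 / M)          # bias correction for large M
        return alpha * M * M / sum(2.0 ** -r for r in registers)

    for i in range(100_000):
        add("sheep-%d" % (i % 50_000))            # 50,000 distinct sheep
    print(round(estimate()))                      # close to 50,000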
Real-time Analytics with Trino and Apache Pinot (Xiang Fu)
Trino Summit 2021:
An overview of the Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support with the power of Apache Pinot's real-time analytics, giving you the best of both worlds.
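For a flavor of what this looks like from the Trino side, here is a hedged sketch using the trino Python client; the host, catalog name ("pinot") and table ("clicks") are placeholders for a deployment that has a Pinot catalog configured:

    import trino

    conn = trino.dbapi.connect(
        host="trino.example.com", port=8080,       # placeholder endpoint
        user="analyst", catalog="pinot", schema="default",
    )
    cur = conn.cursor()
    # Trino accepts full ANSI SQL and pushes down what it can to Pinot.
    cur.execute("""
        SELECT country, count(*) AS views
        FROM clicks
        WHERE event_time > current_timestamp - INTERVAL '1' HOUR
        GROUP BY country
        ORDER BY views DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)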
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization, and different loading schemes.
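A minimal PySpark sketch of the loading pattern (table and column names are illustrative, not from the deck): writing partitioned ORC gives Hive partition pruning on the partition column and predicate pushdown into ORC's built-in stripe and row-group indexes.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-load")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.read.json("/raw/events/")           # illustrative source

    (df.repartition("dt")                          # one writer per partition value
       .write
       .partitionBy("dt")                          # Hive partition pruning on dt
       .format("orc")                              # ORC indexes enable pushdown
       .mode("append")
       .saveAsTable("events"))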
Python is popular amongst data scientists and engineers for data processing tasks. The big data ecosystem has traditionally been rather JVM-centric, and Java (or Scala) is often the only viable option for implementing data processing pipelines. That sometimes poses an adoption barrier for organizations that have already invested in other language ecosystems. The Apache Beam project provides a unified programming model for data processing, and its ongoing portability effort aims to enable multiple language SDKs (currently Java, Python and Go) on a common set of runners. The combination of Python streaming on the Apache Flink runner is one example. Let's take a look at how the Flink runner translates the Beam model into the native DataStream (or DataSet) API, how the runner is changing to support portable pipelines, how Python user code execution is coordinated with gRPC-based services, and how a sample pipeline runs on Flink.
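A minimal portable Beam pipeline, for orientation (the Flink master address is a placeholder, and the Flink runner must be available in your environment):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    opts = PipelineOptions([
        "--runner=FlinkRunner",
        "--flink_master=localhost:8081",      # placeholder address
        "--environment_type=LOOPBACK",        # run Python user code locally
    ])

    with beam.Pipeline(options=opts) as p:
        (p
         | beam.Create(["to be or not to be"])
         | beam.FlatMap(str.split)
         | beam.combiners.Count.PerElement()  # (word, count) pairs
         | beam.Map(print))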
Migration to ClickHouse. Practical guide, by Alexander Zaitsev (Altinity Ltd)
This document provides a summary of migrating to ClickHouse for analytics use cases. It discusses the author's background and company's requirements, including ingesting 10 billion events per day and retaining data for 3 months. It evaluates ClickHouse limitations and provides recommendations on schema design, data ingestion, sharding, and SQL. Example queries demonstrate ClickHouse performance on large datasets. The document outlines the company's migration timeline and challenges addressed. It concludes with potential future integrations between ClickHouse and MySQL.
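For a feel of the schema-design advice, here is a hedged sketch using the clickhouse-driver package; the host, table and columns are illustrative stand-ins, not the guide's exact schema, though the 3-month TTL mirrors the stated retention requirement:

    from datetime import date, datetime
    from clickhouse_driver import Client

    client = Client("clickhouse.example.com")      # placeholder host

    client.execute("""
        CREATE TABLE IF NOT EXISTS events (
            event_date Date,
            event_time DateTime,
            user_id    UInt64,
            event_type LowCardinality(String)
        )
        ENGINE = MergeTree
        PARTITION BY toYYYYMM(event_date)
        ORDER BY (event_type, event_date, user_id)
        TTL event_date + INTERVAL 3 MONTH
    """)
    # ClickHouse strongly favors large batched inserts over row-at-a-time writes.
    client.execute(
        "INSERT INTO events (event_date, event_time, user_id, event_type) VALUES",
        [(date(2024, 1, 1), datetime(2024, 1, 1, 0, 0), 1, "click")],
    )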
Parquet performance tuning: the missing guide (Ryan Blue)
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
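The corresponding knobs are exposed in PyArrow, for example; the sketch below (values illustrative) sorts on the filter column so row-group min/max statistics are tight, keeps row groups small for finer skipping, and raises the dictionary page limit to avoid fallback to plain encoding:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "user_id": pa.array(range(1_000_000)),
        "country": pa.array(["US", "DE", "JP", "US"] * 250_000),
    })

    table = table.sort_by("country")              # tight min/max statistics
    pq.write_table(
        table, "events.parquet",
        row_group_size=128 * 1024,                # smaller groups, finer skipping
        use_dictionary=["country"],               # dictionary-encode the hot column
        dictionary_pagesize_limit=2 * 1024 * 1024,  # avoid dictionary fallback
    )
    print(pq.ParquetFile("events.parquet").metadata)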
Apache Spark on K8S Best Practice and Performance in the Cloud (Databricks)
As of Spark 2.3, Spark can run on clusters managed by Kubernetes. We will describe best practices for running Spark SQL on Kubernetes on Tencent Cloud, including how to deploy Kubernetes on a public cloud platform to maximize resource utilization and how to tune Spark configurations to take advantage of the Kubernetes resource manager and achieve the best performance. To evaluate performance, the TPC-DS benchmark will be used to analyze the performance impact of queries across configuration sets (see the configuration sketch below).
Speakers: Junjie Chen, Junping Du
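A sketch of the kind of Kubernetes-specific Spark configuration involved (the API server address, image, namespace and sizes are placeholders, not the speakers' settings):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .master("k8s://https://kube-apiserver:6443")          # placeholder API server
        .appName("tpcds-on-k8s")
        .config("spark.kubernetes.container.image", "myrepo/spark:3.0.0")
        .config("spark.kubernetes.namespace", "spark-jobs")
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
        .config("spark.executor.instances", "8")              # sized to the node pool
        .config("spark.executor.memory", "8g")
        .config("spark.executor.cores", "4")
        .getOrCreate())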
Native Support of Prometheus Monitoring in Apache Spark 3.0 (Databricks)
All production environments require monitoring and alerting. Apache Spark also has a configurable metrics system that allows users to report Spark metrics to a variety of sinks. Prometheus is one of the popular open-source monitoring and alerting toolkits and is often used together with Apache Spark.
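For orientation, a sketch of the Spark 3.0 settings involved (verify the exact keys and endpoint paths against your Spark version's documentation):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("prometheus-demo")
        # Expose driver/executor metrics via the PrometheusServlet sink:
        .config("spark.metrics.conf.*.sink.prometheusServlet.class",
                "org.apache.spark.metrics.sink.PrometheusServlet")
        .config("spark.metrics.conf.*.sink.prometheusServlet.path",
                "/metrics/prometheus")
        # Re-export executor metrics on the driver UI in Prometheus format:
        .config("spark.ui.prometheus.enabled", "true")
        .getOrCreate())
    # Prometheus can then scrape e.g. http://<driver>:4040/metrics/prometheus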
This document provides an implementation and user's guide for Oracle Global Order Promising. It contains 7 chapters that describe how to set up and use ATP functionality based on collected data or planning output, including configuration, product family, and multi-level supply chain ATP. It also covers ATP inquiry, order scheduling, a diagnostic ATP tool, and an order backlog workbench for scheduling order lines.
Presentation given at Coolblue B.V. demonstrating Apache Airflow (incubating), what we learned from the underlying design principles, and how an implementation of these principles reduces the amount of ETL effort. Why choose Airflow? Because it makes your engineering life easier and more people can contribute to how data flows through the organization, so you can spend more time applying your brain to harder problems like machine learning, deep learning and higher-level analysis.
Google announced its serverless solution in early March, letting developers easily build microservices from zero to planet scale, all without managing infrastructure. Peter will talk about Google's solution in general and how we can deploy and debug a serverless application.
1. The document provides steps to create a repository in Oracle BI 11g using the Administration Tool by importing metadata from the BISAMPLE schema.
2. Key steps include creating a new repository, importing the BISAMPLE schema tables, verifying the connection by updating row counts, creating aliases for the tables with descriptive names, and defining the physical keys and foreign-key joins between the tables in the Physical layer.
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi (Databricks)
This document discusses speeding up OLAP cube building in Apache Kylin using Spark. Cubing with MapReduce can be slow due to serialization overhead and repeated job submissions. Spark allows caching data in memory across cuboid layers in one job, significantly reducing build times compared to MapReduce as shown in a benchmark on a 160 million row dataset. Spark simplifies Kylin development and brings capabilities for real-time OLAP and cloud integration.
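A toy PySpark sketch of the by-layer idea (Kylin's real cubing is far more involved; the table, dimensions and paths are illustrative): the widest cuboid is computed once and cached in memory, and every narrower cuboid rolls up from it inside a single Spark job instead of one MapReduce job per layer.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("layered-cubing").getOrCreate()
    base = spark.table("sales")                       # illustrative fact table

    # Widest cuboid, cached in memory for all child aggregations.
    full = (base.groupBy("country", "city", "product")
                .agg(F.sum("amount").alias("amount"))
                .cache())

    # Narrower cuboids roll up from the cached cuboid within the same job.
    for dims in (["country", "city"], ["country"], []):
        cuboid = full.groupBy(*dims).agg(F.sum("amount").alias("amount"))
        cuboid.write.mode("overwrite").parquet("/cube/" + ("_".join(dims) or "total"))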
SAP Global Available to Promise (gATP) 101: Global Visibility vs. Global Avai... (Plan4Demand)
For more information on SAP's gATP visit http://www.plan4demand.com, call 866-P4D-INFO, or email info@plan4demand.com
Unfortunately, many companies today make promises to customers without knowing whether they can actually meet the demand. Global Available to Promise (gATP) is often considered the most difficult module to explain and, as a result, the most difficult to evaluate: when should you implement it, and what should you expect when you do?
SAP Global Available to Promise (gATP) is a powerful tool that lets the system look anywhere in the supply chain you have designated for available product, in real time. gATP can give you the world at your fingertips.
This session will cover key things to consider when setting up the logic on what you want to see, when you want to see it, and how to establish a logical flow to best suit your needs without creating too much complexity.
Sharon Nelson and Charlie MacMaster combine over 30 years of SAP experience to discuss the capabilities of ATP and gATP, their similarities and differences, and how SD and gATP work together, and to explore the concept of availability vs. global visibility and how to build logic that capitalizes on both.
Key takeaways will include:
• How to know if gATP is a fit for your organization
• Understanding how gATP interacts with ECC, PP/DS and other APO Components
• Overview of different Types of ATP checks and sequences
• What materials, products, regions should be "checked" in gATP
• When and where people must engage beyond the system
• How to define rules in gATP for what you want and really need
This document provides an overview of Apache Airflow, an open-source workflow management system. It describes Airflow's key features like workflow definition using directed acyclic graphs (DAGs), rich UI, scheduler, operators for tasks like databases and web services, and use of Jinja templating. The document also discusses Airflow's architecture with parallel execution, UI, command line operations like backfilling, and security features. Airflow is used by over 200 companies for workflows like ETL, analytics, and machine learning pipelines.
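A minimal DAG in the style described (schedule and task bodies are illustrative):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract(**context):
        # "ds" is the logical date, also available to Jinja templates.
        print("pulling data for", context["ds"])

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,                      # no automatic backfill on deploy
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_load = PythonOperator(task_id="load",
                                python_callable=lambda: print("loading"))
        t_extract >> t_load                 # DAG edge: extract before load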
Presto is a distributed SQL query engine that allows users to run SQL queries against various data sources. It consists of three main components - a coordinator, workers, and clients. The coordinator manages query execution by generating execution plans, coordinating workers, and returning final results to the client. Workers contain execution engines that process individual tasks and fragments of a query plan. The system uses a dynamic query scheduler to distribute tasks across workers based on data and node locality.
Connecting and using PostgreSQL database with psycopg2 [Python 2.7] (Dinesh Neupane)
This presentation covers the basics of connecting to a PostgreSQL database from Python using the psycopg2 module (a minimal sketch follows the topic list below).
Covered Topics:
1. Psycopg2 Installation
2. Connecting to PostgreSQL Database
3. Connection Parameters
4. Create and Drop Table
5. Adaptation of Python Values to SQL Types
6. SQL Transactions
7. DML
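A minimal end-to-end sketch covering the listed topics (connection parameters are placeholders; written against psycopg2's stable DB-API, which is essentially the same in Python 2.7 and 3):

    import psycopg2

    conn = psycopg2.connect(
        host="localhost", port=5432,                 # placeholder parameters
        dbname="testdb", user="postgres", password="secret",
    )
    cur = conn.cursor()

    cur.execute("DROP TABLE IF EXISTS users")
    cur.execute("CREATE TABLE users (id serial PRIMARY KEY, name text, age integer)")

    # psycopg2 adapts Python values to SQL types; pass parameters
    # separately instead of building SQL strings by hand.
    cur.execute("INSERT INTO users (name, age) VALUES (%s, %s)", ("Alice", 30))

    conn.commit()          # psycopg2 opens a transaction implicitly; commit it

    cur.execute("SELECT id, name, age FROM users")
    print(cur.fetchall())

    cur.close()
    conn.close()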
This document discusses BW/4HANA migration options and scenarios. It outlines various activities customers can do now to prepare for migration such as using tools to check compatibility and generate reports. Potential challenges are mentioned along with time-consuming tasks like testing and documentation updates. The presentation aims to help customers understand their migration options and develop a readiness plan.
Kernel Recipes 2018 - CPU Idle Loop Rework - Rafael J. Wysocki (Anne Nicolas)
The CPU idle loop is the piece of code executed by logical CPUs if they have no tasks to run. If the CPU supports idle states allowing it to draw less power while not executing any instructions, the idle loop invokes a CPU idle governor to select the most suitable idle state for the CPU and it puts the CPU into the selected idle state with the help of a CPU idle driver. Generally speaking, the idle state selection carried out by the CPU idle governor is based on predicting the duration of the idle time for the CPU, so it is not deterministic.
That turned out to be problematic due to a design issue in the CPU idle loop, which tended to stop the scheduler tick prematurely; it was often stopped when there was no need to stop it. That led either to excessive overhead from unnecessarily stopping and restarting the scheduler tick, or to situations in which the CPU might be put into an idle state that was too shallow and, in consequence, might draw too much power for a relatively long time. That issue was addressed during the 4.17 kernel development cycle by redesigning the idle loop so that the scheduler tick is only stopped, if necessary, after the idle state for the CPU has been selected, which involved resolving a Catch-22 dependency between the idle-duration prediction by the governor and the next timer event related to the scheduler tick.
I will explain the design of the CPU idle loop in Linux and how the problem with it was fixed. I will also show some test results demonstrating the achieved improvements and discuss some possible future improvements in this area.
This document provides an overview and introduction to Apache Flink, a stream-based big data processing engine. It discusses the evolution of big data frameworks to platforms and the shortcomings of Spark's RDD abstraction for streaming workloads. The document then introduces Flink, covering its history, key differences from Spark like its use of streaming as the core abstraction, and examples of using Flink for batch and stream processing.
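For a taste of the streaming-first API from Python, here is a hedged sketch using PyFlink's DataStream API (requires the apache-flink package; a local environment is assumed):

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    (env.from_collection(["to be or not to be"])
        .flat_map(lambda line: line.split())          # one record per word
        .map(lambda w: (w, 1))
        .key_by(lambda t: t[0])                       # streams, not batch RDDs
        .reduce(lambda a, b: (a[0], a[1] + b[1]))     # running counts
        .print())

    env.execute("word-count")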
The document describes the ASAP 8 methodology for implementing SAP systems. The methodology has 6 phases: Project Preparation, Business Blueprint, Realization, Testing/Final Preparation, Go Live & Support, and Operate. Each phase is described in 1-3 sentences with the purpose, timeline, and key activities. The Realization phase implements the system configuration and can take 3-6 months. Testing resolves issues to prepare for going live within 30-60 days.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... (Databricks)
In the big data field, Spark SQL is an important data processing module for Apache Spark, working with structured, row-based data in a majority of operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can bring not only better performance but also lower power consumption when accelerating the CPU-intensive segments of an application.
Understanding Memory Management In Spark For Fun And Profit (Spark Summit)
1) The document discusses memory management in Spark applications and summarizes different approaches tried by developers to address out-of-memory errors in Spark executors.
2) It analyzes the root causes of memory issues, such as executor overheads and data sizes, and evaluates fixes such as increasing memory overhead, reducing cores, and more frequent garbage collection.
3) The document dives into Spark- and JVM-level configuration options for memory, such as storage pool sizes, caching formats, and garbage collection settings, to improve the reliability, efficiency, and performance of Spark jobs.
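A sketch of the main knobs involved (values are illustrative, not recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("memory-tuning")
        .config("spark.executor.memory", "8g")           # JVM heap per executor
        .config("spark.executor.memoryOverhead", "2g")   # native/off-heap headroom
        .config("spark.executor.cores", "4")             # fewer cores = more memory per task
        .config("spark.memory.fraction", "0.6")          # unified execution+storage pool
        .config("spark.memory.storageFraction", "0.5")   # share protected for caching
        .getOrCreate())

    df = spark.range(10_000_000)
    df.persist()         # DataFrame caching uses a compact columnar format
    df.count()           # materialize the cache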
Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu... (HostedbyConfluent)
Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu | Current 2022
Stream processing is becoming increasingly essential for extracting business value from data in real time. To achieve strict user-defined SLAs under constantly changing workloads, modern streaming systems have started taking advantage of the cloud for scalable and resilient resources. New demand opens new opportunities and challenges for state management, which is at the core of streaming systems. Existing approaches typically use embedded key-value storage so that each worker can access it locally to achieve high performance. However, this requires an external durable file system for checkpointing, is complicated and time-consuming when redistributing state during scaling and migration, and is prone to performance throttling. Therefore, we propose shared storage based on an LSM-tree. State is stored in cloud object storage, where it is seamlessly durable, and the high bandwidth of cloud storage enables fast recovery. The location of a state partition is decoupled from compute nodes, making scaling straightforward and more efficient. Compaction in this shared LSM-tree is globally coordinated with opportunistic serverless boosting instead of relying on individual compute nodes. We design a streaming-aware compaction and caching strategy to achieve smoother and better end-to-end performance (a toy sketch of the shared-storage idea follows below).
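In this toy sketch (a dict stands in for cloud object storage; the real design is far more elaborate), the memtable flushes immutable sorted runs to the shared store, so durability needs no separate checkpoint file system and any compute node can take over a partition simply by listing its runs:

    import json

    object_store = {}                  # stand-in for S3/GCS: key -> serialized run

    class SharedLsmPartition:
        def __init__(self, partition_id, memtable_limit=4):
            self.pid = partition_id
            self.memtable = {}
            self.limit = memtable_limit
            self.run_ids = []          # flushed runs, oldest first

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.limit:
                self.flush()

        def flush(self):
            run_id = "%s/run-%06d" % (self.pid, len(self.run_ids))
            object_store[run_id] = json.dumps(sorted(self.memtable.items()))
            self.run_ids.append(run_id)    # durability now lives in the store
            self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run_id in reversed(self.run_ids):      # newest run wins
                entries = dict(json.loads(object_store[run_id]))
                if key in entries:
                    return entries[key]
            return None

    p = SharedLsmPartition("p0")
    for i in range(10):
        p.put("k%d" % i, i)
    print(p.get("k3"))                 # served from a flushed run, not local state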
Apache Arrow: High Performance Columnar Data Framework (Wes McKinney)
The document summarizes a study that tracked medical students' empathy skills with curriculum changes at a university school of medicine. It found that:
1) Empathy scores modestly but significantly improved with the introduction of a new measurement tool (MIRS) and additional practice/training using this tool over multiple years.
2) Empathy scores improved further with the addition of a specific video assignment focusing on demonstrating empathy using the NURS model.
3) The improvements were found to be statistically significant based on analysis of variance testing, but the study was limited to one school with a particular curriculum and measurement approach.
This document describes a slideshow created by Rickey Lowe that focuses on aesthetics and uses emotional, documentary-style images.
2013: DemAlliance results and plans (Vera Gruzova)
The Democratic Alliance calls for offensive action in 2014
http://dem-alliance.org/news/demokratichnii-aljans-zaklikaje-do-nastupalnih-dii-v-2014-roci.html
DemAlliance youth invites you to the mountains for a seminar on student self-government
http://dem-alliance.org/anons/molod-demaljansu-zaproshuje-seminar-studentske-samovrjaduvannja.html
Scientific publishing in the 21st century: openly and without predators (Tereza Simandlová)
A lecture for doctoral students and academics of the First Faculty of Medicine of Charles University and the General University Hospital on open access and predatory journals, publishers and practices.
The lecture took place on 14 June 2016 at the Institute of Scientific Information of the First Faculty of Medicine, Charles University.