Apache Impala is an exceptional, best-of-breed massively parallel processing SQL query engine and a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses, explains how Impala optimizes queries, and shows how to identify performance bottlenecks through query plans and profiles so you can drive Impala to its full potential.
Performance Optimizations in Apache Impala - Cloudera, Inc.
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop that batch frameworks such as Hive or Spark do not deliver. Impala is written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
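The static partition pruning mentioned above can be sketched conceptually in plain Python (this is an illustration, not Impala's actual implementation): partition-key values are encoded in the directory layout, so a predicate on those keys can eliminate whole partitions before any data file is opened.

```python
# Conceptual sketch of static partition pruning: evaluate the predicate
# against partition-key values first, so non-matching partitions are
# skipped entirely and their files are never read.

partitions = {
    "year=2023/month=01": ["part-0.parq", "part-1.parq"],
    "year=2023/month=02": ["part-0.parq"],
    "year=2024/month=01": ["part-0.parq", "part-1.parq"],
}

def parse_keys(partition_path):
    """Turn 'year=2024/month=01' into {'year': '2024', 'month': '01'}."""
    return dict(kv.split("=") for kv in partition_path.split("/"))

def prune(partitions, predicate):
    """Keep only files whose partition keys satisfy the predicate."""
    kept = []
    for path, files in partitions.items():
        if predicate(parse_keys(path)):
            kept.extend(f"{path}/{f}" for f in files)
    return kept

# WHERE year = '2024' touches only the 2024 partition's files.
files_to_scan = prune(partitions, lambda keys: keys["year"] == "2024")
```

Dynamic partition pruning works the same way, except the qualifying key values are only discovered at runtime (e.g. from a join's build side) rather than from the query text.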
An Introduction to Cloudera Impala shows how Impala works and the internal processing of queries in Impala, including architecture, frontend, query compilation, backend, code generation, HDFS-related details, and a performance comparison.
Building a SIMD Supported Vectorized Native Engine for Spark SQL - Databricks
Spark SQL works very well with structured row-based data, and its vectorized readers and writers for Parquet/ORC make I/O much faster. It also uses WholeStageCodeGen to improve performance via Java JIT-compiled code. However, the Java JIT usually does not utilize the latest SIMD instructions well under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as an LLVM-based SQL engine, Gandiva. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
Parquet performance tuning: the missing guide - Ryan Blue
Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need.
Topics include:
* The tools and techniques Netflix uses to analyze Parquet tables
* How to spot common problems
* Recommendations for Parquet configuration settings to get the best performance out of your processing platform
* The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... - Databricks
Uber has real needs to provide faster, fresher data to data consumers and products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture, and use cases of the second generation of 'Hudi', a self-contained Apache Spark library for building the large-scale analytical datasets that serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage while supporting fast ingestion and queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs and operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and build notebooks/dashboards on top using Spark SQL.
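The Spark Datasource write path mentioned above can be sketched as below. The option keys follow Hudi's documented Spark configs; the table, key, and path names are illustrative assumptions, and the commented lines show where the actual Spark write call would go (it needs a running SparkSession with the Hudi bundle on the classpath).

```python
# Hudi write configuration for an upsert via the Spark DataSource API.
hudi_options = {
    "hoodie.table.name": "trips",                          # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",     # unique record key
    "hoodie.datasource.write.partitionpath.field": "city", # partition column
    "hoodie.datasource.write.precombine.field": "ts",      # latest-wins dedup field
    "hoodie.datasource.write.operation": "upsert",         # vs. insert / bulk_insert
}

# With a SparkSession and a DataFrame `df`, the write would be roughly:
# df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("/data/hudi/trips")
```

The `precombine` field is what lets Hudi pick the freshest version when the same record key arrives more than once in a batch, which is central to the fast-ingestion story.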
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains as well as integration with other big data technologies such as Apache Spark, Druid, and Kafka. The talk will also provide a glimpse of what is expected to come in the near future.
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
Building a large-scale transactional data lake using Apache Hudi - Bill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency; it helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then deep dive into improving data operations through features such as data versioning and time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Cloud DW benchmark using TPC-DS (Snowflake vs Redshift vs EMR Hive) - SANG WON PARK
For the past few years, data architecture has been changing rapidly.
Among these changes, Cloud DW has drawn attention as an alternative to the limitations (performance, cost, operations, etc.) of the existing Hadoop-based Data Lake,
and many companies have already adopted one or are evaluating adoption.
This material builds a conceptual understanding of Cloud DW
and compares the various Cloud DW products on the market from a performance/cost perspective to determine which fits a company's environment.
- Why are companies paying attention to Cloud DW?
- What products are on the market?
- Which product should we adopt for our business environment?
- How do Cloud DW solutions perform?
- How do they perform compared to the existing Data Lake (EMR)?
- How do comparable Cloud DWs (Snowflake vs Redshift) perform against each other?
Going forward, the data market will rapidly grow a new ecosystem around Cloud DW, including ELT, Data Mesh, and Reverse ETL,
and this will require technical review and thought from the data engineer / data architect perspective.
https://blog.naver.com/freepsw/222654809552
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
What Is the Core of Spark? RDD! (RDD paper review) - Yongho Ha
These days Spark is even hotter than Hadoop.
To understand the core of Spark, you need to understand its core data structure: Resilient Distributed Datasets (RDDs).
Let's review the original paper to see how RDDs work.
http://www.cs.berkeley.edu/~matei/papers/2012/sigmod_shark_demo.pdf
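The paper's key idea can be sketched in a few lines of plain Python (a toy model, not Spark's implementation): an RDD is a lazy, immutable description of how to compute a dataset from its parent, so the whole lineage can be re-run to rebuild a lost partition.

```python
class ToyRDD:
    """Toy model of an RDD: stores lineage (how to compute), not data."""

    def __init__(self, compute):
        self._compute = compute  # a zero-arg function producing an iterator

    @staticmethod
    def parallelize(items):
        items = list(items)
        return ToyRDD(lambda: iter(items))

    def map(self, f):
        # A transformation returns a NEW RDD; nothing is evaluated yet.
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        # Only an action forces evaluation of the whole lineage.
        return list(self._compute())

squares = (ToyRDD.parallelize(range(5))
           .map(lambda x: x * x)
           .filter(lambda x: x % 2 == 0))
result = squares.collect()  # recomputable any number of times from lineage
```

Real RDDs add partitioning, scheduling, and caching on top, but fault tolerance via lineage recomputation, rather than data replication, is the core contribution of the paper.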
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin - DataStax Academy
You know you need Cassandra for its uptime and scaling, but what about that data model? Let's bridge that gap and get you building your game-changing app. We'll break down topics like storing objects and indexing for fast retrieval. You will see that by understanding a few things about Cassandra internals, you can put your data model in the spotlight. The goal of this talk is to get you comfortable working with data in Cassandra throughout the application lifecycle. What are you waiting for? The cameras are waiting!
A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Together with the Hive Metastore, these table formats are trying to solve long-standing problems in traditional data lakes with declared features like ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
Meta/Facebook's database serving social workloads runs on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depend a lot on RocksDB. Beyond MyRocks, we also have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
All about ZooKeeper and ClickHouse Keeper - Altinity Ltd
ClickHouse clusters depend on ZooKeeper to handle replication and distributed DDL commands. In this Altinity webinar, we'll explain why ZooKeeper is necessary, how it works, and introduce the new built-in replacement named ClickHouse Keeper. You'll learn practical tips to care for ZooKeeper in sickness and health, as well as how and when to use ClickHouse Keeper, along with our recommendations for keeping it happy too.
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ... - Amazon Web Services
Get a look under the covers: Learn tuning best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your delivery of queries and improve overall database performance. This session explains how to migrate from existing data warehouses, create an optimized schema, efficiently load data, use workload management, tune your queries, and use Amazon Redshift's interleaved sorting features. You'll then hear from a customer who has leveraged Redshift in their industry and how they have adopted many of the best practices. Learn More: https://aws.amazon.com/government-education/
This presentation was prepared for a Webcast where John Yerhot, Engine Yard US Support Lead, and Chris Kelly, Technical Evangelist at New Relic discussed how you can scale and improve the performance of your Ruby web apps. They shared detailed guidance on issues like:
Caching strategies
Slow database queries
Background processing
Profiling Ruby applications
Picking the right Ruby web server
Sharding data
Attendees will learn how to:
Gain visibility on site performance
Improve scalability and uptime
Find and fix key bottlenecks
See the on-demand replay:
http://pages.engineyard.com/6TipsforImprovingRubyApplicationPerformance.html
Swift distributed tracing method and tools v2 - zhang hua
A proposal for a Swift session at the OpenStack Atlanta design summit.
http://junodesignsummit.sched.org/event/0f185cd5bcc2c9b58c639bba25bc0025#.U3SZRa1dXd4
http://summit.openstack.org/cfp/details/354
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir... - InfluxData
Dean will provide practical tips and techniques learned from helping hundreds of customers deploy InfluxDB and InfluxDB Enterprise. This includes hardware and architecture choices, schema design, configuration setup, and running queries.
Best Practices for Building Robust Data Platform with Apache Spark and Delta - Databricks
This talk will focus on Journey of technical challenges, trade offs and ground-breaking achievements for building performant and scalable pipelines from the experience working with our customers.
What’s New in the Upcoming Apache Spark 3.0 - Databricks
Learn about the latest developments in the open-source community with Apache Spark 3.0 and DBR 7.0. The upcoming Apache Spark™ 3.0 release brings new capabilities and features to the Spark ecosystem. In this online tech talk from Databricks, we will walk through updates in the Apache Spark 3.0.0-preview2 release as part of our new Databricks Runtime 7.0 Beta, which is now available.
Before migrating from 10g to 11g or 12c, take the following considerations into account. It is not as simple as just switching the database engine; considerations at the application level are also needed.
Apache Spark 3.0: Overview of What’s New and Why Care - Databricks
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3000 resolved JIRAs. We will talk about the exciting new developments in the Spark 3.0 as well as some other initiatives that are coming in the future. In this talk, we want to share with the Bogota Spark community an overview of Spark 3.0 features and enhancements.
In particular, we will touch upon the following areas:
* Performance Improvement Features
* Improved Useability Features
* ANSI SQL Compliance
* Pandas UDFs
* Compatibility and migration considerations
* Spark Ecosystem: Delta Lake, Project Hydrogen, and Project Zen
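Of the features listed above, Pandas UDFs are easy to illustrate: the function body is ordinary pandas operating on whole column batches, so it can be written and unit-tested without a cluster. A sketch (the function and column names are illustrative; the commented lines show how it would be registered in PySpark 3.x):

```python
import pandas as pd

def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    """Vectorized over a whole batch of rows, not called once per row."""
    return c * 9.0 / 5.0 + 32.0

# In PySpark 3.x this plain function would become a Pandas UDF roughly like:
# from pyspark.sql.functions import pandas_udf
# to_f = pandas_udf(celsius_to_fahrenheit, returnType="double")
# df.select(to_f("temp_c"))

out = celsius_to_fahrenheit(pd.Series([0.0, 100.0]))
```

Spark ships Arrow-encoded batches to the Python worker, which is why these UDFs avoid most of the per-row serialization cost of classic Python UDFs.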
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick? - Jim Czuprynski
Autonomous Transaction Processing (ATP) - the second in the family of Oracle’s Autonomous Databases – offers Oracle DBAs the ability to apply a force multiplier for their OLTP database application workloads. However, it’s important to understand both the benefits and limitations of ATP before migrating any workloads to that environment. I'll offer a quick but deep dive into how best to take advantage of ATP - including how to load data quickly into the underlying database – and some ideas on how ATP will impact the role of Oracle DBA in the immediate future. (Hint: Think automatic transmission instead of stick-shift.)
Determining the root cause of performance issues is a critical task for Operations. In this webinar, we'll show you the tools and techniques for diagnosing and tuning the performance of your MongoDB deployment. Whether you're running into problems or just want to optimize your performance, these skills will be useful.
Cloudera Data Impact Awards 2021 - Finalists - Cloudera, Inc.
This annual program recognizes organizations that are moving swiftly toward the future and building innovative solutions by making what was impossible yesterday possible today.
The winning organizations' implementations demonstrate outstanding achievements in fulfilling their mission, technical advancement, and overall impact.
The 2021 Data Impact Awards recognize organizations' achievements with the Cloudera Data Platform in seven categories:
Data Lifecycle Connection
Data for Enterprise AI
Cloud Innovation
Security & Governance Leadership
People First
Data for Good
Industry Transformation
2020 Cloudera Data Impact Awards Finalists - Cloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most-cutting edge data projects and represent innovation and leadership in their respective industries.
Machine Learning with Limited Labeled Data 4/3/19 - Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 - Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 - Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers such as -
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 - Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 - Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19 - Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 - Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18 - Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
Extending Cloudera SDX beyond the Platform - Cloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond it. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18 - Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360 - Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18 - Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18 - Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Influence of Marketing Strategy and Market Competition on Business Plan
How to use Impala query plan and profile to fix performance issues
1. How to use Impala query plan and profile to
fix performance issues
Juan Yu
Impala Field Engineer, Cloudera
2. Profiles?! Explain plans!? Arggghh…
§ For the end user, understanding Impala performance is like…
- Lots of commonality between requests, e.g.:
- What's the bottleneck for this query?
- Why is this run fast but that run slow?
- How can I tune this query to improve its performance?
3. Agenda
- What are query plan and profile
- What kind of issues query plan and profile can help you solve
- Structure of query plan and profile
- Basic troubleshooting
- Advanced query tuning and troubleshooting
4. Why did the following queries take different amounts of time?
SELECT AVG(ss_list_price) FROM
store_sales WHERE ss_sold_date_sk
BETWEEN 2451959 AND 2451989;
Fetched 1 row(s) in 0.28s
SELECT AVG(ss_list_price) FROM
store_sales;
Fetched 1 row(s) in 3.60s
Partition pruning reduces scan work and intermediate results.
5. Where to get Query plan and profile
- Cloudera Manager Query history page
- Impala webUI queries page
- Impala-shell
- Profile examples: https://github.com/yjwater/strata-sj
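From impala-shell, the plan and profile can also be pulled interactively; a minimal session sketch (EXPLAIN, SET EXPLAIN_LEVEL, and PROFILE are standard impala-shell commands, the query itself is just illustrative):

```sql
-- Inspect the plan before running anything (EXPLAIN_LEVEL 0-3 controls detail)
SET EXPLAIN_LEVEL=3;
EXPLAIN SELECT AVG(ss_list_price) FROM store_sales;

-- Run the query, then dump the full runtime profile of the last query
SELECT AVG(ss_list_price) FROM store_sales;
PROFILE;
```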
6. Query Planning: Goals
- Lower execution cost to get better performance
- Reduce unnecessary work as much as possible via partition pruning, predicate
pushdown, runtime filters, etc.
- Maximize scan locality using DataNode (DN) block metadata.
- Minimize data movement (optimize join order, choose the better join strategy)
- Parallelize tasks and run them on many nodes to shorten execution time.
7. Query profile - Execution Results
- Provides metrics and counters that show
- how execution is progressing.
- how much resource is being used.
- how long each piece takes.
- whether there is any bottleneck and/or abnormal behavior.
8. What issues query plan and profile can help you solve?
§ Plan:
- Missing stats
- Partition pruning
- Predicate pushdown
- Join order
- Join strategy
- Parallelism
§ Profile:
- Identify Bottleneck
- Runtime filter effectiveness
- Memory usage/Spill to disk
- Network slowness
- Skew
- Client side issues
- Metadata loading
9. Flow of a SQL Query
Client → SQL Text → Impala Frontend (Java): Compile Query
→ Executable Plan → Impala Backend (C++): Execute Query
→ Query Results → Client (execution also produces the Query Profile)
12. Query Planning
- Single-Node Plan (SET NUM_NODES=1):
- Assigns predicates to the lowest possible plan node.
- Prunes irrelevant columns and partitions (if applicable).
- Optimizes join order.
- Determines effective runtime filters.
- Distributed Plan:
- Provides the list of best impalads to do the scan work.
- Picks an execution strategy for each join (broadcast vs hash-partitioned).
- Introduces exchange operators and groups nodes into plan fragments (the unit of work).
13. Compile query
[Plan tree: TopN ← Agg ← HashJoin(Scan t3) ← HashJoin(Scan t2) ← Scan t1]
SELECT t1.dept, SUM(t2.revenue)
FROM LargeHdfsTable t1
JOIN SmallHdfsTable t2 ON (t1.id1 = t2.id)
JOIN LargeHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online' AND t1.id > 10
GROUP BY t1.dept
HAVING COUNT(t2.revenue) > 10
ORDER BY revenue LIMIT 10
Hint: SET NUM_NODES=1; to explain the single-node plan
14. Single to Distributed Node Plan
[Figure: the single-node plan (TopN ← Agg ← HashJoins ← Scans) is transformed
into a distributed plan: the Agg is split into a Pre-Agg and a Merge-Agg, and
Exchange operators are inserted between them, below the joins, and below the
final TopN.]
This is what EXPLAIN shows by default.
15. Plan Fragmentation
[Figure: the operators of the distributed plan are grouped into PlanFragments,
the units of work shipped to executors; execution model subject to change.]
16. Query Plan Visualization
select c_name, c_custkey, o_orderkey, o_orderdate,
o_totalprice, sum(l_quantity)
from customer, orders, lineitem
where o_orderkey in (
select l_orderkey
from lineitem
group by l_orderkey
having sum(l_quantity) > 300 )
and c_custkey = o_custkey
and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate,
o_totalprice
order by o_totalprice desc, o_orderdate
limit 100
18. Plan and Profile structure
Query Summary
|- Basic info: state, type, user, statement, coordinator
|- Query plan
|- Execution summary
|- Timeline
Client side info
Execution details
|- Runtime Filter table
|- Coordinator Fragment
|- Instance
|- Operator node A
…
|- Average Fragment 3
|- Fragment instance 0
|- Operator node B
...
|- Fragment instance 1
...
|- Average Fragment 2
|- Average Fragment 0
explain_level = 3
19. Basic Query information
Query (id=5349daec57a4d786:9536a60900000000):
Summary:
Session ID: 5245f03b5b3c17ec:1dbd50ff83471f95
Session Type: HIVESERVER2
HiveServer2 Protocol Version: V6
Start Time: 2018-02-09 13:17:31.162274000
End Time: 2018-02-09 13:20:05.281900000
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 2.11.0-cdh5.14.0
User: REDACTED
Connected User: REDACTED
Delegated User:
Network Address: REDACTED:54129
Default Db: tpch_100_parquet
Sql Statement: select * from lineitems limit 100
Coordinator: REDACTED:22000
Query Options (set by configuration): ABORT_ON_ERROR=1,MEM_LIMIT=5658116096
Query Options (set by configuration and planner): ABORT_ON_ERROR=1,MEM_LIMIT=5658116096,MT_DOP=0
20. Look into the Timeline
Planner Timeline: 218.105ms
- Analysis finished: 159.486ms (159.486ms)
- Value transfer graph computed: 159.522ms (35.958us)
- Single node plan created: 162.606ms (3.083ms)
- Runtime filters computed: 162.869ms (262.984us)
- Distributed plan created: 162.908ms (39.628us)
- Lineage info computed: 163.006ms (97.845us)
- Planning finished: 218.105ms (55.098ms)
Query Timeline: 2m34s
- Query submitted: 12.693ms (12.693ms)
- Planning finished: 350.339ms (337.646ms)
- Submit for admission: 422.437ms (72.097ms)
- Completed admission: 433.900ms (11.462ms)
- Ready to start on 8 backends: 648.182ms (214.282ms)
- All 8 execution backends (47 fragment instances) started: 5s683ms (5s035ms)
- First dynamic filter received: 1m50s (1m44s)
- Rows available: 2m32s (41s835ms)
- First row fetched: 2m32s (447.280ms)
- Unregister query: 2m34s (1s232ms)
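The timeline entries above are cumulative, so the cost of each step is the delta from the previous entry (shown in parentheses in the profile). A small sketch of hunting for the dominant step programmatically; the parsing assumes the duration format shown above, and all helper names are made up:

```python
import re

# Duration strings in the profile look like "2m34s", "5s683ms", "12.693ms".
UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}

def parse_duration(text):
    """Convert a profile duration string into seconds."""
    return sum(float(v) * UNITS[u]
               for v, u in re.findall(r"(\d+(?:\.\d+)?)(ms|us|ns|h|m|s)", text))

# Cumulative timeline from the example above.
timeline = [
    ("Query submitted", "12.693ms"),
    ("Planning finished", "350.339ms"),
    ("Submit for admission", "422.437ms"),
    ("Completed admission", "433.900ms"),
    ("Ready to start on 8 backends", "648.182ms"),
    ("All 8 execution backends started", "5s683ms"),
    ("First dynamic filter received", "1m50s"),
    ("Rows available", "2m32s"),
    ("First row fetched", "2m32s"),
    ("Unregister query", "2m34s"),
]
cumulative = [parse_duration(t) for _, t in timeline]
deltas = [(name, cumulative[i] - (cumulative[i - 1] if i else 0.0))
          for i, (name, _) in enumerate(timeline)]
slowest = max(deltas, key=lambda d: d[1])
print(slowest)  # here: waiting for the first dynamic filter dominates
```

For this profile the biggest delta is the 1m44s spent before the first dynamic filter arrived, which is where tuning effort should go first.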
21. Client side
- Avoid large data extracts.
- It's usually not a good idea to dump lots of data out over JDBC/ODBC.
- For impala-shell, use the -B option when fetching lots of data.
[Screenshot annotations: total query time vs. idle time while the client isn't fetching]
27. Advanced Query Tuning
- Common issues
- Query succeeds but is very slow
- Query fails with an OOM error
- How to address them
- Examine the logic of the query and validate the Explain Plan
- Use Query Profile to identify bottlenecks.
31. Query Tuning Basics - Join
§ Validate join order and join strategy
- Optimal Join Order
- RHS should be smaller than LHS
- Minimize intermediate results
- Strategy – Broadcast vs. Partitioned
- Network costs (partition and send lhs+rhs or broadcast rhs)
- Memory costs
- RHS must fit in memory!
32. Join-Order Optimization
SELECT … FROM T1, T2, T3, T4
WHERE T1.id = T2.id AND T2.id = T3.id AND T3.id = T4.id;
[Figure: left-deep join tree — ((T1 HJ T2) HJ T3) HJ T4]
33. Use of Statistics during Plan Generation
- Table statistics: Number of rows per partition/table.
- Column statistics: Number of distinct values per column.
- Use of table and column statistics
- Estimate selectivity of predicates (esp. scan predicates like “month=10”).
- Estimate selectivity of joins → join cardinality (#rows)
- Heuristic FK/PK detection.
- Pick distributed join strategy: broadcast vs. partitioned
- Join inputs have very different sizes → broadcast the small side.
- Join inputs have roughly equal size → partitioned.
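The standard textbook estimates behind these bullets can be sketched as follows (a simplification of Impala's actual cost model; function names and sample numbers are illustrative, borrowing the TPCH tables used later in this deck):

```python
def equality_selectivity(ndv):
    # A scan predicate like "month = 10" keeps ~1/NDV of the rows,
    # assuming a uniform value distribution.
    return 1.0 / ndv

def join_cardinality(lhs_rows, rhs_rows, lhs_ndv, rhs_ndv):
    # Classic estimate: |L join R| ~= |L| * |R| / max(NDV(L.key), NDV(R.key)).
    return lhs_rows * rhs_rows / max(lhs_ndv, rhs_ndv)

# lineitem (600M rows) joined to orders (150M rows) on orderkey
# (NDV ~= 150M): every lineitem matches one order, estimate = 600M.
est = join_cardinality(600e6, 150e6, lhs_ndv=150e6, rhs_ndv=150e6)
```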
34. Join Execution Strategy
§ Broadcast Join: each node does a local scan of Table A; Table B is scanned
once and broadcast to every node, where the join runs.
§ Repartition Join: both tables are scanned and repartitioned across nodes
according to the join keys; each node joins its own partition.
35. Join Strategy – what's the cost
- Impala chooses the right strategy based on stats (collect stats!)
- Use the join strategy that minimizes data transfer
- Use the explain plan to see the data size
- Join hint: [shuffle] or [broadcast]

Join strategy    | Network traffic            | Memory usage (hash table)
Broadcast join   | RHS table size¹ × #nodes   | RHS table size¹ × #nodes
Partitioned join | LHS + RHS table size¹      | RHS table size¹

¹ Table size refers to the data flowing out of the child node (i.e. only the required column data, after filtering, counts).
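The cost table above can be turned into a tiny decision sketch (not Impala's actual planner code; all names are invented):

```python
def broadcast_costs(lhs_bytes, rhs_bytes, num_nodes):
    # RHS is shipped to (and hashed on) every node.
    return {"network": rhs_bytes * num_nodes,
            "memory": rhs_bytes * num_nodes}

def partitioned_costs(lhs_bytes, rhs_bytes, num_nodes):
    # Both sides cross the network once; only one copy of the RHS
    # hash table exists cluster-wide (spread across the nodes).
    return {"network": lhs_bytes + rhs_bytes,
            "memory": rhs_bytes}

def cheaper_strategy(lhs_bytes, rhs_bytes, num_nodes):
    # Mimic "minimize data transfer": compare the network costs.
    b = broadcast_costs(lhs_bytes, rhs_bytes, num_nodes)["network"]
    p = partitioned_costs(lhs_bytes, rhs_bytes, num_nodes)["network"]
    return "broadcast" if b <= p else "partitioned"

# A tiny RHS favors broadcast; comparable sizes favor partitioned:
small_rhs = cheaper_strategy(lhs_bytes=100e9, rhs_bytes=1e9, num_nodes=8)
big_rhs = cheaper_strategy(lhs_bytes=40e9, rhs_bytes=30e9, num_nodes=8)
```

Note how the break-even point shifts with cluster size: the same RHS that is cheap to broadcast on 8 nodes may be too expensive on 100, which is the point of the exercise on the next slide.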
36. Join - Exercise
§ TPCH 100GB data on 8-node cluster
- Table1: lineitem. cardinality=600M, row size ~ 260B
- Table2: orders. cardinality=150M, row size ~ 200B
- orderkey is bigint.
§ For the following queries, what's the optimal join order and join strategy?
- SELECT count(*) FROM lineitem JOIN orders ON l_orderkey = o_orderkey;
- SELECT lineitem.* FROM lineitem JOIN orders ON l_orderkey = o_orderkey;
- SELECT orders.* FROM lineitem JOIN orders ON l_orderkey = o_orderkey;
§ How about on a 100-node cluster?
37. Runtime Filter - Check how selective the join is
The hash join output is large (~2.8B rows), while the output of the upstream
join is much smaller (32M rows, a reduction to only ~1%). This indicates a
runtime filter can be helpful.
39. Why Runtime Filter doesn't work sometimes
The filter didn't take effect because the scan was short and finished before
the filter arrived.
HDFS_SCAN_NODE (id=3):
Filter 1 (1.00 MB):
- Rows processed: 16.38K (16383)
- Rows rejected: 0 (0)
- Rows total: 16.38K (16384)
40. How to tune it?
- Increase RUNTIME_FILTER_WAIT_TIME_MS to 5000 ms to let Scan Node 03
wait longer for the filter.
HDFS_SCAN_NODE (id=3)
Filter 1 (1.00 MB)
- InactiveTotalTime: 0
- Rows processed: 73049
- Rows rejected: 73018
- Rows total: 73049
- TotalTime: 0
- If the cluster is relatively busy, consider increasing the wait time too so that
complicated queries do not miss opportunities for optimization.
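In impala-shell this is set as a per-query option (RUNTIME_FILTER_WAIT_TIME_MS is a real Impala query option; the right value depends on cluster load):

```sql
-- Let scan nodes wait up to 5 seconds for runtime filters to arrive
SET RUNTIME_FILTER_WAIT_TIME_MS=5000;
```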
41. Runtime filter profile examples
- Profile walkthrough
- Effective vs Non-Effective
- Local vs Global
- Filter Wait Time
- Filter Memory Usage
44. Memory – Insert strategy
§ Non-shuffle: each node writes every partition it holds data for.
- Node A: partition data 1, 3, 5, 6 → 4 writers
- Node B: partition data 2, 3, 6, 7 → 4 writers
- Node C: partition data 3, 4, 8 → 3 writers
- Node D: partition data 1, 4, 7, 8 → 4 writers
§ Shuffle: data is repartitioned first, so each node writes only its own partitions.
- Node A: partition data 1, 5 → 2 writers
- Node B: partition data 2, 6 → 2 writers
- Node C: partition data 3, 7 → 2 writers
- Node D: partition data 4, 8 → 2 writers
45. Memory – Group By
§ SELECT product, count(1), sold_date FROM sales_history GROUP BY product,
sold_date;
§ We have one million distinct products, and the table contains data for the past 5 years.
Total groups = 1M * 5 * 365 = ~1.8B
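The arithmetic above, spelled out (the numbers are the slide's own):

```python
products = 1_000_000     # distinct products
days = 5 * 365           # distinct sold_date values over 5 years
total_groups = products * days
# 1_825_000_000 groups, i.e. the slide's "~1.8B"
```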
46. Memory Usage – Estimation
§ EXPLAIN's memory estimation issues
- Can be way off – much higher or much lower.
- The group-by/distinct estimate can be particularly off when there is a large number of
group-by columns (independence assumption):
- Memory estimate = NDV of group-by column 1 * NDV of group-by column 2 * … * NDV of
group-by column n
- Ignore EXPLAIN's estimate if it's too high!
§ Do your own estimate for group by
- GROUP BY memory usage = (total number of groups * size of each row) + (total
number of groups * size of each row) / number of nodes
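Both estimates can be written down directly; a sketch under the slide's own assumptions (helper names and the 50-byte row size are invented):

```python
from math import prod

def explain_style_groups(ndvs):
    # EXPLAIN's independence assumption: multiply the per-column NDVs.
    return prod(ndvs)

def groupby_memory_bytes(total_groups, row_bytes, num_nodes):
    # Slide's rule of thumb: (groups * row size) plus
    # (groups * row size) / number of nodes.
    return total_groups * row_bytes + total_groups * row_bytes / num_nodes

# GROUP BY product, sold_date with 1M products over 5 years:
groups = explain_style_groups([1_000_000, 5 * 365])   # 1.825B groups
mem = groupby_memory_bytes(groups, row_bytes=50, num_nodes=8)
```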
47. Memory Usage – Hitting Mem-limit
- Gigantic group by
- The total number of distinct groups is huge, such as group by userid, phone number.
- For a simple query, you can try this advanced workaround – per-partition agg
- Requires the partition key to be part of the group by
SELECT part_key, col1, col2, …agg(..) FROM tbl WHERE part_key in
(1,2,3)
UNION ALL
SELECT part_key, col1, col2, …agg(..) FROM tbl WHERE part_key in
(4,5,6)
48. Memory Usage – Hitting Mem-limit
- Big-table joining big-table
- Big-table (after decompression, filtering, and projection) is a table that is bigger than
total cluster memory size.
- For a simple query, you can try this advanced workaround – per-partition join
- Requires the partition key to be part of the join key
SELECT … FROM BigTbl_A a JOIN BigTbl_B b ON a.part_key = b.part_key
WHERE a.part_key IN (1,2,3)
UNION ALL
SELECT … FROM BigTbl_A a JOIN BigTbl_B b ON a.part_key = b.part_key
WHERE a.part_key IN (4,5,6)
49. Query Execution – Typical Speed
- In a typical query, we observed the following processing rates:
- Scan node: 8~10M rows per sec per core
- Join node: ~10M rows per sec per core
- Agg node: ~5M rows per sec per core
- Sort node: ~17MB per sec per core
- Row materialization in the coordinator should be tiny
- Parquet writer: 1~5MB per sec per core
- If your processing rate is much lower than this, it's worth a deeper look
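Those rates give a quick sanity check: compare an operator's measured time against what the typical rate predicts (a rough sketch; the helper names and the 3x slack factor are made up):

```python
TYPICAL_RATES = {
    "scan": 9e6,           # rows/sec/core (slide: 8~10M)
    "join": 10e6,          # rows/sec/core
    "agg": 5e6,            # rows/sec/core
    "sort": 17e6,          # bytes/sec/core (slide: ~17MB)
    "parquet_write": 3e6,  # bytes/sec/core (slide: 1~5MB)
}

def expected_seconds(amount, operator, cores):
    # amount is rows (or bytes for sort/write) processed by the operator.
    return amount / (TYPICAL_RATES[operator] * cores)

def worth_a_deeper_look(actual_seconds, amount, operator, cores, slack=3.0):
    # Flag operators running far slower than the typical rate suggests.
    return actual_seconds > slack * expected_seconds(amount, operator, cores)

# 600M rows through a join on 64 total cores should take ~1s:
ok = worth_a_deeper_look(1.0, 600e6, "join", cores=64)     # False
slow = worth_a_deeper_look(60.0, 600e6, "join", cores=64)  # True
```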
50. Performance analysis – Example 3
~8M rows per second per core, this is
normal aggregation speed.
52. Performance analysis – Example 5
Possible root causes:
1. The host has a slow/bad disk
2. The host has trouble communicating with the NameNode (NN)
3. Hotspotting: the host does more IO (not just for this single query, but
overall) than the others
4. …
53. Performance analysis – Example 6
Only two Parquet files to scan – not enough parallelism. Try using a
smaller Parquet block size for this table to increase parallelism.
54. Performance analysis – Example 7
SELECT colE, colU, colR, ... FROM temp WHERE year = 2017 AND month = 03
GROUP BY colE, colU, colR;
Slow network
55. - Impala uses Thrift to transmit compressed data between plan fragments
- Exchange operator is the receiver of the data
- DataStreamSender transmits the output rows of a plan fragment
DataStreamSender (dst_id=3)
- BytesSent: 402 B (402)
- InactiveTotalTime: 0ns (0)
- NetworkThroughput(*): 87.3 KiB/s (89436)
- OverallThroughput: 283.6 KiB/s (290417)
- PeakMemoryUsage: 120.6 KiB (123488)
- RowsReturned: 24 (24)
- SerializeBatchTime: 0ns (0)
- TotalTime: 888.97us (888969)
- TransmitDataRPCTime: 222.24us (222241)
- UncompressedRowBatchSize: 504 B (504)
EXCHANGE_NODE (id=3)
- ConvertRowBatchTime: 0ns (0)
- InactiveTotalTime: 0ns (0)
- RowsReturned: 24 (24)
- RowsReturnedRate: 7 per second (7)
- TotalTime: 3.02s (3016602849)
DataStreamReceiver
- BytesReceived: 402 B (402)
- DeserializeRowBatchTimer: 0ns (0)
- FirstBatchArrivalWaitTime: 1.83s (1830275553)
- InactiveTotalTime: 0ns (0)
- SendersBlockedTimer: 0ns (0)
- SendersBlockedTotalTimer(*): 0ns (0)
Exchange and DataStreamSender
56. Network slowness
§ Exchange performance issues
- Too much data across network:
- Check the query on data size reduction.
- Check join order and join strategy; Wrong join order/strategy can have a serious effect
on network!
- For agg, check the number of groups – affect memory too!
- Remove unused columns.
- Keep in mind that network is typically at most 10Gbit
§ Cross-rack network slowness
57. More examples/Exercises
- Parquet min/max filter, dictionary filter
- Spill to disk
- CPU saturated
- Scanner thread
- Detect small files
- Sink (DML queries)
58. Recap: Query Tuning
- Performance: Examine the logic of the query and validate the Explain Plan
Validate join order and join strategy.
Validate partition pruning and/or predicate pushdown.
Validate plan parallelism.
- Memory usage: Top causes (in order) of hitting mem-limit:
Lack of statistics
Lots of joins within a single query
Big-table joining big-table
Gigantic group by (e.g., distinct)
Parquet Writer
59. More resources
§ Impala cookbook https://blog.cloudera.com/blog/2017/02/latest-impala-cookbook/
§ Impala perf guideline
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_cookbook.html
§ Perf optimization
https://conferences.oreilly.com/strata/strata-ca-2017/public/schedule/detail/55783