Flink's stateful stream processing engine presents a huge variety of optional features and configuration choices to the user. Figuring out the "optimal" choices for any production environment and use-case can therefore often be challenging. In this talk, we will explore and discuss the universe of Flink configuration with respect to robustness and performance.
We will start with a closer look under the hood, at core data structures and algorithms, to build the foundation for understanding the impact of tuning parameters and the costs-benefit-tradeoffs that come with certain features and options. In particular, we will focus on state backend choices (Heap vs RocksDB), tuning checkpointing (incremental checkpoints, ...) and recovery (local recovery), file systems, TTL state, and considerations for the network stack. This also includes a discussion about estimating memory requirements and memory partitioning.
Stefan Richter
Flink Forward 2018
Berlin
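As a concrete illustration of the tuning options in the abstract above, here is a minimal PyFlink sketch (an assumption: PyFlink 1.15+, where get_execution_environment accepts a Configuration) that enables the RocksDB state backend with incremental checkpoints and local recovery. The paths and interval are illustrative starting points, not recommendations.

```python
from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment

# All keys below are standard Flink options; values are illustrative.
conf = Configuration()
conf.set_string("state.backend", "rocksdb")               # vs. "hashmap" for on-heap state
conf.set_string("state.backend.incremental", "true")      # ship only new SST files per checkpoint
conf.set_string("state.backend.local-recovery", "true")   # recover from a local copy when possible
conf.set_string("state.checkpoints.dir", "hdfs:///flink/checkpoints")  # hypothetical path

env = StreamExecutionEnvironment.get_execution_environment(conf)
env.enable_checkpointing(60_000)  # checkpoint every 60 s
```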
Containerized Stream Engine to Build Modern Delta Lake - Databricks
As days go by, everything is changing: your business, your analytics platform, and your data. Deriving real-time insights from this humongous volume of data is key to survival. A robust solution lets you operate at the speed of change.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
Hive on Spark is Blazing Fast, or Is It? - Hortonworks
This presentation was given at the Strata + Hadoop World, 2015 in San Jose.
Apache Hive is the most popular and most widely used SQL solution for Hadoop. To keep pace with Hadoop’s increasingly vital role in the Enterprise, Hive has transformed from a batch-only, high-latency system into a modern SQL engine capable of both batch and interactive queries over large datasets. Hive’s momentum is accelerating: With Spark integration and a shift to in-memory processing on the horizon, Hive continues to expand the boundaries of Big Data.
In this talk, the speakers examined Hive performance: past, present, and future. In particular, they looked at Hive's origins as a petabyte-scale SQL engine.
Through some numbers and graphs, they showed how Hive became 100x faster by moving beyond MapReduce, by vectorizing execution and by introducing a cost-based optimizer.
They detailed and discussed the challenges of scalable SQL on Hadoop.
They looked into Hive's sub-second future, powered by LLAP and Hive on Spark, and showed just how fast Hive on Spark really is.
Transformation Processing Smackdown: Spark vs Hive vs Pig - Lester Martin
Compare and contrast using Spark, Hive and Pig for transformation processing requirements. Video of my "talk" at https://www.youtube.com/watch?v=36_MayK5eU4.
Conference page for the talk is at https://devnexus.com/s/devnexus2017/presentations/17533.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud - Gluent
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
In the end, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
ApacheCon 2020 - Flink SQL in 2020: Time to show off! - Timo Walther
Four years ago, the Apache Flink community started adding SQL support to ease and unify the processing of static and streaming data. Today, Flink runs business critical batch and streaming SQL queries at Alibaba, Huawei, Lyft, Uber, Yelp, and many others. Although the community made significant progress in the past years, there are still many things on the roadmap and the development is still speeding up. In the past months, several significant improvements and extensions were added including support for DDL statements, refactorings of the type system and the catalog interface, as well as Apache Hive integration. Since it is difficult to follow all development efforts that happen around Flink SQL and its ecosystem, it is time for an update. This session will focus on a comprehensive demo of what is possible with Flink SQL in 2020. Based on a realistic use case scenario, we'll show how to define tables which are backed by various storage systems and how to solve common tasks with streaming SQL queries. We will demonstrate Flink's Hive integration and show how to define and use user-defined functions. We'll close the session with an outlook of upcoming features.
A TPC Benchmark of Hive LLAP and Comparison with Presto - Yu Liu
A TPC-H/DS benchmark of both Hive LLAP (Low Latency Analytical Processing) and Presto, comparing two popular big data query engines.
The results show significant advantages for Hive LLAP in performance and durability.
Apache CarbonData & Spark Meetup
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter lookup on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments it supports queries on a single table with 3 PB of data (more than 5 trillion records) with a response time of less than 3 seconds!
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle - Databricks
Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating shuffles in join or group-by-aggregate scenarios. This is ideal for the many write-once, read-many datasets at Bytedance.
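To make the mechanism concrete, here is a minimal PySpark sketch (the table names, data, and bucket count are hypothetical) showing how bucketing both sides of a join on the join key lets the planner drop the shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-sketch").getOrCreate()

# Toy stand-ins for two large tables that are frequently joined on user_id.
orders = spark.createDataFrame([(1, 9.99), (2, 5.00)], ["user_id", "amount"])
users = spark.createDataFrame([(1, "alice"), (2, "bob")], ["user_id", "name"])

# Bucket both sides by the join key with the same bucket count; bucketBy
# requires saveAsTable, since bucket metadata lives in the session catalog.
orders.write.bucketBy(8, "user_id").sortBy("user_id").mode("overwrite").saveAsTable("orders_b")
users.write.bucketBy(8, "user_id").sortBy("user_id").mode("overwrite").saveAsTable("users_b")

# With matching buckets, the planner can sort-merge join without an Exchange
# (shuffle) on either side; verify by inspecting the plan below.
spark.table("orders_b").join(spark.table("users_b"), "user_id").explain()
```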
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic... - Chicago Hadoop Users Group
John Leach, Co-Founder and CTO of Splice Machine, with 15+ years of software development and machine learning experience, will discuss how to use HBase co-processors to build an ANSI SQL-99 database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation, and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that were traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions.
To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/
The state of Hive and Spark in the Cloud (July 2017) - Nicolas Poggi
Originally presented at the BDOOP and Spark Barcelona meetup groups: http://meetu.ps/3bwCTM
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements on performance and the release of v2, making it challenging to keep up-to-date production services both on-premises and in the cloud for compatibility and stability. The talk compares:
• The performance of both v1 and v2 for Spark and Hive
• PaaS cloud services: Azure HDInsight, Amazon Web Services EMR, Google Cloud Dataproc
• Out-of-the-box support for Spark and Hive versions from providers
• PaaS reliability, scalability, and price-performance of the solutions
The comparison uses BigBench, the new Big Data benchmark standard. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal for stressing Spark libraries (SparkSQL, DataFrames, MLlib, etc.).
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub... - DataWorks Summit
TMW Systems, a Trimble company, provides industry-leading transportation management software. 3PLs, brokers, distribution and supply operations, dedicated and private fleets, commercial carriers, and energy service providers rely on our transportation management systems, our fleet maintenance management software, or our routing and scheduling software to make them more efficient and profitable. Billions of data points exist in the trucking industry, and we at TMW Systems are pioneers of tracking millions of trucks, freights, and assets.
The architecture team at TMW leverages NiFi and SAM to deliver this immense volume of data in real time. In this session, you will get a thorough understanding of all the streaming components. We have utilized Apache Kafka, Apache NiFi, and Streaming Analytics Manager to build our real-time data pipeline. We will also discuss real-time event processing using SAM and Schema Registry. Lastly, we will show custom processors in NiFi and SAM that helped us with complex event processing.
Speaker
Krishna Potluri, TMW Systems, A Trimble Company, Big Data Architect
Donnie Wheat, Trimble, Senior Big Data Architect
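To illustrate the ingestion step of such a pipeline, here is a minimal producer sketch. It assumes the kafka-python client, and the broker address, topic name, and event schema are hypothetical; the talk does not specify them.

```python
import json
import time

from kafka import KafkaProducer  # assumption: kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A hypothetical truck-position event entering the pipeline ahead of NiFi/SAM.
producer.send("truck-positions",
              {"truck_id": 42, "lat": 41.5, "lon": -81.7, "ts": time.time()})
producer.flush()
```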
SF Big Analytics 2020-07-28
An anecdotal history of the data lake and various popular implementation frameworks: why certain tradeoffs were made to solve problems such as cloud storage, incremental processing, streaming and batch unification, mutable tables, and more.
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa... - StreamNative
Apache Hudi is an open data lake platform, designed around the streaming data model. At its core, Hudi provides transactions, upserts, and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services, which can clean, compact, cluster, and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-the-box support for streaming data from event systems into lake storage in near real-time.
In this talk, we will walk through an end-to-end use case for change data capture from a relational database, starting with capturing changes using the Pulsar CDC connector and then demonstrating how the Hudi DeltaStreamer tool can apply these changes to a table on the data lake. We will discuss various tips for operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects, including a native Hudi/Pulsar connector and Hudi tiered storage.
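As a rough sketch of the upsert step described above, the following PySpark snippet writes a change batch into a Hudi table using standard Hudi datasource options. The table name, keys, and path are hypothetical, and it assumes the hudi-spark bundle is on the classpath; DeltaStreamer wraps a similar write around a streaming source.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hudi-upsert")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# A toy batch of captured changes (id is the record key, ts the change time).
changes = spark.createDataFrame(
    [(1, "alice", 1700000000), (2, "bob", 1700000001)], ["id", "name", "ts"])

(changes.write.format("hudi")
    .option("hoodie.table.name", "users_cdc")
    .option("hoodie.datasource.write.recordkey.field", "id")   # dedup key
    .option("hoodie.datasource.write.precombine.field", "ts")  # latest change wins
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-lake/users_cdc"))  # hypothetical base path
```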
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... - Spark Summit
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
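A minimal PySpark sketch of two Parquet benefits touched on above, column pruning and data skipping via row-group statistics; the dataset and path are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

# Write a toy columnar dataset.
df = spark.range(1_000_000).withColumnRenamed("id", "event_id")
df.write.mode("overwrite").parquet("/tmp/events")

# The columnar layout lets Spark read only the columns it needs, and
# row-group statistics let it skip data that cannot match the filter;
# both show up in the physical plan printed by explain().
(spark.read.parquet("/tmp/events")
    .where("event_id < 100")
    .select("event_id")
    .explain())
```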
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: A short introductory lecture on Apache Spark covering core modules (SQL, Streaming, MLlib, GraphX) followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, Apache Ambari, and Apache Zeppelin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data, and then issue some SQL queries.
Lab pre-requisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16GB of RAM available (Note that the sandbox is over 10GB in size so we recommend downloading it before the crash course).
Speakers: Robert Hryniewicz
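A minimal PySpark sketch of the lab flow in the objective above; the file path and schema are hypothetical:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets saveAsTable create Hive tables.
spark = SparkSession.builder.appName("spark-lab").enableHiveSupport().getOrCreate()

# Land a CSV in HDFS-backed storage and expose it as an ORC Hive table.
raw = spark.read.option("header", True).csv("hdfs:///data/trips.csv")
raw.write.mode("overwrite").format("orc").saveAsTable("trips")

# Explore and query with Spark SQL.
spark.sql("SELECT COUNT(*) AS n FROM trips").show()
```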
Webinar: Deep Dive on Apache Flink State - Seth Wiesman - Ververica
Apache Flink is a world-class stateful stream processor that presents a huge variety of optional features and configuration choices to the user. Determining the optimal choices for any production environment and use case can be challenging. In this talk, we will explore and discuss the universe of Flink configuration with respect to state and state backends.
We will start with a closer look under the hood, at core data structures and algorithms, to build the foundation for understanding the impact of tuning parameters and the costs-benefit-tradeoffs that come with certain features and options. In particular, we will focus on state backend choices (Heap vs RocksDB), tuning checkpointing (incremental checkpoints, ...) and recovery (local recovery), serializers and Apache Flink's new state migration capabilities.
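For readers unfamiliar with Flink state, here is a minimal PyFlink sketch (an assumption: PyFlink 1.13+) of keyed ValueState, the kind of state that the backends, serializers, and migration features discussed above manage:

```python
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction
from pyflink.datastream.state import ValueStateDescriptor

class RunningCount(KeyedProcessFunction):
    """Counts events per key using keyed ValueState; the configured state
    backend stores it on heap or in RocksDB and checkpoints it for recovery."""

    def open(self, runtime_context):
        self.count = runtime_context.get_state(
            ValueStateDescriptor("count", Types.LONG()))

    def process_element(self, value, ctx):
        n = (self.count.value() or 0) + 1
        self.count.update(n)
        yield value[0], n

env = StreamExecutionEnvironment.get_execution_environment()
env.from_collection([("a", 1), ("a", 2), ("b", 1)]) \
   .key_by(lambda e: e[0]) \
   .process(RunningCount(), Types.TUPLE([Types.STRING(), Types.LONG()])) \
   .print()
env.execute("keyed-state-sketch")
```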
How YugaByte DB Implements Distributed PostgreSQL - Yugabyte
Building applications on PostgreSQL that require automatic data sharding and replication, fault tolerance, distributed transactions and geographic data distribution has been hard. In this 3 hour workshop, we will look at how to do this using a real-world example running on top of YugaByte DB, a distributed database that is fully wire-compatible with PostgreSQL and NoSQL APIs (Apache Cassandra and Redis). We will look at the design and architecture of YugaByte DB and how it reuses the PostgreSQL codebase to achieve full API compatibility. YugaByte DB support for PostgreSQL includes most data types, queries, stored procedures, etc. We will also take a look at how to build applications that are planet scale (requiring geographic distribution of data) and how to run them in cloud-native environments (for example, Kubernetes, hybrid or multi-cloud deployments).
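Because YugaByte DB is wire-compatible with PostgreSQL, standard PostgreSQL drivers work unchanged. A minimal sketch using psycopg2 against a local node (YSQL's default port is 5433; the credentials are the defaults of a fresh local install):

```python
import psycopg2  # YugabyteDB's YSQL API speaks the PostgreSQL wire protocol

conn = psycopg2.connect(host="127.0.0.1", port=5433, dbname="yugabyte",
                        user="yugabyte", password="yugabyte")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS users (id BIGSERIAL PRIMARY KEY, name TEXT)")
    cur.execute("INSERT INTO users (name) VALUES (%s) RETURNING id", ("alice",))
    # The row is transparently sharded and replicated by the database.
    print(cur.fetchone())
```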
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Speakers:
Karan Desai - Solutions Architect, AWS
Neel Mitra - Solutions Architect, AWS
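Since Redshift speaks the PostgreSQL wire protocol (on port 5439), any PostgreSQL driver can run such analytic queries. A minimal sketch with psycopg2; the cluster endpoint, credentials, and table are hypothetical:

```python
import os

import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123.us-west-2.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439, dbname="dev", user="awsuser",
    password=os.environ["REDSHIFT_PASSWORD"],
)
with conn, conn.cursor() as cur:
    # Columnar storage means the scan reads only the referenced columns;
    # MPP execution spreads the aggregation across the cluster's slices.
    cur.execute("""
        SELECT event_date, COUNT(*) AS events
        FROM clickstream
        GROUP BY event_date
        ORDER BY event_date DESC
        LIMIT 7
    """)
    for row in cur.fetchall():
        print(row)
```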
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Level: Beginner
Speakers:
Jay Formosa - Solutions Architect, AWS
Asser Moustafa - Data Warehouse Specialist Solutions Architect, AWS
Data Warehousing with Amazon Redshift: Data Analytics Week SF - Amazon Web Services
Data Analytics Week at the San Francisco Loft
Data Warehousing with Amazon Redshift
Asser Moustafa - Data Warehouse Specialist Solutions Architect, AWS
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Speakers:
Jay Formosa - Solutions Architect, AWS
Asser Moustafa - Data Warehouse Specialist Solutions Architect, AWS
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft - Amazon Web Services
Data Warehousing with Amazon Redshift: Data Analytics Week at the San Francisco Loft
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Level: Beginner
Speakers:
Jay Formosa - Solutions Architect, AWS
Sudhir Gupta - Partner Solutions Architect, Redshift Specialist, AWS
Healthcare Claim Reimbursement using Apache Spark - Databricks
Optum Inc. helps hospitals accurately calculate claim reimbursements and detect underpayments from insurance companies. Optum receives millions of claims per day, which need to be evaluated in less than 8 hours, with the results sent back to the hospitals for revenue-recovery purposes.
by Taz Sayed, Sr Technical Account Manager AWS and Marie Yap, Enterprise Solutions Architect AWS
AWS Data & Analytics Week is an opportunity to learn about Amazon's family of managed analytics services. These services provide easy, scalable, reliable, and cost-effective ways to manage your data in the cloud. We explain the fundamentals and take a technical deep dive into Amazon Redshift data warehouse; Data Lake services including Amazon EMR, Amazon Athena, & Amazon Redshift Spectrum; Log Analytics with Amazon Elasticsearch Service; and data preparation and placement services with AWS Glue and Amazon Kinesis. You'll learn how to get started, how to support applications, and how to scale.
A brief overview of ZGC delivered during the Japan Java User Group Night Seminar, March 28th, 2019.
In this short introduction we cover:
- introduction to ZGC and its goals
- brief overview of the implementation
- performance results
- usage
- future roadmap
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum - Amazon Web Services
We will walk through how to migrate and modernise your legacy data warehouse, moving from an on-premises server or application, to the cloud. You will learn how to easily migrate your data by leveraging serverless ETL, data cataloging as well as the techniques needed to successfully modernise your data warehouse, reduce costs, and increase performance and scalability.
Speaker: Paul Macey, Specialist Solutions Architect, AWS
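As a rough illustration of kicking off the serverless ETL and cataloging steps mentioned above, here is a boto3 sketch; the region, crawler, and job names are hypothetical:

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# Crawl the landing zone to (re)build the Data Catalog, then run the ETL job
# that transforms the legacy data and loads it into the warehouse staging area.
glue.start_crawler(Name="legacy-dw-landing-crawler")
run = glue.start_job_run(JobName="legacy-dw-migration-etl")
print("started Glue job run:", run["JobRunId"])
```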
Loading Data into Redshift: Data Analytics Week at the SF Loft - Amazon Web Services
Loading Data into Redshift: Data Analytics Week at the San Francisco Loft
How do you get data from your sources into your Redshift data warehouse? We'll show how to use AWS Glue and Amazon Kinesis Firehose to make it easy to automate the work to get data loaded.
Level: Intermediate
Speakers:
Asser Moustafa - Data Warehouse Specialist Solutions Architect, AWS
Vikram Gangulavoipalyam - Enterprise Solutions Architect, AWS
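Under the hood, loads into Redshift typically end in a COPY from S3, and Kinesis Firehose issues an equivalent command on your behalf at delivery time. A minimal psycopg2 sketch of a manual load; the endpoint, table, bucket, and IAM role are hypothetical:

```python
import os

import psycopg2  # any PostgreSQL driver works against Redshift

conn = psycopg2.connect(
    host="examplecluster.abc123.us-west-2.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439, dbname="dev", user="awsuser",
    password=os.environ["REDSHIFT_PASSWORD"],
)
with conn, conn.cursor() as cur:
    # COPY pulls the files from S3 in parallel across the cluster's slices.
    cur.execute("""
        COPY events
        FROM 's3://my-bucket/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto'
    """)
```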
by Andre Hass, Specialist Technical Account Manager, AWS
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Data Analytics Week at the San Francisco Loft
Loading Data Into Redshift
How do you get data from your sources into your Redshift data warehouse? We'll show how to use AWS Glue and Amazon Kinesis Firehose to make it easy to automate the work to get data loaded.
Speakers:
Jay Formosa - Solutions Architect, AWS
Asser Moustafa - Data Warehouse Specialist Solutions Architect, AWS
by Manish Mohite, Solutions Architect, AWS
How do you get data from your sources into your Redshift data warehouse? We'll show how to use AWS Glue and Amazon Kinesis Firehose to make it easy to automate the work to get data loaded.
by Marie Yap, Enterprise Solutions Architect, AWS
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Building a SIMD Supported Vectorized Native Engine for Spark SQL - Databricks
Spark SQL works very well with structured, row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster, and whole-stage code generation improves performance through Java JIT-compiled code. However, the Java JIT is usually not very good at utilizing the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as an LLVM-based SQL engine, Gandiva. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
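A small PyArrow illustration of the columnar, vectorized execution style the talk builds on: the comparison, filter, and sum below each run as whole-column native kernels rather than a per-row interpreted loop (this uses Arrow's Python bindings, not the Gandiva engine itself):

```python
import pyarrow as pa
import pyarrow.compute as pc  # Arrow's compute kernels are native C++ code

# One million values held in a contiguous columnar buffer.
values = pa.array(range(1_000_000))

# Each call processes the whole column in one vectorized pass.
mask = pc.greater(values, 500_000)
print(pc.sum(pc.filter(values, mask)))
```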
Similar to Tuning Flink For Robustness And Performance
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G... - Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Enhancing Research Orchestration Capabilities at ORNL.pdf - Globus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Launch Your Streaming Platforms in Minutes - Roshan Dwivedi
The claim of launching a streaming platform in minutes might be a bit of an exaggeration, but there are services that can significantly streamline the process. Here's a breakdown:
Pros of Speedy Streaming Platform Launch Services:
No coding required: These services often use drag-and-drop interfaces or pre-built templates, eliminating the need for programming knowledge.
Faster setup: Compared to building from scratch, these platforms can get you up and running much quicker.
All-in-one solutions: Many services offer features like content management systems (CMS), video players, and monetization tools, reducing the need for multiple integrations.
Things to Consider:
Limited customization: These platforms may offer less flexibility in design and functionality compared to custom-built solutions.
Scalability: As your audience grows, you might need to upgrade to a more robust platform or encounter limitations with the "quick launch" option.
Features: Carefully evaluate which features are included and if they meet your specific needs (e.g., live streaming, subscription options).
Examples of Services for Launching Streaming Platforms:
Muvi (muvi.com)
Uscreen (uscreen.tv)
Alternatives to Consider:
Existing Streaming platforms: Platforms like YouTube or Twitch might be suitable for basic streaming needs, though monetization options might be limited.
Custom Development: While more time-consuming, custom development offers the most control and flexibility for your platform.
Overall, launching a streaming platform in minutes might not be entirely realistic, but these services can significantly speed up the process compared to building from scratch. Carefully consider your needs and budget when choosing the best option for you.
How Recreation Management Software Can Streamline Your Operations.pptx - wottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Globus Compute with IRI Workflows - GlobusWorld 2024 - Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
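For context, here is a minimal sketch of submitting a task through the Globus Compute SDK's Executor interface, the kind of managed task execution described above; the endpoint UUID and task function are hypothetical:

```python
from globus_compute_sdk import Executor  # assumption: globus-compute-sdk package

def simulate(shot_id: int) -> int:
    # Placeholder for an analysis task; real work runs where the endpoint lives.
    return shot_id * 2

# Tasks are routed to whatever HPC resource hosts this (hypothetical) endpoint.
with Executor(endpoint_id="11111111-2222-3333-4444-555555555555") as ex:
    future = ex.submit(simulate, 21)
    print(future.result())
```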
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
GraphSummit Paris - The art of the possible with Graph Technology - Neo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv... - Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Large Language Models and the End of Programming - Matt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
In the ever-evolving landscape of technology, enterprise software development is undergoing a significant transformation. Traditional coding methods are being challenged by innovative no-code solutions, which promise to streamline and democratize the software development process.
This shift is particularly impactful for enterprises, which require robust, scalable, and efficient software to manage their operations. In this article, we will explore the various facets of enterprise software development with no-code solutions, examining their benefits, challenges, and the future potential they hold.
Understanding Globus Data Transfers with NetSage - Globus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks worldwide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient... - Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. This is where custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
Globus Connect Server Deep Dive - GlobusWorld 2024 - Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.