This document discusses benchmarking Hive at Yahoo scale. Some key points:
- Hive is the fastest growing product on Yahoo's Hadoop clusters which process 750k jobs per day across 32500 nodes.
- Benchmarking was done using TPC-H queries on 100GB, 1TB, and 10TB datasets stored in ORC format.
- Significant performance improvements were seen over earlier Hive versions, with 18x speedup over Hive 0.10 on text files for the 100GB dataset.
- Average query time was reduced from 530 seconds to 28 seconds for the 100GB dataset, and from 729 seconds to 172 seconds for the 1TB dataset.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and the options to use to maximize your read performance. In particular, we’ll discuss when and how to use ORC’s schema evolution, bloom filters, and predicate push down. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file including the type in the file and min, max, and count for each column.
Are you using the fastest query tool for Hadoop? Provide and discuss the latest performance results of the industry standard TPC_H benchmarks executed across an assortment of open source query tools such as Hive (using MR, TEZ, LLAP, SPARK), SparkSQL, Presto, and Drill. Additionally, the performance tests will utilize a variety of data sizes and popular storage formats such as ORC, Parquet and Text and compression codecs.
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, different kinds of data have distinct properties. We've used three real schemas:
* the NYC taxi data http://tinyurl.com/nyc-taxi-analysis
* the Github access logs http://githubarchive.org
* a typical sales fact table with generated data
Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache.
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
Igor Anishchenko
Odessa Java TechTalks
Lohika - May, 2012
Let's take a step back and compare data serialization formats, of which there are plenty. What are the key differences between Apache Thrift, Google Protocol Buffers and Apache Avro. Which is "The Best"? Truth of the matter is, they are all very good and each has its own strong points. Hence, the answer is as much of a personal choice, as well as understanding of the historical context for each, and correctly identifying your own, individual requirements.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and the options to use to maximize your read performance. In particular, we’ll discuss when and how to use ORC’s schema evolution, bloom filters, and predicate push down. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file including the type in the file and min, max, and count for each column.
Are you using the fastest query tool for Hadoop? Provide and discuss the latest performance results of the industry standard TPC_H benchmarks executed across an assortment of open source query tools such as Hive (using MR, TEZ, LLAP, SPARK), SparkSQL, Presto, and Drill. Additionally, the performance tests will utilize a variety of data sizes and popular storage formats such as ORC, Parquet and Text and compression codecs.
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, different kinds of data have distinct properties. We've used three real schemas:
* the NYC taxi data http://tinyurl.com/nyc-taxi-analysis
* the Github access logs http://githubarchive.org
* a typical sales fact table with generated data
Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache.
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
Igor Anishchenko
Odessa Java TechTalks
Lohika - May, 2012
Let's take a step back and compare data serialization formats, of which there are plenty. What are the key differences between Apache Thrift, Google Protocol Buffers and Apache Avro. Which is "The Best"? Truth of the matter is, they are all very good and each has its own strong points. Hence, the answer is as much of a personal choice, as well as understanding of the historical context for each, and correctly identifying your own, individual requirements.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
CDC Stream Processing with Apache FlinkTimo Walther
An instant world requires instant decisions at scale. This includes the ability to digest and react to changes in real-time. Thus, event logs such as Apache Kafka can be found in almost every architecture, while databases and similar systems still provide the foundation. Change Data Capture (CDC) has become popular for propagating changes. Nevertheless, integrating all these systems, which often have slightly different semantics, can be a challenge.
In this talk, we highlight what it means for Apache Flink to be a general data processor that acts as a data integration hub. Looking under the hood, we demonstrate Flink's SQL engine as a changelog processor that ships with an ecosystem tailored to processing CDC data and maintaining materialized views. We will discuss the semantics of different data sources and how to perform joins or stream enrichment between them. This talk illustrates how Flink can be used with systems such as Kafka (for upsert logging), Debezium, JDBC, and others.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
Zstandard is a fast compression algorithm which you can use in Apache Spark in various way. In this talk, I briefly summarized the evolution history of Apache Spark in this area and four main use cases and the benefits and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but also it maximizes OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard buffer pool feature and its performance gain is impressive, too.
2) Event log compression is another area to save your storage cost on the cloud storage like S3 and to improve the usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 supports Zstandardalready and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There are more community works to utilize Zstandard to improve Spark. For example, Apache Avro community also supports Zstandard and SPARK-34479 aims to support Zstandard in Spark’s avro file format in Spark 3.2.0.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Delta from a Data Engineer's PerspectiveDatabricks
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end to end Big Data solutions.
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...GetInData
Did you like it? Check out our E-book: Apache NiFi - A Complete Guide
https://ebook.getindata.com/apache-nifi-complete-guide
Apache NiFi is one of the most popular services for running ETL pipelines otherwise it’s not the youngest technology. During the talk, there are described all details about migrating pipelines from the old Hadoop platform to the Kubernetes, managing everything as the code, monitoring all corner cases of NiFi and making it a robust solution that is user-friendly even for non-programmers.
Author: Albert Lewandowski
Linkedin: https://www.linkedin.com/in/albert-lewandowski/
___
Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceDataWorks Summit
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query.
Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator.
Meta/Facebook's database serving social workloads is running on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depends a lot on RocksDB. Not just MyRocks, but also we have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesMithun Radhakrishnan
Here's the talk that we presented at the Hadoop Summit 2015, in San Jose. This was an inside look at how we at Yahoo scaled Hive to work at Yahoo's data/metadata scale.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
CDC Stream Processing with Apache FlinkTimo Walther
An instant world requires instant decisions at scale. This includes the ability to digest and react to changes in real-time. Thus, event logs such as Apache Kafka can be found in almost every architecture, while databases and similar systems still provide the foundation. Change Data Capture (CDC) has become popular for propagating changes. Nevertheless, integrating all these systems, which often have slightly different semantics, can be a challenge.
In this talk, we highlight what it means for Apache Flink to be a general data processor that acts as a data integration hub. Looking under the hood, we demonstrate Flink's SQL engine as a changelog processor that ships with an ecosystem tailored to processing CDC data and maintaining materialized views. We will discuss the semantics of different data sources and how to perform joins or stream enrichment between them. This talk illustrates how Flink can be used with systems such as Kafka (for upsert logging), Debezium, JDBC, and others.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
Zstandard is a fast compression algorithm which you can use in Apache Spark in various way. In this talk, I briefly summarized the evolution history of Apache Spark in this area and four main use cases and the benefits and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but also it maximizes OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard buffer pool feature and its performance gain is impressive, too.
2) Event log compression is another area to save your storage cost on the cloud storage like S3 and to improve the usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 supports Zstandardalready and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There are more community works to utilize Zstandard to improve Spark. For example, Apache Avro community also supports Zstandard and SPARK-34479 aims to support Zstandard in Spark’s avro file format in Spark 3.2.0.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Delta from a Data Engineer's PerspectiveDatabricks
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end to end Big Data solutions.
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...GetInData
Did you like it? Check out our E-book: Apache NiFi - A Complete Guide
https://ebook.getindata.com/apache-nifi-complete-guide
Apache NiFi is one of the most popular services for running ETL pipelines otherwise it’s not the youngest technology. During the talk, there are described all details about migrating pipelines from the old Hadoop platform to the Kubernetes, managing everything as the code, monitoring all corner cases of NiFi and making it a robust solution that is user-friendly even for non-programmers.
Author: Albert Lewandowski
Linkedin: https://www.linkedin.com/in/albert-lewandowski/
___
Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceDataWorks Summit
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query.
Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator.
Meta/Facebook's database serving social workloads is running on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depends a lot on RocksDB. Not just MyRocks, but also we have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesMithun Radhakrishnan
Here's the talk that we presented at the Hadoop Summit 2015, in San Jose. This was an inside look at how we at Yahoo scaled Hive to work at Yahoo's data/metadata scale.
Experimentation plays a vital role in business growth at eBay by providing valuable insights and prediction on how users will reach to changes made to the eBay website and applications. On a given day, eBay has several hundred experiments running at the same time. Our experimentation data processing pipeline handles billions of rows user behavioral and transactional data per day to generate detailed reports covering 100+ metrics over 50 dimensions.
In this session, we will share our journey of how we moved this complex process from Data warehouse to Hadoop. We will give an overview of the experimentation platform and data processing pipeline. We will highlight the challenges and learnings we faced implementing this platform in Hadoop and how this transformation led us to build a scalable, flexible and reliable data processing workflow in Hadoop. We will cover our work done on performance optimizations, methods to establish resilience and configurability, efficient storage formats and choices of different frameworks used in the pipeline.
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldOutlyer
As machine learning moves from niche to mainstream tech stacks how do DevOps engineers prepare for a very different set of problems. A brief look at the new issues that arise from machine learning, an overview of cutting-edge "old school" solutions and how to drag data science (kicking and screaming) into a world of automation.
Video: https://www.youtube.com/watch?v=KHxZCRajRiA
Join DevOps Exchange London here: http://meetup.com/DevOps-Exchange-London/
Follow DOXLON on twitter http://www.twitter.com/doxlon
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Charles Allen
Charles Allen covers data processing, analytics, and insights systems at Snap. Strength points for Druid use cases are called out as are differences in some of the processing systems used.
This is the slide collection from the second talk from:
https://www.meetup.com/druidio-la/events/254080924/
Is It A Right Time For Me To Learn Hadoop. Find out ?Edureka!
Forrester predicts, CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved as a must-to-know technology and has been a reason for better career, salary and job opportunities for many professionals.
Blueprint Series: Expedia Partner Solutions, Data PlatformMatt Stubbs
Join Anselmo for an engaging overview of the new end-to-end data architecture at Expedia Group, taking a journey through cloud and on-prem data lakes, real-time and batch processes and streamlined access for data producers and consumers. Find out how the new architecture unifies a complex mix of data sources and feeds the data science development cycle. Expedia might appear to be a market-leading travel company – in reality, it’s a highly successful technology and data science company.
Using real time big data analytics for competitive advantageAmazon Web Services
Many organisations find it challenging to successfully perform real-time data analytics using their own on premise IT infrastructure. Building a system that can adapt and scale rapidly to handle dramatic increases in transaction loads can potentially be quite a costly and time consuming exercise.
Most of the time, infrastructure is under-utilised and it’s near impossible for organisations to forecast the amount of computing power they will need in the future to serve their customers and suppliers.
To overcome these challenges, organisations can instead utilise the cloud to support their real-time data analytics activities. Scalable, agile and secure, cloud-based infrastructure enables organisations to quickly spin up infrastructure to support their data analytics projects exactly when it is needed. Importantly, they can ‘switch off’ infrastructure when it is not.
BluePi Consulting and Amazon Web Services (AWS) are giving you the opportunity to discover how organisations are using real time data analytics to gain new insights from their information to improve the customer experience and drive competitive advantage.
Hadoop and the Relational Database: The Best of Both WorldsInside Analysis
The Briefing Room with Dr. Robin Bloor and Splice Machine
Live Webcast on August 5, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=71551d669454741c8bd56f2349bdf140
As the pressure of Big Data collides with the reality of daily operations, many organizations are trying to solve the challenge of meeting new requirements without disrupting the flow of business. One solution focuses on the data layer itself, by combining the well known functionality of relational database technology with the scale-out capabilities of Hadoop.
Register for this episode of The Briefing Room to hear from veteran Analyst Dr. Robin Bloor as he outlines the critical components of a business-ready data layer. He’ll be briefed by John Leach and Rich Reimer of Splice Machine who will explain how their solution delivers the best of both data worlds: the trusted capabilities of relational with the infinite scalability of Hadoop. They will also discuss how Hadoop has transformed from a batch-oriented workhorse into a scale-out layer capable of supporting real-time applications and operational analytics using traditional SQL.
Visit InsideAnlaysis.com for more information.
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...Big Data Montreal
Despite how fantastic pigs look with lipstick on and how magical elephants look with wings attached, there remains a large gap between what popular big data stacks offer and what end users demand in terms of reporting agility and speed. Join us to learn how Montreal-based AdGear, an advertising technology company, faced challenges as its data volume increased. You will hear how AdGear's data stack evolved to meet these challenges, and how HP Vertica's architecture and features changed the game.
(by Mina Naguib, Technical Director of Platform Engineering at AdGear).
https://youtu.be/tzQUUCuVjVc
Description of some of the elements that go in to creating a PostgreSQL-as-a-Service for organizations with many teams and a diverse ecosystem of applications and teams.
Real time analytics is a beautiful thing, especially if you can build it in quick, scalable & robust way. We built a digital command center for our marketing team, which provided real time analytics on social media, clickstream and google search term in a span of couple of months. This solution was entirely build on open source technologies, using a combination of Apache Nifi, Elastic search & Hadoop. Simple but very effective. In this presentation i would like to share the architecture, learning and business benefits of this solution.
Similar to Hive and Apache Tez: Benchmarked at Yahoo! Scale (20)
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used are the specific durability requirements of HBase's write-ahead log (WAL) and HDFS providing that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be so trivial to design applications that make most of its use, neither the most simple to operate. As it depends/integrates with other components from Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc) or external systems ( Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation there's also the fact that HBase is still an evolving product, with different release versions being used currently, some of those can carry genuine software bugs. On this presentation, we'll go through the most common HBase issues faced by different organisations, describing identified cause and resolution action over my last 5 years supporting HBase to our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables such as trips that are updatable. These tables are fundamental part of Uber's data-driven solutions, and act as the source-of-truth for all the analytical use-cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information of the data layout, and annotates each incoming change with the location in HDFS where this data should be written. This component is called as Global Indexing. Without this component, all records get treated as inserts and get re-written to HDFS instead of being updated. This leads to duplication of data, breaking data correctness and user queries. This component is key to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. This component will need to have strong consistency and provide large throughputs for index writes and reads.
At Uber, we have chosen HBase to be the backing store for the Global Indexing component and is a critical component in allowing us to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound more on why we built the global index using Apache Hbase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load Hfiles directly to the backend circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in house projects to support Data Analytics on hundreds of petabytes of data. Our platform support storage, compute, data ingestion, discovery and management and various tools and libraries to help users for both batch and realtime analytics. Our DataPlatform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use cloud as another datacenter. We walk through our evaluation process, challenges we faced supporting data analytics at Twitter scale on cloud and present our current solution. Extending Twitter's Data platform to cloud was complex task which we deep dive in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images describing the possible ways that a retail store of the near future could operate. Identifying various storefront situations by having a deep learning system attached to a camera stream. Such things as; identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Hive and Apache Tez: Benchmarked at Yahoo! Scale
1. Benchmarking Hive at Yahoo Scale
P R E S E N T E D B Y M i t h u n R a d h a k r i s h n a n ⎪ J u n e 4 , 2 0 1 4
2 0 1 4 H a d o o p S u m m i t , S a n J o s e , C a l i f o r n i a
2. About myself
2
HCatalog Committer, Hive
contributor
› Metastore, Notifications, HCatalog APIs
› Integration with Oozie, Data Ingestion
Other odds and ends
› DistCp
mithun@apache.org
2014 Hadoop Summit, San Jose, California
3. About this talk
3
Introduction to “Yahoo Scale”
The use-case in Yahoo
The Benchmark
The Setup
The Observations (and, possibly, lessons)
Fisticuffs
2014 Hadoop Summit, San Jose, California
4. The Y!Grid
4
16 Hadoop Clusters in YGrid
› 32500 Nodes
› 750K jobs a day
Hadoop 0.23.10.x, 2.4.x
Large Datasets
› Daily, hourly, minute-level frequencies
› Terabytes of data, 1000s of files, per dataset instance
Pig 0.11
Hive 0.10 / HCatalog 0.5
› => Hive 0.12
2014 Hadoop Summit, San Jose, California
5. Data Processing Use cases
5 2014 Hadoop Summit, San Jose, California
Pig for Data Pipelines
› Imperative paradigm
› ~45% Hadoop Jobs on Production Clusters
• M/R + Oozie = 41%
Hive for Ad hoc queries
› SQL
› Relatively smaller number of jobs
• *Major* Uptick
Use HCatalog for Inter-op
6. 6 Yahoo Confidential & Proprietary
Hive is Currently the Fastest Growing Product on the Grid
0.0%
1.0%
2.0%
3.0%
4.0%
5.0%
6.0%
7.0%
8.0%
9.0%
10.0%
0
5
10
15
20
25
30
Mar-13 Apr-13 May-13 Jun-13 Jul-13 Aug-13 Sep-13 Oct-13 Nov-13 Dec-13 Jan-14 Feb-14 Mar-14 Apr-14 May-14
HiveJobs(%ofAllJobs)
AllGridJobs(inMillions)
All Jobs Hive (% of all jobs)
2.4 million
Hive jobs
7. Business Intelligence Tools
7
{Tableau, MicroStrategy, Excel, … }
Challenges:
› Security
• ACLs, Authentication, Encryption over the wire, Full-disk Encryption
› Bandwidth
• Transporting results over ODBC
› Query Latency
• Query execution time
• Cost of query “optimizations”
• “Bad” queries
2014 Hadoop Summit, San Jose, California
8. The Benchmark
8
TPC-h
› Industry standard (tpc.org/tpch)
› 22 queries
› dbgen –s 1000 –S 3
• Parallelizable
Reynold Xin’s excellent work:
› https://github.com/rxin
› Transliterated queries to suit Hive 0.9
2014 Hadoop Summit, San Jose, California
9. Relational Diagram
9 2014 Hadoop Summit, San Jose, California
PARTKEY
NAME
MFGR
BRAND
TYPE
SIZE
CONTAINER
COMMENT
RETAILPRICE
PARTKEY
SUPPKEY
AVAILQTY
SUPPLYCOST
COMMENT
SUPPKEY
NAME
ADDRESS
NATIONKEY
PHONE
ACCTBAL
COMMENT
ORDERKEY
PARTKEY
SUPPKEY
LINENUMBER
RETURNFLAG
LINESTATUS
SHIPDATE
COMMITDATE
RECEIPTDATE
SHIPINSTRUCT
SHIPMODE
COMMENT
CUSTKEY
ORDERSTATUS
TOTALPRICE
ORDERDATE
ORDER-
PRIORITY
SHIP-
PRIORITY
CLERK
COMMENT
CUSTKEY
NAME
ADDRESS
PHONE
ACCTBAL
MKTSEGMENT
COMMENT
PART (P_)
SF*200,000
PARTSUPP (PS_)
SF*800,000
LINEITEM (L_)
SF*6,000,000
ORDERS (O_)
SF*1,500,000
CUSTOMER (C_)
SF*150,000
SUPPLIER (S_)
SF*10,000
ORDERKEY
NATIONKEY
EXTENDEDPRICE
DISCOUNT
TAX
QUANTITY
NATIONKEY
NAME
REGIONKEY
NATION (N_)
25
COMMENT
REGIONKEY
NAME
COMMENT
REGION (R_)
5
10. The Setup
10
› 350 Node cluster
• Xeon boxen: 2 Slots with E5530s => 16 CPUs
• 24GB memory
– NUMA enabled
• 6 SATA drives, 2TB, 7200 RPM Seagates
• RHEL 6.4
• JRE 1.7 (-d64)
• Hadoop 0.23.7+/2.3+, Security turned off
• Tez 0.3.x
• 128MB HDFS block-size
› Downscale tests: 100 Node cluster
• hdfs-balancer.sh
2014 Hadoop Summit, San Jose, California
11. The Prep
11
Data generation:
› Text data: dbgen on MapReduce
› Transcode to RCFile and ORC: Hive on MR
• insert overwrite table orc_table partition( … ) select * from text_table;
› Partitioning:
• Only for 1TB, 10TB cases
• Perils of dynamic partitioning
› ORC File:
• 64MB stripes, ZLIB Compression
2014 Hadoop Summit, San Jose, California
14. 100 GB
14
› 18x speedup over Hive 0.10 (Textfile)
• 6-50x
› 11.8x speedup over Hive 0.10 (RCFile)
• 5-30x
› Average query time: 28 seconds
• Down from 530 (Hive 0.10 Text)
› 85% queries completed in under a minute
2014 Hadoop Summit, San Jose, California
16. 1 TB
16
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 2.5-17x
› Average query time: 172 seconds
• Between 5-947 seconds
• Down from 729 seconds (Hive 0.10 RCFile)
› 61% queries completed in under 2 minutes
› 81% queries completed in under 4 minutes
2014 Hadoop Summit, San Jose, California
18. 10 TB
18
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 1.6-10x
› Average query time: 908 seconds (426 seconds excluding outliers)
• Down from 2129 seconds with Hive 0.10 RCFile
– (1712 seconds excluding outliers)
› 61% queries completed in under 5 minutes
› 71% queries completed in under 10 minutes
› Q6 still completes in 12 seconds!
2014 Hadoop Summit, San Jose, California
19. Explaining the speed-ups
19
Hadoop 2.x, et al.
Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between M&R
• Temporary data and the HDFS
› Feedback loop
› Smart scheduling
› Container re-use
› Pipelined job start-up
Hive
› Statistics
› “Vector-ized” Execution
ORC
› PPD
2014 Hadoop Summit, San Jose, California
20. 20 2014 Hadoop Summit, San Jose, California
0
100
200
300
400
500
600
700
800
900
1000
q1_pricing_sum
mary_report.hive
q2_m
inim
um
_cost_supplier.hive
q3_shipping_priority.hiveq4_order_priority
q5_local_supplier_volume.hive
q6_forecast_revenue_change.hive
q7_volume_shipping.hive
q8_na
onal_m
arket_share.hive
q9_product_type_profit.hive
q10_returned_item
.hive
q11_im
portant_stock.hiveq12_shipping.hive
q13_customer_distribu
on.hive
q14_promo
on_effect.hive
q15_top_supplier.hive
q16_parts_supplier_rela
onship.hive
q17_small_quan
ty_order_revenue.hive
q18_large_volum
e_customer.hive
q19_discounted_revenue.hive
q20_poten
al_part_prom
o
on.hive
q21_suppliers_who_kept_orders_waing.hive
q22_global_sales_opportunity.hive
Time(inseconds)
Vectoriza on
Hive 0.13 Tez ORC
Hive 0.13 Tez ORC Vec
21. 21 2014 Hadoop Summit, San Jose, California
ORC File Layout
Data is composed of multiple streams per
column
Index allows for skipping rows (default to
every 10,000 rows), keeping position in
each stream, and min-max for each
column
Footer contains directory of stream
locations, and the encoding for each
column
Integer columns are serialized using run-
length encoding
String columns are serialized using
dictionary for column values, and the
same run length encoding
Stripe footer is used to find the requested
column’s data streams and adjacent
stream reads are merged File Footer
Postscript
Index Data
Row Data
Stripe Footer
256MBStripe
Index Data
Row Data
Stripe Footer
256MBStripe
Index Data
Row Data
Stripe Footer
256MBStripe
Column 1
Column 2
Column 7
Column 8
Column 3
Column 6
Column 4
Column 5
Column 1
Column 2
Column 7
Column 8
Column 3
Column 6
Column 4
Column 5
Stream 2.1
Stream 2.2
Stream 2.3
Stream 2.4
22. 22 2014 Hadoop Summit, San Jose, California
ORC Usage
CREATE TABLE addresses (
name string,
street string,
city string,
state string,
zip int
)
STORED AS orc TBLPROPERTIES ("orc.compress"= "ZLIB");
LOCATION ‘/path/to/addresses’;
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc
SET hive.default.fileformat = orc
SET hive.exec.orc.memory.pool = 0.50 (ORC writer is allowed 50% of JVM heap size by default)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde’
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat’
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
Key Default Comments
orc.compress ZLIB high-level compression (one of NONE, ZLIB, Snappy)
orc.compress.size 262,144 (256 KB) number of bytes in each compression chunk
orc.stripe.size 67,108,864 (64 MB) number of bytes in each stripe. Each ORC stripe is processed in one map task (try 32
MB to cut down on disk I/O)
orc.row.index.stride 10,000 number of rows between index entries (must be >= 1,000). A larger stride-size
increases the probability of not being able to skip the stride, for a predicate.
orc.create.index true whether to create row indexes. This is for predicate push-down. If data is frequently
accessed/filtered on a certain column, then sorting on the column and using index-filters
makes column filters work faster
25. Configuring ORC
25
set hive.merge.mapredfiles=true
set hive.merge.mapfiles=true
set orc.stripe.size=67,108,864
› Half the HDFS block-size
• Prevent cross-block stripe-read
• Tangent: DistCp
set orc.compress=???
› Depends on size and distribution
› Snappy compression hasn’t been explored
YMMV
› Experiment
2014 Hadoop Summit, San Jose, California
26. 26 2014 Hadoop Summit, San Jose, California
0
100
200
300
400
500
600
700
800
900
1000
q1_pricing_sum
m
ary_report.hive
q2_m
inim
um
_cost_supplier.hive
q3_shipping_priority.hive
q4_order_priority
q5_local_supplier_volum
e.hive
q6_forecast_revenue_change.hive
q7_volum
e_shipping.hive
q8_na
onal_m
arket_share.hive
q9_product_type_profit.hive
q10_returned_item
.hive
q11_im
portant_stock.hive
q12_shipping.hive
q13_custom
er_distribu
on.hive
q14_prom
o
on_effect.hive
q15_top_supplier.hive
q16_parts_supplier_rela
onship.hive
q17_sm
all_quan
ty_order_revenue.hive
q18_large_volum
e_custom
er.hive
q19_discounted_revenue.hive
q20_poten
al_part_prom
o
on.hive
q21_suppliers_w
ho_kept_orders_w
ai
ng.hive
q22_global_sales_opportunity.hive
Time(inseconds)
100 vs 350 Nodes
Hive 0.13 100 Nodes
Hive 0.13 350 Nodes
28. Y!Grid sticking with Hive
28
Familiarity
› Existing ecosystem
Community
Scale
Multitenant
Coming down the pike
› CBO
› In-memory caching solutions atop HDFS
• RAMfs a la Tachyon?
2014 Hadoop Summit, San Jose, California
29. We’re not done yet
29
SQL compliance
Scaling up the metastore
performance
Better BI Tool integration
Faster transport
› HiveServer2 result-sets
2014 Hadoop Summit, San Jose, California
30. References
30
The YDN blog post:
› http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-
tez-and-yarn
Code:
› https://github.com/mythrocks/hivebench (TPC-h scripts, datagen, transcode utils)
› https://github.com/t3rmin4t0r/tpch-gen (Parallel TPC-h gen)
› https://github.com/rxin/TPC-H-Hive (TPC-h scripts for Hive)
› https://issues.apache.org/jira/browse/HIVE-600 (Yuntao’s initial TPC-h JIRA)
2014 Hadoop Summit, San Jose, California
33. Sharky comments
33
Testing with Shark 0.7.x and Shark 0.8
› Compatible with Hive Metastore 0.9
› 100GB datasets : Admirable performance
› 1TB/10TB: Tests did not run completely
• Failures, especially in 10TB cases
• Hangs while shuffling data
• Scaled back to 100 nodes -> More tests ran through, but not completely
› nReducers: Not inferred
Miscellany
› Security
› Multi-tenancy
› Compatibility
2014 Hadoop Summit, San Jose, California
Editor's Notes
Gopal was supposed to be presenting this with me, to talk about Tez. Point to Gopal/Jitendra’s talk on Hive/Tez for details on things I’ll have to skim over.
Also, acknowledge Thomas Graves, who’s talking today about the excellent work he’s doing on driving Spark on Yarn.
There are several sides to query latency:
Query execution time : Addressed in the physical query-execution layer.
Query optimizations: The first step while optimizing the query plan seems to be to query for all partition instances. Very expensive for “Project Benzene”.
Bad queries : Tableau, I’m looking at you.
The Transaction Processing Performance Council (inexplicably abbreviated to TPC) suggests a set of benchmarks for query processing. Many have adopted TPC-DS to showcase performance. We chose TPC-h to complement. (Also, 22 much smaller number to deal with than… 90?)
Transliteration: Evita and Kylie Minogue
Lineitem and Orders are extremely large Fact tables. Nation and Region are the smallest dimension tables.
Tangent: Funny story:
1. About hard-drives: Can set up MR intermediate directories and HDFS data-node directories to be on different disks. Traffic from one doesn’t affect the other. But on the other hand, total read bandwidth might be reduced.
Line-item: Partitioned on Ship-date.
Orders: Order-date
Customers: By market-segment
Suppliers: On their region-key.
Q5 and q21 are anomalous.
Q21: Hit a trailing reducer across all versions of Hive tested. Perhaps this can be improved with a better plan.
Q5: Slow reducer that hit only Hive 13. Could be a bad plan. Could be a difference in data distribution when data was regenerated for Hadoop 2 cluster.
Tez : Scheduling. Playing the gaps, like Beethoven’s Fifth.
Vectorization: On average: 1.2x.
Except for a few outliers, ZLIB compression actually reduced performance for a 1TB dataset. Uncompressed was 1.3x faster than Compressed.
The situation reverses at the 10 TB level. The gains from decompression are actually offset by the disk-read time.
The long-tail in 10TB/q21 threw the scale of the graph off, so I’ve excluded it in the results.
Talk about file-coalesce, small-file generation, Namenode pressure and parallelism.
You don’t want to read an ORC stripe from a different node.
Talk about distcp –pgrub, for ORC files.
Mention that SNAPPY’s license is not Apache.
Also, Yoda.
At 100 nodes, it performs at 0.9x the 350 node performance.
We’ve seen Hive and Tez scale down for latency, scale up for data-size, and scale out across larger clusters.
Familiarity : We have an existing ecosystem with Hive, HCatalog, Pig and Oozie that delivers revenue to Yahoo today. It’s hard to rock the boat.
Community: The Apache Hive community is large, active and thriving. They’ve been solving issues with query latency for ages now. The switch to using the Tez execution engine was a solution within the Apache Hive project. This wasn’t a fork of Hive. This is Hive, proper.
Scale: We’ve seen Hive and Tez perform at scale. Heck, we’ve seen Pig perform on Tez.
Multitenant: Yahoo’s use-case is unique, and not just because of data-scale. There’s hundreds of active users and genuine multitenancy and security concerns.
Design: We think the Hive community has tackled the right problems first, rather than throw RAM at the problem.
Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.
Security: Kerberos support was patched in, after the benchmarks were run.
Multi-tenancy:
Data needs to be explicitly pinned into memory as RDDs.
In a multi-tenant system, how would pinning work? Eviction policy for data.
Compatibility:
Needs to work with Metastore versions 12 and 13. Shark’s gone to 0.11 just recently.
Integration with the rest of the stack: Oozie and Pig.
Overall, we wanted a solution that works with high-dynamic range. i.e. works well with small datasets (100s of GBs), as well as scale to multi-terabyte datasets. We have a familiar system that seems to fit that bill. It doesn’t quite rock the boat. It’s not perfect yet. There are bugs that we’re working on. And we still haven’t solved the problem of data-volume/BI.
By the way, I really like the idea of BlinkDB. I saw the JIRA.