This document provides an overview of Apache Kylin, an open source distributed analytics engine that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets. It discusses Kylin's features such as fast OLAP capabilities, an ANSI SQL interface, integration with BI tools, and job management. The document also covers Kylin's architecture, the cube building process, storage in HBase, and query planning.
Talks about best practices and patterns for designing an efficient cube in Kylin. Covers concepts such as mandatory dimensions, hierarchy dimensions, derived dimensions, incremental builds, and aggregation groups.
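To make the payoff of these pruning concepts concrete, here is a small, purely illustrative sketch (not Kylin code; the dimension names and rules are hypothetical) showing how a mandatory dimension and a hierarchy shrink the number of cuboids a cube has to materialize:

```python
from itertools import combinations

dimensions = ["year", "month", "day", "site", "category", "seller"]  # hypothetical
mandatory = {"site"}                  # must appear in every cuboid
hierarchy = ["year", "month", "day"]  # only prefix combinations make sense

def is_valid(cuboid):
    # A mandatory dimension has to be present in every cuboid.
    if not mandatory <= cuboid:
        return False
    # Hierarchy dimensions may only appear as a prefix: year, year+month, year+month+day.
    picked = [d for d in hierarchy if d in cuboid]
    return picked == hierarchy[:len(picked)]

all_cuboids = [set(c) for r in range(len(dimensions) + 1)
               for c in combinations(dimensions, r)]
print("full cube  :", len(all_cuboids))                        # 2^6 = 64 cuboids
print("pruned cube:", sum(is_valid(c) for c in all_cuboids))   # only 16 remain
```

The same trade-off between pre-computation (space) and query latency (time) is what the deck's design rules are about.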
Apache Kylin 2.0: From Classic OLAP to Real-Time Data Warehouse - Yang Li
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
This talk provides an in-depth overview of the key concepts of Apache Calcite. It explores the Calcite catalog, parsing, validation, and optimization with various planners.
Traditionally, database systems were optimized either for OLAP or for OLTP workloads. Mainstream DBMSes such as Postgres and MySQL are mostly used for OLTP, while Greenplum, Vertica, ClickHouse, SparkSQL, and others are oriented toward analytic queries. But today many companies do not want to maintain two different data stores for OLAP and OLTP and need to run analytic queries on the most recent data. I want to discuss which features should be added to Postgres to efficiently handle HTAP workloads.
Accelerating Big Data Analytics with Apache Kylin - Tyler Wishnoff
Learn about the latest advancements in Apache Kylin and how its OLAP technology is making analytics faster and insights more actionable.
Learn more about Apache Kylin: https://kyligence.io/apache-kylin-overview/
Learn more about Apache Kylin's enterprise version Kyligence: https://kyligence.io/
Building Reliable Lakehouses with Apache Flink and Delta Lake - Flink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together let you build the foundation of a data lakehouse by ensuring the reliability of your concurrent streams, from processing down to the underlying cloud object store. The Flink/Delta Connector lets you store data in Delta tables so that you harness Delta's ACID transactions and scalability while keeping Flink's end-to-end exactly-once processing. Data from Flink is written to Delta tables idempotently, so even if the Flink pipeline is restarted from its checkpoint information, no data is lost or duplicated, preserving Flink's exactly-once semantics.
by Scott Sandre & Denny Lee
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
Webinar: Deep Dive on Apache Flink State - Seth Wiesman (Ververica)
Apache Flink is a world-class stateful stream processor that presents a huge variety of optional features and configuration choices to the user. Determining the optimal choice for any production environment and use case can be challenging. In this talk, we will explore and discuss the universe of Flink configuration with respect to state and state backends.
We will start with a closer look under the hood, at core data structures and algorithms, to build the foundation for understanding the impact of tuning parameters and the cost-benefit tradeoffs that come with certain features and options. In particular, we will focus on state backend choices (Heap vs. RocksDB), tuning checkpointing (incremental checkpoints, ...) and recovery (local recovery), serializers, and Apache Flink's new state migration capabilities.
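As a rough illustration of the knobs discussed here, the PyFlink sketch below enables checkpointing and selects the RocksDB state backend; the flink-conf.yaml keys in the comments are the cluster-wide equivalents. Exact class names and defaults vary by Flink version, so treat this as an assumption-laden sketch rather than a recommendation:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()

# Take a checkpoint every 60 seconds so state can be restored after a failure.
env.enable_checkpointing(60_000)

# Keep large state off the JVM heap; RocksDB also unlocks incremental checkpoints.
env.set_state_backend(EmbeddedRocksDBStateBackend())

# The same choices can be made cluster-wide in flink-conf.yaml, e.g.:
#   state.backend: rocksdb
#   state.backend.incremental: true
#   state.backend.local-recovery: true
```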
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... - Databricks
Parquet is a very popular column-based format. Spark can automatically skip useless data using Parquet file statistics, such as min-max statistics, via pushdown filters. Spark users can also enable the Parquet vectorized reader to read Parquet files in batches. These features improve Spark performance greatly and save both CPU and IO. Parquet is the default data format of the data warehouse at Bytedance. In practice, we found that Parquet pushdown filters worked poorly and read far too much unnecessary data, because the statistics had no discriminating power across Parquet row groups (column data is out of order when written to Parquet files by ETL jobs).
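As a rough sketch of the kind of remedy this abstract hints at (paths and column names are made up for illustration), sorting the data on the filter column before the write gives each row group tight min-max statistics, so pushdown can actually skip row groups:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pushdown-demo").getOrCreate()

# Both settings default to true in recent Spark versions; shown here for clarity.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

events = spark.read.parquet("/warehouse/events_raw")  # hypothetical input path

# Out-of-order data means every row group spans almost the full range of user_id,
# so min-max statistics cannot exclude anything. Sorting before the write fixes that.
(events
    .sort("user_id")
    .write.mode("overwrite")
    .parquet("/warehouse/events_sorted"))

# The predicate is pushed down and most row groups are now skipped entirely.
spark.read.parquet("/warehouse/events_sorted").where("user_id = 42").show()
```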
How to build a streaming Lakehouse with Flink, Kafka, and Hudi - Flink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep into how Flink can leverage the newest features of Hudi, like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by Ethan Guo & Kyle Weller
Building robust CDC pipeline with Apache Hudi and Debezium - Tathastu.ai
We will cover the need for CDC and the benefits of building a CDC pipeline, compare various CDC streaming and reconciliation frameworks, and describe the architecture and the challenges we faced while running this system in production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry, and Debezium in detail, along with our contributions to the open-source community.
Common Strategies for Improving Performance on Your Delta Lakehouse - Databricks
The Delta Architecture pattern has made the lives of data engineers much simpler, but what about improving query performance for data analysts? What are some common places to look when tuning query performance? In this session we will cover some common techniques to apply to our Delta tables to make them perform better for data analysts' queries. We will look at a few examples of how you can analyze a query and determine what to focus on to deliver better performance results.
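One example of the kind of technique such a session typically walks through (the table and column names here are hypothetical, and OPTIMIZE ... ZORDER BY requires a Delta Lake or Databricks runtime that supports it) is compacting small files and clustering on a column analysts filter by, then checking the query plan:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-tuning-demo").getOrCreate()

# Compact small files and co-locate rows on a commonly filtered column.
spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id)")

# Inspect how a typical analyst query is executed after the optimization.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales.orders
    WHERE order_date >= '2021-01-01'
    GROUP BY customer_id
""").explain()
```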
Apache Calcite (a tutorial given at BOSS '21) - Julian Hyde
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives.
In this tutorial (given at BOSS '21 in Copenhagen as part of VLDB '21) the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Presenters: Julian Hyde and Stamatis Zampetakis
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache - Dremio Corporation
From DataEngConf 2017 - Everybody wants to get to data faster. As we move from more general solutions to specific optimization techniques, the level of performance impact grows. This talk will discuss how layering in-memory caching, columnar storage, and relational caching can combine to provide a substantial improvement in overall data science and analytical workloads. It will include a detailed overview of how you can use Apache Arrow, Calcite, and Parquet to achieve improvements of multiple orders of magnitude in performance over what is currently possible.
A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg have sprung up. Along with the Hive Metastore, these table formats are trying to solve problems that have stood in traditional data lakes for a long time, with declared features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO - Altinity Ltd
From webinar on December 3, 2019
New users of ClickHouse love the speed but may run into a few surprises when designing applications. Column storage turns classic SQL design precepts on their heads. This talk shares our favorite tricks for building great applications. We'll talk about fact tables and dimensions, materialized views, codecs, arrays, and skip indexes, to name a few of our favorites. We'll show examples of each and also reserve time to handle questions. Join us to take your next step to ClickHouse guruhood!
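A tiny illustration of two of those tricks, run through the clickhouse-driver Python client (the table, columns, and connection details are invented, and the DDL assumes a reasonably recent ClickHouse version): a compression codec on a near-sorted column and a min-max skip index on a frequently filtered one.

```python
from clickhouse_driver import Client

client = Client(host="localhost")  # connection details are an assumption

client.execute("""
CREATE TABLE IF NOT EXISTS page_hits
(
    event_date  Date,
    event_time  DateTime CODEC(Delta, ZSTD),        -- delta-encode near-sorted timestamps
    user_id     UInt64,
    tags        Array(String),                       -- arrays instead of a join table
    url         String,
    INDEX user_idx user_id TYPE minmax GRANULARITY 4 -- skip index for selective filters
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
""")

print(client.execute("SELECT count() FROM page_hits WHERE user_id = 42"))
```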
Speaker Bio:
Robert Hodges is CEO of Altinity, which offers enterprise support for ClickHouse. He has over three decades of experience in data management spanning 20 different DBMS types. ClickHouse is his current favorite. ;)
Hyperspace is a recently open-sourced (https://github.com/microsoft/hyperspace) indexing sub-system from Microsoft. The key idea behind Hyperspace is simple: Users specify the indexes they want to build. Hyperspace builds these indexes using Apache Spark, and maintains metadata in its write-ahead log that is stored in the data lake. At runtime, Hyperspace automatically selects the best index to use for a given query without requiring users to rewrite their queries. Since Hyperspace was introduced, one of the most popular asks from the Spark community was indexing support for Delta Lake. In this talk, we present our experiences in designing and implementing Hyperspace support for Delta Lake and how it can be used for accelerating queries over Delta tables. We will cover the necessary foundations behind Delta Lake’s transaction log design and how Hyperspace enables indexing support that seamlessly works with the former’s time travel queries.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
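For context, the basic Hyperspace workflow looks roughly like the sketch below (the paths, columns, and index name are placeholders, and the exact Python API, including the call that enables the optimizer extension, varies by Hyperspace version):

```python
from hyperspace import Hyperspace, IndexConfig

# Assumes an existing SparkSession named `spark` with the Hyperspace jars available.
hs = Hyperspace(spark)

df = spark.read.format("delta").load("/data/events")  # a Delta table, as in the talk

# Build a covering index: lookups on `user_id`, with `country` included in the index data.
hs.createIndex(df, IndexConfig("events_by_user", ["user_id"], ["country"]))

# List the indexes Hyperspace knows about; once the optimizer extension is enabled for
# the session, eligible queries are rewritten to use them without any query changes.
hs.indexes().show()
```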
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
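For readers new to the API, the consumer side of a data source library looks like this minimal PySpark sketch (the format names, options, and paths are just examples):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasource-demo").getOrCreate()

# Unified load: the format string selects which data source implementation is used.
people = (spark.read
          .format("json")
          .option("multiLine", "false")
          .load("/tmp/people.json"))

# Unified save: the same DataFrame can be written back out through another source.
(people.write
       .format("parquet")
       .mode("overwrite")
       .save("/tmp/people_parquet"))
```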
Apache Kylin general introduction, including background, business needs and technical challenges, theory and architecture, features, and some technical detail, followed by performance and benchmarks and, finally, the ecosystem and roadmap.
For more detail, please visit http://kylin.io or follow @ApacheKylin.
Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets and subsecond query latency.
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop - HBaseCon
Kylin is an open source distributed analytics engine contributed by eBay that provides a SQL interface and OLAP on Hadoop, supporting extremely large datasets. Kylin's pre-built MOLAP cubes (stored in HBase), distributed architecture, and high concurrency help users analyze multidimensional queries via SQL and other BI tools. During this session, you'll learn how Kylin uses HBase's key-value store to serve SQL queries over a relational schema.
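To give a feel for how an OLAP cube maps onto HBase's key-value model, here is a purely conceptual sketch; it is not Kylin's actual on-disk encoding, and the dictionary, cuboid id, and measures are invented. Each cuboid row keys on the encoded dimension values and stores pre-aggregated measures as the value, so a GROUP BY becomes a range scan.

```python
import struct

# Conceptual only: Kylin dictionary-encodes dimension values into compact ids.
dim_dictionary = {"2015-05-07": 1, "US": 3, "electronics": 9}

def rowkey(cuboid_id, *dims):
    """Cuboid id plus encoded dimension values -> HBase row key (illustrative)."""
    key = struct.pack(">Q", cuboid_id)
    for d in dims:
        key += struct.pack(">I", dim_dictionary[d])
    return key

def value(total_sales, order_count):
    """Pre-aggregated measures stored as the cell value (illustrative)."""
    return struct.pack(">dI", total_sales, order_count)

# One pre-computed cell of the (date, country, category) cuboid:
print(rowkey(0b111, "2015-05-07", "US", "electronics").hex())
print(value(10234.5, 87).hex())
```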
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
Kylin Open Source Web Site: http://kylin.io
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive - Xu Jiang
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
If you want to do multi-dimensional analysis on large data sets (billion+ rows) with low query latency (sub-second), Kylin is a good option. Kylin also provides seamless integration with existing BI tools (e.g., Tableau).
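Integration is not limited to BI tools: any client can submit SQL to Kylin's REST query endpoint. A minimal sketch, assuming a local sandbox instance with the sample project (host, credentials, project, and table are placeholders; check the Kylin docs for the exact payload fields of your version):

```python
import requests

KYLIN = "http://localhost:7070/kylin/api"  # hypothetical sandbox host

resp = requests.post(
    f"{KYLIN}/query",
    auth=("ADMIN", "KYLIN"),  # default sandbox credentials
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
        "limit": 100,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["results"][:5])
```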
Leaving the Ivory Tower: Research in the Real World - Armon Dadgar
Academic research often has a reputation of being insular and seldom being used in the real world. At HashiCorp, we've had a long tradition of basing our tools and products on academic research. We look at research for the initial design of products, and for ongoing development of new features. Our industrial research group, HashiCorp Research, has even published novel work. In this talk we cover why we care, how we incorporate research, and what has been particularly useful for us.
Splunk Ninjas: New Features, Pivot, and Search Dojo - Splunk
Besides seeing the newest features in Splunk Enterprise and learning the best practices for data models and pivot, we will show you how to use a handful of search commands that will solve most search needs. Learn these well and become a ninja.
Similar to Apache Kylin - Balance Between Space and Time
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
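For a flavour of the hands-on portion, here is a minimal supervised-learning example of the kind of scikit-learn code used in such a workshop (the dataset and model choice are purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small, well-known dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Train a supervised model and evaluate it on unseen data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```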
Floating on a RAFT: HBase Durability with Apache Ratis - DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi - DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data with Apache Zeppelin against Phoenix tables as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
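As an illustration of putting a thin service on top of a Phoenix table from Python rather than Spring Boot (the server URL, table, and columns are placeholders, and this assumes the Phoenix Query Server is running):

```python
import phoenixdb

# The Phoenix Query Server speaks the Avatica protocol over HTTP.
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cur = conn.cursor()

# Count incidents per police district for one dispatch date (qmark-style parameters).
cur.execute(
    "SELECT dc_dist, COUNT(*) FROM PHILLY_CRIME "
    "WHERE dispatch_date = ? GROUP BY dc_dist",
    ["2019-06-01"],
)
for district, incidents in cur.fetchall():
    print(district, incidents)
```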
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... - DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it is not always trivial to design applications that make the most of it, nor is it the simplest system to operate. As it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and as its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where it should be written. This component is called Global Indexing. Without it, all records get treated as inserts and get re-written to HDFS instead of being updated, leading to duplicated data and breaking data correctness and user queries. This component is key to scaling our jobs, where we are now handling greater than 500 billion writes a day in our current ingestion systems, and it needs strong consistency and high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component, and it is critical in allowing us to scale our jobs to greater than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly into the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
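To illustrate the index's role in the ingestion path (this is not Uber's implementation; the table, key, and column names are invented), a global index in HBase is essentially a strongly consistent record-key to file-location lookup, which could be sketched from Python with the happybase client:

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")  # requires the HBase Thrift server
index = connection.table("trips_global_index")

def register(record_key: bytes, hdfs_file: bytes):
    """Remember where a record's data lives so later changes go to the same file."""
    index.put(record_key, {b"loc:hdfs_file": hdfs_file})

def locate(record_key: bytes):
    """Return the HDFS file previously written for this record, or None for a new insert."""
    return index.row(record_key).get(b"loc:hdfs_file")

# An incoming update is routed to the existing file instead of being treated as an insert.
register(b"trip#12345", b"hdfs:///warehouse/trips/2019-06/file_0007.parquet")
print(locate(b"trip#12345"))
```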
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix - DataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi - DataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor considering the variety of data sources that need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine - DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail and discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... - DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those lines of code, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
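The "few lines of code" mentioned above look roughly like the sketch below (the model, parameter, and metric are illustrative; by default, runs are logged to a local ./mlruns directory unless a tracking server is configured):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    n_estimators = 200
    model = RandomForestClassifier(n_estimators=n_estimators).fit(X_train, y_train)

    mlflow.log_param("n_estimators", n_estimators)                    # hyperparameter
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))   # evaluation metric
    mlflow.sklearn.log_model(model, "model")                          # deployable model package
```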
Extending Twitter's Data Platform to Google Cloud - DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, plus various tools and libraries that help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and we present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we dive into deeply in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi - DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger - DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premise and in cloud environments. We will go into the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... - DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing the possible ways that a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream, such as item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark - DataWorks Summit
Whole-genome-shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), which partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and the number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front end. I have also often seen developers implement front-end features by simply following the standard rules of a framework, assuming that this is enough to launch the project successfully, and then the project fails. How can this be prevented, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Search and Society: Reimagining Information Access for Radical Futures - Bhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies must adapt and embrace new ideas to keep up with the competition. However, fostering a culture of innovation takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk encourages a more independent approach to using PHP frameworks and a more flexible, future-proof way of developing with PHP.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are still at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that lead to closing the deal.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I asked myself, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure and operations point of view? Is it possible to apply our beloved cloud native principles to it as well? What benefits could both technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need in order to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
2. About us
§ Debashis Saha (@debashis_saha)
- VP, eBay Cloud Services (Platform, Infrastructure, Data)
§ Luke Han (@lukehq)
- Sr. Product Manager, Analytics Data Infrastructure
- Committer & PMC Member of Apache Kylin
4. What
kylin /ˈkiːˈlɪn/ 麒麟 --n. (in Chinese art) a mythical animal of composite form
http://kylin.io
@ApacheKylin
6. What
Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine, contributed by eBay Inc., that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
http://kylin.io
@ApacheKylin
7. What
• Open Sourced on Oct 1st, 2014
• Accepted as Apache Incubator Project on Nov 25th, 2014
• http://kylin.io (http://kylin.incubator.apache.org)
@ApacheKylin
8. Who is using Kylin?
§ External
- 25+ contributors in the community
- Adoption:
  - In production: Baidu Map
  - Evaluating: Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, Tableau… (from mailing list)
§ eBay internal cases (90th percentile query < 5 seconds):
  - User Session Analysis: 26 TB cube, 28+ billion raw rows
  - Traffic Analysis: 21 TB cube, 20+ billion raw rows
  - Behavior Analysis: 560 GB cube, 1.2+ billion raw rows
http://kylin.io
24. Feature Highlights
• Extremely Fast OLAP Engine at scale
• ANSI SQL Interface on Hadoop
• Seamless Integration with BI Tools, like Tableau
• Interactive Query Capability
• MOLAP Cube
• Incremental Build of Cubes
• Approximate Query Capability for Distinct Count (HyperLogLog; see the sketch after this list)
• Leverage HBase Coprocessor to reduce query latency
• Job Management and Monitoring
• User friendly Web GUI to manage, build, monitor and query cubes
• Security capability to set ACL at Cube/Project Level
• Support LDAP Integration
http://kylin.io
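The approximate distinct-count feature mentioned above is based on the HyperLogLog idea. As a rough illustration of the algorithm itself (a minimal Python sketch, not Kylin's actual Java implementation), the following builds an HLL register array and estimates cardinality:

import hashlib
import math

class HyperLogLog:
    # Minimal HyperLogLog sketch for illustration; not Kylin's implementation.
    def __init__(self, p=14):
        self.p = p                     # 2^p registers; more registers = lower error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)         # low p bits pick a register
        w = h >> self.p                # remaining 64 - p bits
        rank = (64 - self.p) - w.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:            # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog()
for user_id in range(100_000):
    hll.add(user_id)
print(round(hll.estimate()))           # close to 100,000; error is roughly 1% at p=14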
50. Kylin Architecture Overview
[Architecture diagram]
- Offline data flow: Hadoop / Hive (star schema data) → Cube Build Engine (MapReduce…) → data cube stored as key-value data in the OLAP Cube (HBase).
- Online analysis data flow: 3rd party apps (web app, mobile…) and SQL-based tools (BI tools: Tableau…) issue SQL via the REST API or JDBC/ODBC → REST Server → Query Engine → Routing, which serves low-latency (seconds) queries from the cube and mid-latency (minutes) queries from Hive. Cube and job metadata sits alongside the query engine.
➢ Clients/users interact with Kylin via SQL
➢ OLAP Cube is transparent to users
http://kylin.io
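The routing box in the diagram decides whether a SQL query can be answered from a pre-built cube (low latency) or must fall back to Hive (mid latency). The snippet below is a deliberately simplified sketch of that decision using a hypothetical CubeDescriptor structure; Kylin's real routing happens inside its query engine on top of the relational plan, not in user code:

from dataclasses import dataclass, field

@dataclass
class CubeDescriptor:
    # Hypothetical, simplified cube metadata for illustration only.
    name: str
    dimensions: set = field(default_factory=set)
    measures: set = field(default_factory=set)   # e.g. {"SUM(price)", "COUNT(*)"}

def route(query_dims, query_measures, cubes):
    # Return a cube that can answer the query, or None to fall back to Hive.
    for cube in cubes:
        if query_dims <= cube.dimensions and query_measures <= cube.measures:
            return cube
    return None

cubes = [CubeDescriptor("sales_cube",
                        dimensions={"week_beg_dt", "category_name", "site_name"},
                        measures={"SUM(price)", "COUNT(*)"})]

target = route({"week_beg_dt", "site_name"}, {"SUM(price)"}, cubes)
print(target.name if target else "route to Hive")   # -> sales_cube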
51. Data Modeling
[Diagram]
- Source: star schema (a fact table joined to dimension tables).
- Cube metadata describes the mapping: Cube, Fact Table, Dimensions, Measures, Storage (HBase).
- Target: HBase storage, where each aggregated cube row becomes an HBase row (row key plus a column family holding the measure values, e.g. rows A/B/C with Val 1/2/3).
- Roles: End User, Cube Modeler, Admin.
http://kylin.io
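To make the storage mapping concrete, here is one plausible sketch of how an aggregated cuboid row could be laid out as an HBase row key (a cuboid id followed by dictionary-encoded dimension values) with measures in a column family. The byte layout, the 4-byte dictionary ids, and the F1 family name are illustrative assumptions, not a byte-for-byte description of Kylin's storage format:

import struct

def encode_rowkey(cuboid_id, dim_values, dictionaries):
    # Row key = 8-byte cuboid id + fixed-length dictionary id of each dimension value.
    key = struct.pack(">Q", cuboid_id)
    for dim, value in dim_values:
        key += struct.pack(">I", dictionaries[dim][value])   # 4-byte dictionary id (assumed width)
    return key

dictionaries = {
    "week_beg_dt": {"2014-01-06": 0, "2014-01-13": 1},
    "site_name":   {"US": 0, "UK": 1},
}

# One aggregated row of the (week_beg_dt, site_name) cuboid.
rowkey = encode_rowkey(
    cuboid_id=0b11,                                  # bitmask of the two dimensions present
    dim_values=[("week_beg_dt", "2014-01-13"), ("site_name", "US")],
    dictionaries=dictionaries,
)
measures = {"F1:GMV": 1234.56, "F1:TRANS_CNT": 42}   # measures stored under a column family
print(rowkey.hex(), measures)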
55. Kylin Query Engine - Explain Plan
SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name,
       test_category.lvl3_name, test_kylin_fact.lstg_format_name, test_sites.site_name,
       SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT
FROM test_kylin_fact
LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
    AND test_kylin_fact.lstg_site_id = test_category.site_id
LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format_name = 'New'
GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name,
         test_category.lvl3_name, test_kylin_fact.lstg_format_name, test_sites.site_name
http://kylin.io
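In practice a query like this is submitted to Kylin through JDBC/ODBC or its REST API. The sketch below follows the shape of Kylin's documented query endpoint (POST /kylin/api/query); the host, project name, and demo credentials are assumptions to adapt to your own deployment:

import requests

KYLIN_URL = "http://localhost:7070/kylin/api/query"   # default Kylin port; adjust as needed

payload = {
    "sql": (
        "SELECT test_cal_dt.week_beg_dt, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT "
        "FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt "
        "GROUP BY test_cal_dt.week_beg_dt"
    ),
    "project": "learn_kylin",       # assumed project name
    "offset": 0,
    "limit": 100,
    "acceptPartial": False,
}

# Default demo credentials; replace with your own in a real deployment.
resp = requests.post(KYLIN_URL, json=payload, auth=("ADMIN", "KYLIN"))
resp.raise_for_status()
for row in resp.json()["results"]:
    print(row)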
56. Kylin Query Engine - Explain Plan
The same query produces the following logical plan:
OLAPToEnumerableConverter
  OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
    OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
      OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
        OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
          OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
            OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
              OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
              OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
            OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
http://kylin.io
58. Cube Optimization
§ Curse of Dimensionality
- An N-dimension cube has 2^N cuboids
§ Full Cube vs. Partial Cube
§ Huge Data Volume
- Dictionary Encoding
- Incremental Building
http://kylin.io
59. Full Cube vs. Partial Cube
§ Full Cube
- Pre-aggregate all dimension combinations
- "Curse of dimensionality": an N-dimension cube has 2^N cuboids
§ Partial Cube
- To avoid dimension explosion, we divide the dimensions into different aggregation groups
- 2^(N+M+L) → 2^N + 2^M + 2^L
- For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the cuboid count drops from about 1 billion to about 3 thousand
- 2^30 → 2^10 + 2^10 + 2^10
- Tradeoff between online aggregation and offline pre-aggregation
http://kylin.io
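The cuboid arithmetic on this slide is easy to re-check; a tiny sketch of the numbers:

# Full cube: a 30-dimension cube materializes every dimension combination.
full_cube_cuboids = 2 ** 30                           # 1,073,741,824 (~1 billion)

# Partial cube: split the 30 dimensions into 3 aggregation groups of 10.
partial_cube_cuboids = 2 ** 10 + 2 ** 10 + 2 ** 10    # 3,072 (~3 thousand)

print(f"full cube:    {full_cube_cuboids:,} cuboids")
print(f"partial cube: {partial_cube_cuboids:,} cuboids")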
62. What’s Next
§ Improve cube algorithm
- Cube by segments, 30%-50% faster
- Build delay down to tens of minutes
§ Streaming cubing
- Analyze real-time data
- Build delay down to seconds
§ Spark
http://kylin.io
63. Cube by Layer
§ The current algorithm
- Many MR rounds: roughly one per dimension layer
- Huge shuffles, aggregation at the reduce side, ~100x of total cube size
[Diagram: Full Data → 4-D Cuboid → 3-D Cuboid → 2-D Cuboid → 1-D Cuboid → 0-D Cuboid, with one MR job per layer]
http://kylin.io
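A tiny in-memory sketch of the layer-by-layer idea: the base cuboid is built from the raw fact rows once, and every lower layer is then aggregated from a parent cuboid one level above rather than from the raw data. In the real algorithm each pass is a MapReduce job; the data, dimension names, and SUM(price) measure below are made up for illustration:

from itertools import combinations
from collections import defaultdict

DIMS = ("site", "category", "format", "week")

fact_rows = [
    {"site": "US", "category": "Toys",  "format": "New",  "week": "W1", "price": 10.0},
    {"site": "US", "category": "Toys",  "format": "Used", "week": "W1", "price": 5.0},
    {"site": "UK", "category": "Books", "format": "New",  "week": "W2", "price": 7.5},
]

# Base cuboid (all dimensions) is built from the raw fact rows.
base = defaultdict(float)
for row in fact_rows:
    base[tuple(row[d] for d in DIMS)] += row["price"]
cube = {DIMS: dict(base)}

# Each lower layer is aggregated from a parent cuboid one level above,
# never from the raw data again: that is the "layer by layer" part.
for n in range(len(DIMS) - 1, -1, -1):
    for dims in combinations(DIMS, n):
        parent_dims = next(p for p in cube if set(dims) <= set(p) and len(p) == n + 1)
        positions = [parent_dims.index(d) for d in dims]
        child = defaultdict(float)
        for key, measure in cube[parent_dims].items():
            child[tuple(key[i] for i in positions)] += measure
        cube[dims] = dict(child)

print(len(cube), "cuboids")        # 2^4 = 16
print(cube[("site",)])             # {('US',): 15.0, ('UK',): 7.5}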
64. Cube by Segments
§ The to-be algorithm, 30%-50% faster
- 1 round of MR
- Reduced shuffles: map-side aggregation, ~20x of total cube size
- Hourly incremental build done in tens of minutes
[Diagram: each mapper turns a Data Split into a Cube Segment; the segments go through Merge Sort (Shuffle) into the Final Cube]
http://kylin.io
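By contrast, in the by-segment approach each mapper aggregates its own data split into a complete partial cube (map-side aggregation), and the partial cubes are then merged, which is what allows a single MR round. A self-contained sketch with made-up data and a SUM(price) measure:

from collections import defaultdict
from itertools import combinations

DIMS = ("site", "format")

def build_segment_cube(rows):
    # Map side: one mapper aggregates its data split into a full (partial) cube.
    cube = defaultdict(lambda: defaultdict(float))
    for row in rows:
        for n in range(len(DIMS) + 1):
            for dims in combinations(DIMS, n):
                cube[dims][tuple(row[d] for d in dims)] += row["price"]
    return cube

def merge_cubes(cubes):
    # Merge step: combine per-split cubes into the final cube by summing measures.
    final = defaultdict(lambda: defaultdict(float))
    for cube in cubes:
        for dims, cuboid in cube.items():
            for key, measure in cuboid.items():
                final[dims][key] += measure
    return final

split_1 = [{"site": "US", "format": "New", "price": 10.0},
           {"site": "US", "format": "Used", "price": 5.0}]
split_2 = [{"site": "UK", "format": "New", "price": 7.5}]

final_cube = merge_cubes([build_segment_cube(split_1), build_segment_cube(split_2)])
print(dict(final_cube[("site",)]))   # {('US',): 15.0, ('UK',): 7.5}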
65. Streaming Cubing
§ Cube is great but…
- Cube takes time to build, how about real-time analysis?
- Sometimes we want to drill down to row level information
§ Streaming cubing
- Build micro cube segments from streaming
- Use inverted index to capture last minute data
http://kylin.io
70. Adding Spark Support
§ Cubing Efficiency
- MR is not an optimal framework
- Spark Cubing Engine
§ Source from SparkSQL
- Read data from SparkSQL instead of Hive
§ Route to SparkSQL
- Unsupported queries are covered by SparkSQL
http://kylin.io
75.-80. Kylin Evolution Roadmap (2013 - 2016 and beyond)
• Sep 2013 - Jan 2014: Initial Prototype for MOLAP
- Basic end-to-end POC
• Oct 2014: MOLAP
- Incremental Refresh, ANSI SQL, ODBC Driver, Web GUI, Tableau, ACL, Open Source
• H1 2015: Streaming OLAP
- Streaming OLAP, JDBC Driver, New UI, Excel, SparkSQL, … more
• Hybrid OLAP
- Lambda Arch, Automation, Capacity Management, Spark, … more
• Next Gen (TBD)
- Adv OLAP Functions, In-Memory Analysis (TBD), Mobile (TBD), … more
http://kylin.io
81. Kylin Ecosystem
■ Kylin Core
- Fundamental framework of the Kylin OLAP Engine
■ Extension
- Plugins that support additional functions and features
■ Integration
- Lifecycle management support to integrate with other applications
■ Interface
- Allows third-party users to build more features via a user interface atop the Kylin core
[Diagram: Kylin OLAP Core surrounded by Extension (Security, Redis Storage, Spark Engine, Docker), Interface (Web Console, Customized BI, Ambari/Hue Plugin), and Integration (ODBC Driver, ETL, Drill, SparkSQL)]
http://kylin.io
82. "If you want to go fast, go alone. If you want to go far, go together." -- African Proverb
dev@kylin.incubator.apache.org
http://kylin.io