See how the world’s leading open source solution for query acceleration on massive datasets is revolutionizing analytics for enterprises across every industry, and how you can get started using it in your organization.
https://www.brighttalk.com/webcast/18317/413952
Apache Kylin 2.0: From Classic OLAP to Real-Time Data Warehouse - Yang Li
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
This talk covers best practices and patterns for designing an efficient cube in Kylin, including concepts like mandatory dimensions, hierarchy dimensions, derived dimensions, incremental builds, and aggregation groups.
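The combinatorics behind these cube-design choices can be sketched in a few lines of Python (illustrative only; Kylin's actual cuboid pruning is more involved): with n dimensions a full cube has 2^n cuboids, a mandatory dimension halves that, and splitting dimensions into independent aggregation groups shrinks it further.

```python
# Illustrative cuboid counting for Kylin-style cube design
# (a sketch, not Kylin's real cube planner).

def full_cuboids(n_dims: int) -> int:
    # Every subset of dimensions is a cuboid: 2^n in total.
    return 2 ** n_dims

def with_mandatory(n_dims: int, n_mandatory: int) -> int:
    # Mandatory dimensions appear in every cuboid, so only the
    # remaining dimensions vary: 2^(n - m) cuboids.
    return 2 ** (n_dims - n_mandatory)

def with_aggregation_groups(group_sizes: list[int]) -> int:
    # Each group is cubed independently; the total is roughly the
    # sum of each group's combinations instead of one giant product.
    return sum(2 ** size for size in group_sizes)

print(full_cuboids(10))                 # 1024
print(with_mandatory(10, 2))            # 256
print(with_aggregation_groups([5, 5]))  # 64
```

The numbers show why these settings matter: two mandatory dimensions cut 1024 cuboids to 256, and two aggregation groups of five dimensions cut them to 64.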
Virtual Flink Forward 2020: A Deep Dive into Flink SQL - Jark Wu, Flink Forward
During the last two major versions (1.9 and 1.10), the Apache Flink community spent a great deal of effort improving the architecture toward further unified batch and streaming processing. One example is that Flink SQL added the ability to support multiple SQL planners under the same API. This talk first discusses the motivation behind these changes, then takes a deep dive into Flink SQL. The presentation shows the unified architecture for handling streaming and batch queries and explains how Flink translates queries into relational expressions, leverages Apache Calcite to optimize them, and generates efficient runtime code for execution. It also describes the lifetime of a query in detail: how the optimizer improves the plan based on relational node patterns, how Flink leverages a binary data format for its basic data structures, and how certain operators work. This gives the audience a better understanding of Flink SQL internals.
Accelerating Big Data Analytics with Apache Kylin - Tyler Wishnoff
Learn about the latest advancements in Apache Kylin and how its OLAP technology is making analytics faster and insights more actionable.
Learn more about Apache Kylin: https://kyligence.io/apache-kylin-overview/
Learn more about Apache Kylin's enterprise version Kyligence: https://kyligence.io/
New Approaches for Fraud Detection on Apache Kafka and KSQL - Confluent
Speakers: Dale Kim, Sr. Director, Products/Solutions, Arcadia Data + Chong Yan, Solutions Architect, Confluent
When it comes to corporate fraud, early detection is integral to mitigating and preventing drastic damage.
Modern streaming data technologies like Apache Kafka® and Confluent KSQL, the streaming SQL engine for Apache Kafka, can help companies detect fraud in real time instead of after the fact. Kafka is ideal for managing fast, incoming data points, and KSQL provides the de facto standard for reading that data. Combine this with Arcadia Data visualizations designed for modern data types, and you have a powerful foundation for combating fraud.
You will learn:
-Why traditional batch-driven approaches to fraud detection are insufficient today
-Why Apache Kafka is widely used for real-time fraud detection
-How KSQL and real-time visualizations open more opportunities for searching for fraud
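The streaming-versus-batch contrast above can be pictured without Kafka at all. The toy Python below (all names and thresholds are invented for illustration; this is not KSQL) flags an account the moment its transaction count inside a sliding window exceeds a threshold, rather than in a nightly batch job.

```python
from collections import deque, defaultdict

# Toy sliding-window fraud check: flag an account as soon as it makes
# more than `max_txns` transactions within `window_secs` seconds.
class FraudDetector:
    def __init__(self, window_secs: int = 60, max_txns: int = 3):
        self.window_secs = window_secs
        self.max_txns = max_txns
        self.events = defaultdict(deque)  # account -> timestamps in window

    def observe(self, account: str, ts: float) -> bool:
        q = self.events[account]
        q.append(ts)
        # Drop timestamps that have fallen out of the window.
        while q and ts - q[0] > self.window_secs:
            q.popleft()
        return len(q) > self.max_txns  # True => suspicious

detector = FraudDetector(window_secs=60, max_txns=3)
alerts = [detector.observe("acct-42", t) for t in (0, 10, 20, 30, 200)]
print(alerts)  # [False, False, False, True, False]
```

The fourth transaction inside the 60-second window trips the alert immediately; a batch job would only have surfaced it hours later.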
Big data is a huge world. There are lots of technologies, old and new, and all these options can be overwhelming for beginners who want to start working on Big Data projects.
In this session, we are going to talk about the basics of Big Data - what it is, and what it is not. We will focus on Hadoop, Hive, Spark, Kafka, and their use cases.
Deep Learning at Extreme Scale (in the Cloud) with the Apache Kafka Open Sou... - Kai Wähner
How to Build a Machine Learning Infrastructure with Kafka, Connect, Streams, KSQL, etc…
This talk shows how to build Machine Learning models at extreme scale and how to productionize the built models in mission-critical real time applications by leveraging open source components in the public cloud. The session discusses the relation between TensorFlow and the Apache Kafka ecosystem - and why this is a great fit for machine learning at extreme scale.
The Machine Learning architecture includes: Kafka Connect for continuous high volume data ingestion into the public cloud, TensorFlow leveraging Deep Learning algorithms to build an analytic model on powerful GPUs, Kafka Streams for model deployment and inference in real time, and KSQL for real time analytics of predictions, alerts and model accuracy.
Sensor analytics for predictive alerting in real time is used as a real-world example from Internet of Things scenarios. A live demo shows the out-of-the-box integration and dynamic scalability of these components on Google Cloud.
Key takeaways for the audience
• Learn how to build a Machine Learning infrastructure at extreme scale and how to productionize the built models in mission-critical real time applications
• Understand the benefits of a machine learning platform on the public cloud
• Learn about an extreme scale Machine Learning architecture around the Apache Kafka open source ecosystem including Kafka Connect, Kafka Streams and KSQL
• See a live demo for an Internet of Things use case: Sensor analytics for predictive alerting in real time
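The ingest-train-serve split described above can be mimicked in plain Python (a hedged sketch; the real architecture uses Kafka Connect, TensorFlow, and Kafka Streams, none of which appear here): a model is fit offline on historical data, then applied record-by-record as events stream past.

```python
# Minimal offline-train / online-score split, standing in for the
# TensorFlow + Kafka Streams pipeline described in the talk.

def train_threshold_model(history: list[float]) -> float:
    # "Training": pick an anomaly threshold at mean + 3 sigma.
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    return mean + 3 * var ** 0.5

def score_stream(threshold: float, events):
    # "Inference": applied per event, as a stream processor would.
    for reading in events:
        yield (reading, reading > threshold)

model = train_threshold_model([10.0, 11.0, 9.0, 10.0])
results = list(score_stream(model, [10.5, 30.0]))
print(results)  # [(10.5, False), (30.0, True)]
```

The key property this preserves is that scoring is stateless per event, which is what lets the real deployment scale out the inference side independently of training.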
How Apache Drives Music Recommendations At Spotify - Josh Baer
The slides go through the high-level process of generating personalized playlists for all Spotify's users, using Apache big data products extensively.
Presentation given at Apache: Big Data Europe conference on September 29th, 2015 in Budapest.
URP? Excuse You! The Three Kafka Metrics You Need to Know - Todd Palino
What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker that will leave you an expert in identifying problems with the least amount of pain:
Under-replicated Partitions: The mother of all metrics
Request Latencies: Why your users complain
Thread pool utilization: How could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
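The headline metric of this talk is conceptually simple to derive from partition metadata. A hedged Python sketch (the field names are illustrative, not Kafka's actual metadata schema or AdminClient API): a partition is under-replicated whenever its in-sync replica set is smaller than its assigned replica set.

```python
# Count under-replicated partitions: any partition whose in-sync
# replica set (ISR) is smaller than its assigned replica set.
# (Field names are illustrative, not Kafka's real metadata schema.)

def under_replicated(partitions: list[dict]) -> int:
    return sum(
        1 for p in partitions
        if len(p["isr"]) < len(p["replicas"])
    )

metadata = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "orders", "partition": 1, "replicas": [1, 2, 3], "isr": [1, 3]},
]
print(under_replicated(metadata))  # 1
```

A nonzero count means some broker has fallen behind on replication, which is why the talk calls it "the mother of all metrics."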
Apache Hive is an Enterprise Data Warehouse built on top of Hadoop. Hive supports Insert/Update/Delete SQL statements with transactional semantics and read operations that run at Snapshot Isolation. This talk will describe the intended use cases, the architecture of the implementation, new features such as the SQL MERGE statement, and recent improvements. The talk will also cover the Streaming Ingest API, which allows writing batches of events into a Hive table without using SQL. This API is used by Apache NiFi, Storm, and Flume to stream data directly into Hive tables and make it visible to readers in near real time.
Druid is a high-performance, column-oriented distributed data store that is widely used at Oath for big data analysis. Druid has a JSON schema as its query language, making it difficult for new users unfamiliar with the schema to start querying Druid quickly. The JSON schema is designed to work with Druid's data ingestion methods, so it can provide high-performance features such as data aggregations in JSON, but many users are unable to utilize such features because they are not familiar with the specifics of how to optimize Druid queries. However, most new Druid users at Yahoo are already very familiar with SQL, and the queries they want to write for Druid can be converted to concise SQL.
We found that our data analysts wanted an easy way to issue ad-hoc Druid queries and view the results in a BI tool in a way that's presentable to nontechnical stakeholders. In order to achieve this, we had to bridge the gap between Druid, SQL, and our BI tools such as Apache Superset. In this talk, we will explore different ways to query a Druid datasource in SQL and discuss which methods were most appropriate for our use cases. We will also discuss our open source contributions so others can utilize our work. GURUGANESH KOTTA, Software Dev Eng, Oath and JUNXIAN WU, Software Engineer, Oath Inc.
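The gap the speakers describe can be pictured as a translation step. The sketch below is a toy mapping from a simple aggregation spec to a Druid-style groupBy query body; it is not Druid's real SQL layer (which is built on Apache Calcite), and the helper name is invented.

```python
import json

# Toy translation from a simple aggregation spec to a Druid-style
# groupBy query body. Illustrative only -- real Druid SQL is far richer.
def to_druid_groupby(datasource, dimension, metric, interval):
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "dimensions": [dimension],
        "aggregations": [
            {"type": "doubleSum", "name": metric, "fieldName": metric}
        ],
        "intervals": [interval],
        "granularity": "all",
    }

query = to_druid_groupby("pageviews", "country", "views",
                         "2018-01-01/2018-01-02")
print(json.dumps(query, indent=2))
```

Even this tiny example shows why analysts preferred SQL: `SELECT country, SUM(views) FROM pageviews` is one line, while the equivalent JSON body takes a dozen.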
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ... - Hosted by Confluent
Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after the messages are produced to Kafka.
One challenge we faced was updating existing data in Pinot with the changelog in Kafka while delivering an accurate view in the real-time analytical results. For example, the financial dashboard can report gross bookings with corrected ride fares, and restaurant owners can analyze UberEats orders with their latest delivery status.
Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We needed to make architectural changes in how data is distributed via Kafka amongst the server nodes and how it's indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end and how we added this ability in Pinot without any performance degradation.
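Why partition-by-key matters for upserts: if every record for a given primary key hashes to the same partition, the node owning that partition can resolve updates locally, with no cross-node coordination. A toy Python sketch of that invariant (invented data; not Pinot's implementation):

```python
# Toy partition-by-key upsert: records for the same key always land on
# the same partition, so each partition can keep the latest value locally.

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Stable hash so the same key always maps to the same partition.
    return sum(key.encode()) % NUM_PARTITIONS

partitions = [{} for _ in range(NUM_PARTITIONS)]  # per-partition latest state

def upsert(key: str, value: dict) -> None:
    partitions[partition_for(key)][key] = value  # newest record wins

upsert("ride-7", {"fare": 12.50, "status": "ongoing"})
upsert("ride-7", {"fare": 14.00, "status": "completed"})  # fare correction

latest = partitions[partition_for("ride-7")]["ride-7"]
print(latest)  # {'fare': 14.0, 'status': 'completed'}
```

Queries then see only the corrected record, which is exactly the "accurate view" the abstract describes for ride fares and delivery statuses.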
Apache Ranger’s pluggable architecture allows centralized authoring of authorization policies and access audits for Hadoop and non-Hadoop components alike. The authorization policy model is designed to capture and express the complex authorization needs of each component.
In this session, we will present two more key enhancements made to the policy model in the next release to make it richer and support advanced authorization needs of contemporary enterprise security infrastructure.
•Ranger service definition is enhanced to support specification of allowed accesses on a given resource. This specification is then utilized to present only valid accesses when authoring a policy targeted at the resource.
•Ranger policy model is enhanced to support time-based policies that temporarily grant or deny access to a resource during a specified time window. The time specification supports a time zone, which is enforced relative to the time zone of the component where the Ranger plugin runs.
We will conclude by a demonstration of these new capabilities. ABHAY KULKARNI, Engineer, Hortonworks and RAMESH MANI, Staff Software Engineer, Hortonworks
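Evaluating such a time-bounded policy comes down to a timezone-aware window check. A hedged Python sketch (the policy shape below is invented for illustration, not Ranger's actual policy model or JSON):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Toy time-based policy check: access is allowed only inside a window,
# evaluated in the policy's own time zone. (Policy shape is invented,
# not Ranger's real policy model.)
def is_allowed(policy: dict, now_utc: datetime) -> bool:
    local = now_utc.astimezone(ZoneInfo(policy["tz"]))
    return policy["start_hour"] <= local.hour < policy["end_hour"]

policy = {"tz": "America/Los_Angeles", "start_hour": 9, "end_hour": 17}
now = datetime(2023, 6, 1, 20, 0, tzinfo=ZoneInfo("UTC"))  # 13:00 in LA (PDT)
print(is_allowed(policy, now))  # True
```

Converting to the policy's zone before comparing is the important step: the same UTC instant can fall inside the window for one component and outside it for another.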
Open Source Technologies in the Analytics Revolution - Samantha Berlant
One of the hallmarks of modern analytics is that data pipelines are largely built upon open source software (OSS). It is entirely possible to create cutting edge data science, machine learning, data engineering, ETL processing, and predictive analytics pipelines without using any commercial software. Of course, OSS does not necessarily mean “free,” but as a thought experiment, the first part of this session will explore the role of OSS in your data analytics stacks and data pipelines.
For the second half of this presentation, we will examine how OSS tools and platforms can be used to learn and create your own Machine Learning and Data Analytics projects without breaking the bank.
View the presentation: https://youtu.be/JbNuikWKC1Q
Building Enterprise OLAP on Hadoop for FSI - Luke Han
Building Enterprise OLAP on Hadoop for the Financial Services Industry, with a use case from CPIC (a Fortune 500 insurance company) on replacing legacy IBM Cognos OLAP with the Kyligence platform.
Take the Bias out of Big Data Insights With Augmented Analytics - Tyler Wishnoff
Is bias impacting your Big Data insights? Learn how augmented analytics and the latest advancements in OLAP technology are making analytics (including on cloud) from business intelligence, data science, and machine learning more accurate and impactful. Learn more at https://kyligence.io
Integrating and fully utilizing data is a critical prerequisite for ensuring the success of data-driven operations and decision making. This is especially true as more and more corporations begin transforming legacy data warehouses and transitioning to the Cloud. See how Augmented OLAP technology is leading the way in streamlining Big Data analytics on the Cloud with this presentation by Kyligence CEO Luke Han at Big Things Conference 2019. Learn more here: https://kyligence.io
Apache Kylin and Use Cases - 2018 Big Data Spain - Luke Han
Apache Kylin is rapidly being adopted around the world as the leading open source OLAP engine for Big Data. In this talk, Luke Han, creator and PMC chair of Apache Kylin, introduces the motivation behind the project and its technical highlights, and explores how various industries use Apache Kylin and the resulting business impact.
Lightning-Fast, Interactive Business Intelligence Performance with MicroStrat... - Tyler Wishnoff
See how extreme query speeds and ultra-high concurrency on MicroStrategy, and any other business intelligence (BI) tool, on Big Data is possible through the Kyligence platform. Learn more here: https://kyligence.io/
With an explosion of data, today’s emerging needs are not being met by existing technologies, which require rich skill sets and expertise. Companies that want to lead changes in highly competitive markets must optimize their storage, speed, and spending. The key is for them to augment their data management and analytics platforms with artificial intelligence and machine learning for analysts, engineers, and other users.
Big Data, Machine Learning, and AI have created new opportunities for organizations worldwide, but this has also put tremendous pressure on IT and data engineering teams to scale and maintain analytics performance on massive datasets. This presentation at Strata Data Conference London 2019 by Luke Han explains how Augmented OLAP technology solves the challenges of analytics on massive datasets while reducing IT costs. Learn more about this powerful approach to Big Data here: https://kyligence.io/
A lot has changed with OLAP in the last few years and this presentation offers a great overview of how OLAP has evolved with the help of Augmented Analytics. See why Augmented OLAP is proving to be the best way to ensure high-performance analytics at any scale, and learn which large enterprises have already adopted this approach and how it's helping them. Learn more about Augmented OLAP and what it can do at: https://kyligence.io/
Cloud-native Semantic Layer on Data LakeDatabricks
With larger volumes of more real-time data stored in the data lake, it becomes more complex to manage that data and serve analytics and applications. With differing service interfaces, data definitions, and performance characteristics across scenarios, business users begin to lose confidence in the quality and efficiency of getting insight from their data.
Batched To Perfection: Modeling & Solving Business Problems With Apache Spark - Eliav Lavi
As data gets bigger, the applications we're maintaining as developers are becoming increasingly data-hungry. There comes a point where simply querying all the raw data and crunching it into some meaningful piece of information at request time is just not giving us good enough performance. Maybe our application starts lagging. This is when pre-calculating our aggregations becomes crucial.
In this talk, we will examine how can Apache Spark help in building elegant and accurate batch processing data pipelines. This will allow us to maintain our pre-calculated aggregations and make our web applications run blazingly fast again. Along the way we will make use of some other cool technologies, such as Databricks' Delta Lake, and throw some functional programming goodness into the process.
Eliav Lavi is a technical lead @ Riskified, where he's been working for the past 7 years. Previously a classically trained musician, he shifted to a developer position in order to pursue his long-time passion for tech. Today he is part of Riskified's Account Protection team, preventing bad actors from taking over customers' accounts.
AI-Powered Analytics: What It Is and How It’s Powering the Next Generation of... - Tyler Wishnoff
Learn how to empower your analysts with easier access to all the data they need, exactly when they need it - all while reducing workloads for IT and data engineering.
This presentation will walk you through those challenges, what modern options are available for solving them, and how taking an AI-powered approach to self-service analytics may yield the greatest level of data access along with the best possible performance. Learn more here: https://kyligence.io/
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist... - Data Con LA
Leading entrepreneurial outfits are disrupting traditional companies by rapidly building data-driven apps. They employ top software talent and effectively use storage, analytics and app-dev tools from various open source ecosystems. We show how companies of all sizes are now transforming into data-driven enterprises using their existing software skill sets by leveraging a single platform that combines flexible data storage systems, advanced analytics and agile app-dev PaaS frameworks, all available now in open source forums.
Kyligence Cloud 4 - Feature Focus: Spark-Powered Cubing and Indexing - Samantha Berlant
You’ve moved your data to the cloud, awesome. Now you’re running into issues of concurrency, scale, and cost overruns. But there’s a better way to run your cloud analytics if you think of cloud resources as commodities to conserve and maximize. Sure, you could run the same query from start to finish every time - or you could speed things up, and save some cash in the process, by precomputing those queries and storing the responses for fast retrieval at any time, by any number of analysts.
Kyligence Cloud 4’s Spark-Powered Cubing and Indexing feature provides just that - intelligent precomputation, which fundamentally boils down to low-cost, high-performance analytics. Join us for the fourth part of this series exploring the key features of Kyligence Cloud 4.
In this webinar you will learn:
-About modern, cloud era OLAP and cubing theory
-Performance gains you’ll get from intelligent precomputation
-How to apply cloud computing and distributed processing
-Precomputation strategies and tactics
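The precomputation idea behind cubing boils down to a tiny sketch (pure Python with invented data; real systems like Kylin build these rollups with Spark): aggregate once at build time for every combination of dimensions, then answer repeated queries with a lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations

rows = [
    {"country": "US", "device": "mobile", "views": 3},
    {"country": "US", "device": "desktop", "views": 5},
    {"country": "DE", "device": "mobile", "views": 2},
]

# Build time: precompute SUM(views) for every subset of dimensions
# (every "cuboid"), so queries become lookups instead of scans.
dims = ("country", "device")
cube = defaultdict(int)
for row in rows:
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            key = (group, tuple(row[d] for d in group))
            cube[key] += row["views"]

# Query time: constant-time lookups, shared by any number of analysts.
print(cube[(("country",), ("US",))])                    # 8
print(cube[((), ())])                                   # 10
print(cube[(("country", "device"), ("US", "mobile"))])  # 3
```

The build cost is paid once; every subsequent query over any dimension combination is a dictionary lookup, which is the "low-cost, high-performance" trade the webinar describes.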
Smashing Through Big Data Barriers with Tableau and Snowflake - Samantha Berlant
Your analysts are working with more data than ever before in Tableau. Chances are, as the data volumes grow, your teams are experiencing some slowdowns. While it may be tempting to blame Tableau, the most likely explanation for performance and scalability pains lies in your data service layer. What if you could transform the way you do analytics without having to retrain your Tableau users? What if you could get more critical business value out of Tableau, and your data, without disrupting the way your business operates?
Join us for this session to learn how Tableau could be the ultimate window into ALL of your valuable data, no matter how large. Learn how precomputation technology and AI-augmented query optimization can help you break free of the downward performance spiral of legacy analytics approaches.
In this presentation, you will learn:
-How to get the fastest big data analytics experience on Tableau
-How a unified semantic layer can ensure that your current Tableau users are not disrupted by big data
-How to improve your analytics operations with automation and machine intelligence
Watch the webinar to see this technology in action during the live Snowflake demo. Enter the onramp to unmatched performance with big data analytics on Tableau.
If you have big data, more and more of your analytics stack needs to be intelligent. Your tools need to be able to anticipate the needs of your analysts, customers, and your business. With the AI-Augmented Engine, this learning process is automated and predictive. It intelligently adapts to user behavior and query patterns and learns to anticipate each user's needs. Join us for the third installment of this series diving into the core features of Kyligence Cloud 4.
In this presentation you will learn:
-How the Kyligence Cloud 4 AI-Augmented Engine works
-How the AI-Augmented Engine gives optimal efficiency for cube building
-How the AI-Augmented Engine greatly simplifies data modeling
Watch the webinar here: https://www.brighttalk.com/webcast/18317/480320
Precomputation or Data Virtualization, which one is right for you? - SamanthaBerlant
In the world of cloud analytics, what role do precomputation and distributed OLAP play compared with a data virtualization approach? Which should you choose? Do they compete or complement each other? This webinar will address these questions and provide some guidance for how to choose the right approach for your circumstances.
Both technologies are trying to address a similar challenge: make analytics easily accessible to a wider audience in a modern big data environment. Precomputation focuses on performance, response time, and concurrency in the production environment. Data Virtualization technologies focus on making analysis easily available to users by reducing or eliminating ETL and data warehouses.
In this presentation we will cover:
-The key differences between precomputation and data virtualization
-How your choice between the two affects data quality, security, governance, and TCO
-The financial impact each of these technologies has on your analytics program
Architecting Snowflake for High Concurrency and High Performance - SamanthaBerlant
Cloud Data Warehousing juggernaut Snowflake has raced out ahead of the pack to deliver a data management platform from which a wealth of new analytics can be run. Using Snowflake as a traditional data warehouse has some obvious cost advantages over a hardware solution. But the real value of Snowflake as a data platform lies in its ability to support a high-concurrency analytics platform using Kyligence Cloud, powered by Apache Kylin.
In this presentation, Senior Solutions Architect Robert Hardaway will describe a modern data service architecture using precomputation and distributed indexes to provide interactive analytics to hundreds or even thousands of users running against very large Snowflake datasets (TBs to PBs).
In January of this year, Kyligence announced the immediate availability of Kyligence Cloud 4, the first fully cloud-native, distributed OLAP platform. During our announcement, EMA analyst John Santaferraro said:
“As the race for unified analytics heats up, Kyligence offers a solution that overcomes the challenges of querying data in both data lakes and data warehouses located both in the cloud and on premises.”
Join Li Kang - VP of North America at Kyligence - as he provides an overview of the Kyligence Cloud 4 release that will show:
-The new cloud native architecture that employs Apache Kylin, Apache Spark, and Apache Parquet to ensure optimal performance.
-How KC4 delivers sub-second query responses on very large datasets using precomputed aggregate indexes (hyper-cubes) and table indexes.
-The AI-Augmented engine that intelligently organizes your data and reduces data modeling time from days/weeks to minutes.
In this presentation, we will present the Kyligence Cloud 4 story - high-speed analytics with unprecedented sub-second query response times against petabyte datasets.
Extreme Excel: How a 35-Year-Old Desktop App Smashed Through the Big Data Bar... - SamanthaBerlant
People have been using Excel for 35 years. There are over 750 million Excel users. People are making magic with Excel every day. With the surging interest in big data, advanced analytics, and the cloud, how does Excel stay relevant and how extreme can Excel get? In this presentation, we will examine:
o Traditional limits of Excel performance, scale, dataset sizes
o Cloud technologies that make Excel better
o Defining the new extremes for Excel power users
Speaker Bio:
Rachel Beddor is a Solutions Engineer for Kyligence where she creates technical content to enhance the learning experience for new Apache Kylin and Kyligence users. She has dedicated her career to making technology more accessible, fun, and inviting to people of all backgrounds.
Addressing the systemic shortcomings of cloud analytics - SamanthaBerlant
Learn how existing open source technologies like Apache Kylin, Spark, and Mondrian can be used to increase the value of your analytics investment.
As we enter what some have called The Golden Age of Analytics, there are still some fundamental challenges that plague even the largest and most sophisticated cloud analytics adopters. Chief among these is the challenge of scale, often reflected in limitations of concurrency, multi-tenancy, distributed query performance, and all manner of latencies.
Other less obvious, but equally crucial, challenges of scale and performance have to do with IT and end-user productivity. In other words, there have been few technological advances that enable the quick deployment of big data analytics and the rapid creation of business value from the data being analyzed.
This presentation will consider a few of these systemic challenges and suggest some ways that they can be addressed with available open source technology such as Apache Kylin, Apache Spark, and Mondrian.
Presenter:
Kaige Liu is a Senior Solutions Architect at Kyligence, where he works on building the next-generation big data analytics platform. Previously, he worked on the OpenStack and Bluemix team at IBM, focusing on cloud computing and virtualization technology. Kaige loves the open source community and is an active Apache Kylin committer.
SF Big Analytics Meetup - Exact Count Distinct with Apache Kylin - SamanthaBerlant
With over 450 million customers, Didi (world’s largest rideshare company) conducts complex user behavior analysis on huge datasets daily. Exact Count Distinct is one of Didi’s most critical metrics, but it is known for being computationally heavy and notoriously slow. The difference between exact Count Distinct and approximate Count Distinct can cost Didi millions of dollars. In this talk, Kaige Liu of the Apache Kylin project will explain how Didi uses Apache Kylin to return exact Distinct Count on billions of rows of data with sub-second latency to generate the most accurate picture of its business.
You will also learn about the latest development in modern OLAP technologies. Kaige will share how Didi and Truck Alliance (a truck-hailing company that processes $100 billion worth of goods yearly) use Apache Kylin to power their analytics platforms that allow 100s of analysts to achieve sub-second latency on petabyte-scale data.
Learn how to solve the top 3 challenges Snowflake customers face, and what you can do to ensure high-performance, intelligent analytics at any scale. Ideal for those currently using Snowflake and those considering it.
https://www.brighttalk.com/webcast/18317/422499
Enhance Data Governance with Kyligence Unified Semantic Layer - SamanthaBerlant
Simplify data lake governance, no matter how much data you work with and how many data sources and BI tools you manage. This presentation offers all you need to develop your own strategy for smarter data lake governance.
https://www.brighttalk.com/webcast/18317/414017
How to Guarantee Exact Count Distinct Queries with Sub-Second Latency on Mass... - SamanthaBerlant
See how to consistently deliver accurate COUNT DISTINCT queries in under a second, even on petabyte-scale datasets. This presentation will share Apache Kylin’s approach to COUNT DISTINCT queries for user behavior analysis.
https://www.brighttalk.com/webcast/18317/414006
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated catalog of AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
Key findings include:
-Increased frequency and complexity of cyber threats.
-Escalation of state-sponsored and criminally motivated cyber operations.
-Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation. The notes below benchmark the vector primitives these algorithms build on.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).