Presented at Elastic's worldwide "Virtual Meetup" on 5/15/2024
https://www.youtube.com/watch?v=ayQap5pH_0w
https://john.soban.ski/aggregations-the-elasticsearch-group-by.html
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A... (Altinity Ltd)
The document discusses how OpsVerse migrated their Jaeger distributed tracing storage from Cassandra to ClickHouse for improved performance monitoring. Jaeger is an open source distributed tracing system that was originally designed to use Elasticsearch or Cassandra for storage. While Cassandra worked well for basic functionality, it lacked capabilities for advanced analytics. ClickHouse supports richer query functions and better handles large datasets. The document outlines the steps OpsVerse took to implement the ClickHouse storage plugin for Jaeger and deploy ClickHouse on Kubernetes using the ClickHouse Operator. This migration enabled more insightful performance monitoring and analytics.
This presentation summarizes how we use Elasticsearch for analytics at Wingify for our product Visual Website Optimizer (http://vwo.com). This presentation was prepared for my poster session at The Fifth Elephant (https://funnel.hasgeek.com/fifthel2014/1143-using-elasticsearch-for-analytics).
The slides for the first ever SnappyData webinar. Covers SnappyData core concepts, programming models, benchmarks and more.
SnappyData is open sourced here: https://github.com/SnappyDataInc/snappydata
We also have a deep technical paper here: http://www.snappydata.io/snappy-industrial
We can be easily contacted on Slack, Gitter and more: http://www.snappydata.io/about#contactus
This document discusses using Elasticsearch for dashboard data management applications. It describes uploading data from WLCG Transfers into Elasticsearch, performing matrix and plot queries to retrieve aggregated statistics, and different methods for grouping the results. Matrix and plot queries on Elasticsearch clusters hosted on virtual and physical machines were tested, with load times generally faster than Oracle but slower on the shared physical cluster due to other application load.
AWS October Webinar Series - Introducing Amazon Elasticsearch Service (Amazon Web Services)
Running Elasticsearch often requires specialized expertise and significant resources to operate and manage infrastructure and Elasticsearch software.
Amazon Elasticsearch Service makes it easy to deploy, operate, and scale Elasticsearch in AWS.
In this webinar, we will walk through how to launch a fully functional Amazon Elasticsearch domain, load your data, and analyze it using the built-in Kibana integration. We will also cover the CloudWatch Logs integration, which enables you to have your log data, such as VPC logs, automatically loaded into your Amazon Elasticsearch domain for analysis and exploration.
This document provides an overview of using Elasticsearch for data analytics. It discusses various aggregation techniques in Elasticsearch like terms, min/max/avg/sum, cardinality, histogram, date_histogram, and nested aggregations. It also covers mappings, dynamic templates, and general tips for working with aggregations. The main takeaways are that aggregations in Elasticsearch provide insights into data distributions and relationships similarly to GROUP BY in SQL, and that mappings and templates can optimize how data is indexed for aggregation purposes.
Visualize some of Austin's open source data using Elasticsearch with Kibana. ObjectRocket's Steve Croce presented this talk on 10/13/17 at the DBaaS event in Austin, TX.
Don't optimize my queries, organize my data! (Julian Hyde)
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive databases, and how we infer and evolve the data model based on the queries.
Visualizing Austin's data with Elasticsearch and Kibana (ObjectRocket)
This document provides an introduction to Elasticsearch and Kibana. It describes what Elasticsearch is and how it can scale to handle large amounts of data and queries. It also describes Kibana and how it is used for data visualization. The document then demonstrates how to use Elasticsearch and Kibana together to visualize and analyze Austin transportation and restaurant inspection data.
This document provides guidance on sizing Elastic Stack deployments for security use cases. It discusses Elasticsearch internals and computing resources needed for different node roles. It recommends preparing by ingesting sample data and monitoring size and ingestion rates to calculate storage needs. The document also discusses optimizing performance by understanding hardware capabilities, balancing cluster size and costs, and aiming for optimal shard sizes. It suggests using techniques like cross-cluster search, data tiering, and transforms. Guidance is provided on scaling Kibana and the detection engine. Examples are given for calculating storage needs and determining necessary data nodes for small and large deployments.
WSO2 Product Release Webinar: WSO2 Data Analytics Server 3.0 (WSO2)
To view a recording of this webinar, please use the URL below:
WSO2 Data Analytics Server (WSO2 DAS) version 3.0 is the successor of WSO2 Business Activity Monitor 2.5. It is based on the latest technologies and is an evolutionary upgrade to the current system. WSO2 DAS comes with a comprehensive set of new features including support for pluggable data sources, support for batch processing with Apache Spark, support for distributed data indexing, a new dashboard and support for unified data querying with analytics REST APIs.
The WSO2 DAS combines real-time, batch, interactive, and predictive (via machine learning) analysis of data into a single integrated platform. This webinar will present and demonstrate the following key features and capabilities in detail:
Pluggable data sources support with its new data abstraction layer
Batch analytics using the Apache Spark analytics engine
Interactive analysis powered by Apache Lucene
An analytics dashboard to visualize results
Activity monitoring capabilities for tracking related events in a system
This document summarizes a presentation about near real-time analytics platforms at Uber and LinkedIn. It discusses use cases for streaming analytics, challenges with scalability and operations, and new platforms developed using Apache Samza and SQL. Key points include how Samza is used to build streaming applications with SQL queries, operators, and support for multi-stage workflows. The platforms aim to simplify deployment and management of streaming jobs through interfaces like AthenaX.
See webinar recording of this presentation at: https://resource.alibabacloud.com/webinar/live.htm?&webinarId=67
In this presentation, you will learn all you need to know about Elasticsearch, one of the most widely used open source search platforms in the world. We will walk you through what Elasticsearch is, why you need it, and show common use cases. First, we will introduce Elasticsearch and the best practices for deploying it, as well as show what some of the salient features of the platform are. In the second part of the webinar, we delve into the various use cases for Elasticsearch and show why it is an excellent platform to query a large dataset. This includes a demo on querying a cluster. Finally, we will show how you can launch an Elasticsearch cluster on Alibaba Cloud and how to use Elasticsearch to query a large dataset for an autocomplete use case.
Learn more about Alibaba Cloud’s Elasticsearch offering:
https://www.alibabacloud.com/product/elasticsearch
1. Kusto (Azure Data Explorer) is a fast and flexible data exploration service for analyzing security and application logs, performance counters, and other streaming data.
2. A Data Engineer's role is evolving to focus more on real-time analysis using Kusto as opposed to traditional SQL. Understanding how to use Kusto's query engine and data ingestion capabilities is key.
3. Techniques like using materialized views, partitioning data, and leader-follower databases can help distribute workloads and improve query performance at scale in Kusto. However, Kusto has limitations around concurrency, memory usage, and result set sizes that need to be considered.
Solr Power FTW: Powering NoSQL the World Over (Alex Pinkin)
Solr is an open source, Lucene based search platform originally developed by CNET and used by the likes of Netflix, Yelp, and StubHub which has been rapidly growing in popularity and features during the last few years. Learn how Solr can be used as a Not Only SQL (NoSQL) database along the lines of Cassandra, Memcached, and Redis. NoSQL data stores are regularly described as non-relational, distributed, internet-scalable and are used at both Facebook and Digg. This presentation will quickly cover the fundamentals of NoSQL data stores, the basics of Lucene, and what Solr brings to the table. Following that we will dive into the technical details of making Solr your primary query engine on large scale web applications, thus relegating your traditional relational database to little more than a simple key store. Real solutions to problems like handling four billion requests per month will be presented. We'll talk about sizing and configuring the Solr instances to maintain rapid response times under heavy load. We'll show you how to change the schema on a live system with tens of millions of documents indexed while supporting real-time results. And finally, we'll answer your questions about ways to work around the lack of transactions in Solr and how you can do all of this in a highly available solution.
Run your queries 14X faster without any investment! (Knoldus Inc.)
This document discusses techniques for improving query performance on large datasets through pre-aggregation. It explains that pre-aggregating data into summary tables can speed up queries by orders of magnitude. Partitioning, bucketing, and creating aggregate tables are recommended to optimize storage. Dimensional modeling principles for designing aggregate tables are also covered, along with best practices for determining which aggregates to create and query optimization.
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks (Grega Kespret)
Celtra provides a platform for streamlined ad creation and campaign management used by customers including Porsche, Taco Bell, and Fox to create, track, and analyze their digital display advertising. Celtra’s platform processes billions of ad events daily to give analysts fast and easy access to reports and ad hoc analytics. Celtra’s Grega Kešpret leads a technical dive into Celtra’s data-pipeline challenges and explains how it solved them by combining Snowflake’s cloud data warehouse with Spark to get the best of both.
Topics include:
- Why Celtra changed its pipeline, materializing session representations to eliminate the need to rerun its pipeline
- How and why it decided to use Snowflake rather than an alternative data warehouse or a home-grown custom solution
- How Snowflake complemented the existing Spark environment with the ability to store and analyze deeply nested data with full consistency
- How Snowflake + Spark enables production and ad hoc analytics on a single repository of data
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ... (Amazon Web Services)
Analyze Big Data for Consumer Applications with Looker BI and Amazon Redshift Customizing the customer experience based on user behavior is a constant challenge for today’s consumer apps. Business intelligence helps analyze and model large amounts of data. Looker offers a modern approach to BI leveraging AWS that’s fast, agile, and easy to manage. Join this webinar to learn how MessageMe, which provides emotionally engaging messaging apps to consumers, leverages Looker business intelligence software and the Amazon Redshift data warehouse service to analyze billions of rows of customer data in seconds.
Webinar topics include:
• How MessageMe turns billions of rows of customer data stored in Amazon Redshift into actionable insights
• How Looker connects directly to Amazon Redshift in just a few clicks, enabling MessageMe to build modern, big data analytics in the cloud.
Who should attend:
• Information or Solution Architects, Data Analysts, BI Directors, DBAs, Development Leads, Developers, or Technical IT Leaders.
Presenters:
• Justin Rosenthal, CTO, MessageMe
• Keenan Rice, VP, Marketing & Alliances, Looker
• Tina Adams, Senior Product Manager, AWS
Kai Sasaki from Treasure Data discusses their efforts to implement auto scaling for their distributed Presto and Hive query engines. They decoupled the storage layer from the processing engines to allow dynamic scaling. They migrated infrastructure to AWS CodeDeploy and Auto Scaling Groups to automate deployments and scaling. They implemented target tracking auto scaling based on CPU usage but found it did not work well due to conservative scale-in behavior and long-running queries blocking instance termination. Future work includes real auto scaling without target tracking and auto query migration during outages.
Dyn delivers exceptional Internet Performance. Enabling high quality services requires data centers around the globe. In order to manage services, customers need timely insight collected from all over the world. Dyn uses DataStax Enterprise (DSE) to deploy complex clusters across multiple datacenters to enable sub 50 ms query responses for hundreds of billions of data points. From granular DNS traffic data, to aggregated counts for a variety of report dimensions, DSE at Dyn has been up since 2013 and has shined through upgrades, data center migrations, DDoS attacks and hardware failures. In this webinar, Principal Engineers Tim Chadwick and Rick Bross cover the requirements which led them to choose DSE as their go-to Big Data solution, the path which led to SPARK, and the lessons that we’ve learned in the process.
(BDT209) Launch: Amazon Elasticsearch For Real-Time Data Analytics (Amazon Web Services)
Organizations are collecting an ever-increasing amount of data from numerous sources such as log systems, click streams, and connected devices. Launched in 2009, Elasticsearch —an open-source analytics and search engine— has emerged as a popular tool for real-time analytics and visualization of data. Some of the most common use cases include risk assessment, error detection, and sentiment analysis. However, as data volumes and applications grow, managing Elasticsearch clusters can consume significant IT resources while adding little or no differentiated value to the organization. Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Amazon ES offers the benefits of a managed service, including cluster provisioning, easy configuration, replication for high availability, scaling options, data durability, security, and node monitoring. This session presents a technical deep dive on Amazon ES. Attendees learn: Common challenges with real-time data analytics and visualization and how to address them; the benefits, reference architecture, and best practices for using Amazon ES; and data ingestion options with Amazon DynamoDB, AWS Lambda, and Amazon Kinesis.
The document discusses MongoDB and how it compares to relational database management systems (RDBMS). It provides examples of how data can be modeled and stored differently in MongoDB compared to SQL databases. Specifically, it discusses how MongoDB allows for flexible, dynamic schemas as each document can have a different structure. This enables complex data like product catalogs with varying attributes for different items to be stored easily in a single collection. The document also provides examples of common operations like insert, update and delete in MongoDB compared to SQL.
This document discusses how Amazon SageMaker can be used to train machine learning models on large datasets using hosted Jupyter notebooks. It notes that DigitalGlobe plans to use SageMaker to train models on petabytes of Earth observation imagery so that users can create and deploy models within one scalable environment. The document also quotes the CTO of Maxar Technologies saying they will use SageMaker to build and deploy novel AI algorithms at scale to solve complex problems.
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra... (Databricks)
This talk presents how we accelerated deep learning processing from preprocessing to inference and training on Apache Spark in SK Telecom. In SK Telecom, we have half of the Korean population as our customers. To support them, we have 400,000 cell towers, which generate logs with geospatial tags.
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent... (Amazon Web Services)
Join us for the first-ever Amazon DynamoDB practical hands-on workshop. This session is designed for developers, engineers, and database administrators who are involved in designing and maintaining DynamoDB applications. We begin with a walkthrough of proven NoSQL design patterns for at-scale applications. Next, we use step-by-step instructions to apply lessons learned to design DynamoDB tables and indexes that are optimized for performance and cost. Expect to leave this session with the knowledge to build and monitor DynamoDB applications that can grow to any size and scale. Attendees should have a basic understanding of DynamoDB. To attend this workshop, bring your laptop.
Real-Time Forecasting at Scale using Delta Lake and Delta Caching (Databricks)
GumGum receives around 30 billion programmatic inventory impressions amounting to 25 TB of data each day. Inventory impression is the real estate to show potential ads on a publisher page. By generating near-real-time inventory forecast based on campaign-specific targeting rules, GumGum enables the account managers to set up successful future campaigns.
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ... (NETWAYS)
At Uber we use high cardinality monitoring to observe and detect issues with our 4,000 microservices running on Mesos and across our infrastructure systems and servers. We’ll cover how we put the resulting 6 billion plus time series to work in a variety of different ways, auto-discovering services and their usage of other systems at Uber, setting up and tearing down alerts automatically for services, sending smart alert notifications that rollup different failures into individual high level contextual alerts, and more. We’ll also talk about how we accomplish all this with a global view of our systems with M3, our open source metrics platform. We’ll take a deep dive look at how we use M3DB, now available as an open source Prometheus long term storage backend, to horizontally scale our metrics platform in a cost efficient manner with a system that’s still sane to operate with petabytes of metrics data.
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
2. About Me
John Sobanski
○VP AI/ML at Pyramid Systems Inc.
○Elastic Certified Engineer (2020, 2022)
○https://soban.ski
3. Agenda
●Background - Relational Database Management Systems
●Aggregations – The Elasticsearch GROUP BY
○Why use the Elasticsearch Aggregation API
○1st Demo - Execute GROUP BY operations via Elasticsearch aggregations
●Elasticsearch Aggregations Drive Time Series Data Visualization
○BUCKETS in Elasticsearch
○2nd Demo - Time Series Data Visualization with Kibana
6. RDBMS Series Operations
●In the traditional Relational Database Management System (RDBMS) world, SQL databases use GROUP BY syntax to group rows with similar values into summary rows.
●The query, "find the number of web page hits per country," for example, represents a typical GROUP BY operation.
This table records the number of hits to my site [https://soban.ski], broken down by time zone.
7. ●Further summarize the table to record "hits per country" via a GROUP BY operation
○Collapse the Time zones into parent countries.
SQL Syntax:
SELECT COUNTRY, SUM(HITS) FROM timezone_hits GROUP BY COUNTRY;
8. RDBMS vs. Elasticsearch
●Even though Elasticsearch does not use the row construct to identify a unit of data (Elastic calls their rows Documents), we can still perform GROUP BY queries in Elasticsearch.
●Elasticsearch names their GROUP BY queries Aggregations.
●Elasticsearch provides an expressive REST API to execute Aggregations
○Kibana also provides a Graphical User Interface (GUI) to execute Aggregations
ElasticSearch : Relational Database
Index : Database
Type : Table
Documents : Rows
Fields : Columns
10. Why use the Elasticsearch Aggregation API? - 1
●Easy Approach: Query Elasticsearch, convert the returned JSON to a Pandas Dataframe, and then apply a Pandas GROUP BY to the Dataframe to retrieve summary stats.
●Modern laptops include 32GB of memory, and you may have had no issues with this method. If you use Elasticsearch for non-time-series data, e.g. static data for blogs, you may not need to worry about running out of memory.
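A minimal sketch of this easy approach, assuming a local cluster at http://localhost:9200, an index named timezone_hits, and documents with country and hits fields (the endpoint and names are illustrative assumptions, not part of the original deck):

import pandas as pd
import requests

# Pull the raw documents out of Elasticsearch; every matching document
# travels over the wire and lands in local memory.
resp = requests.post(
    "http://localhost:9200/timezone_hits/_search",
    json={"size": 10000, "query": {"match_all": {}}},
)
docs = [hit["_source"] for hit in resp.json()["hits"]["hits"]]

# Client-side GROUP BY with Pandas.
df = pd.DataFrame(docs)
print(df.groupby("country")["hits"].sum())

This works as long as the full result set fits on the laptop, which is exactly the assumption the next two slides challenge.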
11. Why use the Elasticsearch Aggregation API? - 2
●In the future, you may deal with Big Data. If you collect time series data, such as access logs or security logs, you might scale to Big Data. In that case, the Elasticsearch database size will exceed the memory of your laptop.
12. Why use the Elasticsearch Aggregation API? - 3
●The Aggs API allows you to command Elasticsearch to execute the GROUP BY analogue in situ (a best practice), and then also apply the summary stats in place.
●Elasticsearch will then return the summary stats as JSON, and you will not run out of memory.
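A minimal sketch of the in-situ approach, using the same illustrative index and field names as the Pandas example above (timezone_hits, country, hits); the endpoint and names are assumptions, not part of the original deck:

import requests

# A terms bucket per country with a sum sub-aggregation: the Elasticsearch
# analogue of SELECT COUNTRY, SUM(HITS) ... GROUP BY COUNTRY.
query = {
    "size": 0,  # return only the aggregation results, no documents
    "aggs": {
        "by_country": {
            "terms": {"field": "country"},
            "aggs": {"total_hits": {"sum": {"field": "hits"}}},
        }
    },
}

resp = requests.post("http://localhost:9200/timezone_hits/_search", json=query)

# Only the summary stats come back as JSON, one small bucket per country.
for bucket in resp.json()["aggregations"]["by_country"]["buckets"]:
    print(bucket["key"], bucket["total_hits"]["value"])

The grouping and summing run inside the cluster, so the client never holds more than the per-country summaries.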
13. Time Series
●GROUP BY (RDBMS) and Aggregation (Elasticsearch) operations lend themselves well to Time Series data, since these operations allow you to GROUP BY or Aggregate results over a given time bucket (e.g. Hour, Day, Week, Month, etc.).
14. 1st Demo
●Execute GROUP BY operations via Elasticsearch aggregations
●Demonstrate how to generate tables via both Kibana and the Elasticsearch API.
16. Elasticsearch BUCKETS
●In Elasticsearch parlance, we put the rows into BUCKETS, with one BUCKET for each country.
●Consider a query for total hits. The Elasticsearch API reports that I received ~100k hits in the month of June.
●Add the hits to BUCKETS!
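A minimal sketch of that "total hits" query, assuming an illustrative @timestamp date field alongside the hits field used earlier (the field names and the date range are assumptions for illustration only):

import requests

# A date range filter plus a single sum metric: one number for the whole month.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "2024-06-01", "lt": "2024-07-01"}}},
    "aggs": {"total_hits": {"sum": {"field": "hits"}}},
}
resp = requests.post("http://localhost:9200/timezone_hits/_search", json=query)
print(resp.json()["aggregations"]["total_hits"]["value"])

The following slides split this single number into per-bucket totals instead.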
19. Time Buckets
●Time BUCKETS
○Analyze, summarize and visualize time series data.
○BUCKETS return a smaller amount of data, for example, hits per hour
○We need BUCKETS because we cannot plot every datum from a big data document store
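A minimal sketch of a time BUCKET aggregation that returns hits per hour, again assuming the illustrative @timestamp and hits fields on the timezone_hits index:

import requests

query = {
    "size": 0,
    "aggs": {
        "hits_per_hour": {
            "date_histogram": {"field": "@timestamp", "calendar_interval": "hour"},
            "aggs": {"total_hits": {"sum": {"field": "hits"}}},
        }
    },
}
resp = requests.post("http://localhost:9200/timezone_hits/_search", json=query)

# One bucket per hour, each small enough to plot.
for bucket in resp.json()["aggregations"]["hits_per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["total_hits"]["value"])

Kibana issues the same kind of date_histogram aggregation under the hood when it draws a time series chart.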
23. Conclusion
●Summarize Series
○Sum, Median, etc.
○RDBMS Uses GROUP BY
●Elasticsearch Uses AGGS
○Separate data by tag via BUCKETS
○Country Buckets, Time Buckets, Month Buckets