Building a Real-Time Gaming Analytics Service with Apache Druid

•

1 like•259 views

At GameAnalytics we receive and process real time behavioural data from more than 100 million daily active users, helping thousands of game studios and developers understand user behaviour and improve their games. In this talk, you will learn how we managed to migrate our legacy backend system from using an in-house built streaming analytics service to Apache Druid, and the lessons learned along the way. By adopting Druid, we have been able to reduce development costs, increase reliability of our systems and implement new features that would have not been possible with our old stack. We will provide an overview of our approach to schema design, segments optimization, creation of our query layer, caching and datasources optimisation, which can help you better understand how you can successfully use Druid as a key component on your data processing and reporting infrastructure.

Technology

Building a Real-Time
analytics service with
Apache Druid
Virtual Druid Summit
October 2020
Ramón Lastres Guerrero, Director of Engineering,
GameAnalytics
1

Introduction to GameAnalytics
user behaviour analytics
focused on just gaming
SDKs
Rest API https://gameanalytics.com/docs/item/rest-api-doc
results in real-time and also historical aggregate

Introduction to GameAnalytics
150M+ 25,000+ 19B+ JSON1.7B+ 1 TB

Key Performance Indicators: Player Retention

Technical Requirements
high level technical requirements
● (responsive Frontend)
● real time queries
● Reliability
● infrastructure cost
● ﬂexible querying / ﬁltering
● number of unique users

Backend Overview
three main components
●
●
●

Druid: Query Layer
build our own query layer
● deﬁne metrics on backend side
● Implement authentication
● caching, query priorities, rate limiting
Elixir language
Druid client for Elixir

A / B testing and Druid
What is
A / B testing?

A / B testing and Druid
real-time
result metrics in real time
probabilistic model
variants are just Druid dimensions

Druid: Cluster Topology
Imply Cloud
●
●
●

Druid: Performance numbers
multi-tenancy
75k queries per hour
rollup
DAU 1.4k

Hash partitioning VS single dimension partitioning
game_id (our tenant id) dimension
unstable EMR ingestion
hashed partitioning

Druid: Query Layer Caching
always implement good caching

Annotation System
SDK attribution
partners

Annotation System: Calculating Player Retention
retention
calculation
increases the size by ~30%
installation timestamp (truncated to day)

Single Datasource VS Multiple Datasources
one single Kinesis stream
high cardinality low cardinality
reduce number of rows processed

Single Datasource VS Multiple Datasources
daily
Datasource Daily Size Daily segments Avg. Segment size
Small ~ 8.5GB ~ 10 ~ 550MB
Reduced ~ 50GB ~ 75 ~ 550MB
Full ~ 290GB ~ 450 ~ 550MB

Single Datasource VS Multiple Datasources
●
●
●
●
●
●

Druid: Tiering
leverage tiering
use it to lower costs
serving more frequently accessed data
with more powerful hardware

Druid: Lookups
joins with data stored outside of Druid
using lookups we can query on studio and
organization level

Dates: November 10, 2020
druidsummit.org
38
Register Now for
the Next Druid
Virtual Summit

As Twitch grew, both the amount of data we received and the number of employees interested in the data grew rapidly. In order to continue empowering decision making as we scaled, we turned to using Druid and Imply to provide self service analytics to both our technical and non technical staff allowing them to drill into high level metrics in lieu of reading generated reports. In this talk, learn how Twitch implemented a common analytics platform for the needs of many different teams supporting hundreds of users, thousands of queries, and ~5 billion events each day. This session will explain our Druid architecture in detail, including: -The end-to-end architecture deployed on Amazon that includes Kinesis, RDS, S3, Druid, Pivot and Tableau -How the data is brought together to deliver a unified view of live customer engagement and historical trends -Operational best practices we learnt scaling Druid -An example walk through using the platform

Archmage, Pinterest’s Real-time Analytics Platform on Druid

Imply

In this talk, we will talk about: 1) the motivation of switching from Hbase backed analytics system to Druid 2) the architecture design of Druid as a platform in Pinterest (Archmage, Hadoop, Kafka) including a query interface, Archmage, a thrift service in front of Druid which exposes a thrift api to company-wise clients, handles Druid broker hosts discovery, serves as a relay to broker hosts to abstract the async HTTP connection and provides query optimizations transparent to clients including directly translating fixed pattern SQL to Druid native JSON queries to save planning time. In addition, we’ll cover the production Hadoop batch and Kafka real time ingestion pipeline setup and the reason we picked a pull-based solution instead of a push-based solution for real time ingestion. 3) We will also talk about the use cases currently running in production on this platform including their data volume, QPS, Druid cluster setup, the unique challenges we met while onboarding and how we addressed them with extensive tunings to meet SLA and lessons learned for use cases including: partner insights, which provides partners with stats on organic pins; realtime spam detection, which detects user login related anomaly events and pin related spamming events like pin creation and repin; and migrating the backend from Presto to Druid for Ads related experiments data analysis.

Splunk: Druid on Kubernetes with Druid-operator

Imply

We went through the journey of deploying Apache Druid clusters on Kubernetes(K8s) and created a druid-operator (https://github.com/druid-io/druid-operator). This talk introduces the druid kubernetes operator, how to use it to deploy druid clusters and how it works under the hood. We will share how we use this operator to deploy Druid clusters at Splunk. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Druid is a complex stateful distributed system and a Druid cluster consists of multiple web services such as Broker, Historical, Coordinator, Overlord, MiddleManager etc each deployed with multiple replicas. Deploying a single web service on K8s requires creating few K8s resources via YAML files and it multiplies due to multiple services inside of a Druid cluster. Now doing it for multiple Druid clusters (dev, staging, production environments) makes it even more tedious and error prone. K8s enables creation of application (such as Druid) specific extension, called “Operator”, that combines kubernetes and application specific knowledge into a reusable K8s extension that makes deploying complex applications simple.

Apache Druid®: A Dance of Distributed Processes

Imply

Apache Druid® is an open source analytics database powering fresh, fast analytics in companies from AirBnB to Zeotap on clickstream, telemetry, financial transactions, applications and more. In this talk, we open the box on the three distributed processes in Druid led by the coordinator, overlord, and broker, and the ways that these come together to deliver reliable, performant query, ingestion, and management services.

Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021

StreamNative

You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!

Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix

DataWorks Summit

Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies? In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into details of our data quality solution. We’ll cover what worked, what didn’t work so well, and what we plan to work on next. We’ll conclude with some tips and lessons learned for ensuring data quality on big data.

How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...

Imply

Making Apache Spark Better with Delta Lake

Databricks

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs. In this talk, we will cover: * What data quality problems Delta helps address * How to convert your existing application to Delta Lake * How the Delta Lake transaction protocol works internally * The Delta Lake roadmap for the next few releases * How to get involved!

Change Data Feed is a new feature of Delta Lake on Databricks that is available as a public preview since DBR 8.2. This feature enables a new class of ETL workloads such as incremental table/view maintenance and change auditing that were not possible before. In short, users will now be able to query row level changes across different versions of a Delta table. In this talk we will dive into how Change Data Feed works under the hood and how to use it with existing ETL jobs to make them more efficient and also go over some new workloads it can enable.

Data Discovery at Databricks with Amundsen

Databricks

Databricks used to use a static manually maintained wiki page for internal data exploration. We will discuss how we leverage Amundsen, an open source data discovery tool from Linux Foundation AI & Data, to improve productivity with trust by surfacing the most relevant dataset and SQL analytics dashboard with its important information programmatically at Databricks internally. We will also talk about how we integrate Amundsen with Databricks world class infrastructure to surface metadata including: Surface the most popular tables used within Databricks Support fuzzy search and facet search for dataset- Surface rich metadata on datasets: Lineage information (downstream table, upstream table, downstream jobs, downstream users) Dataset owner Dataset frequent users Delta extend metadata (e.g change history) ETL job that generates the dataset Column stats on numeric type columns Dashboards that use the given dataset Use Databricks data tab to show the sample data Surface metadata on dashboards including: create time, last update time, tables used, etc Last but not least, we will discuss how we incorporate internal user feedback and provide the same discovery productivity improvements for Databricks customers in the future.

Microsoft Azure Cost Optimization and improve efficiency

Kushan Lahiru Perera

Iceberg: A modern table format for big data (Strata NY 2018)

Ryan Blue

Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait. Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including: * All reads use snapshot isolation without locking. * No directory listings are required for query planning. * Files can be added, removed, or replaced atomically. * Full schema evolution supports changes in the table over time. * Partitioning evolution enables changes to the physical layout without breaking existing queries. * Data files are stored as Avro, ORC, or Parquet. * Support for Spark, Pig, and Presto.

Apache Flink and what it is used for

Aljoscha Krettek

A Thorough Comparison of Delta Lake, Iceberg and Hudi

Databricks

Free Training: How to Build a Lakehouse

Databricks

Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data. That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads. Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache SparkTM, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos. Here’s what you’ll learn in this 2-hour session: How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse

Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...

HostedbyConfluent

At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and high cardinality of unique accounts. aFurthermore, it is financially critical to get up-to-date, accurate analytics over all records. Due to the changing nature of real time transactions, it is impossible to pre-compute the analytics as a fixed time series. We have overcome the challenge by creating a real time key-value store inside Pinot that can sustain half million QPS with all the financial transactions. We will talk about the details of our solution and the interesting technical challenges faced.

0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019

confluent

Tesla ingests trillions of events every day from hundreds of unique data sources through our streaming data platform. Find out how we developed a set of high-throughput, non-blocking primitives that allow us to transform and ingest data into a variety of data stores with minimal development time. Additionally, we will discuss how these primitives allowed us to completely migrate the streaming platform in just a few months. Finally, we will talk about how we scale team size sub-linearly to data volumes, while continuing to onboard new use cases.

How Uber scaled its Real Time Infrastructure to Trillion events per day

DataWorks Summit

Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.

Dataday Texas 2016 - Datadog

Datadog

Delta from a Data Engineer's Perspective

Databricks

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...

DataWorks Summit

Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...

Databricks

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.

Owning Your Own (Data) Lake House

Data Con LA

Data Con LA 2020 Description In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake. Speaker Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning

InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx

InfluxData

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...

Flink Forward

Flink Forward San Francisco 2022. Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way. by Jeff Chao

My first 90 days with ClickHouse.pdf

Alkin Tezuysal

This talk will tell the story of an analytics use case database from a non-OLAP and ACID-compliant RDBMS (MySQL) perspective. I will cover the basics of the Clickhouse database Sample Clickhouse installation in a lab environment. We are configuring Clickhouse for essential operations. We will load the sample data set and monitor it. We will query and visualize the results. This talk will also base on how Kubernetes can help Clickhouse implementation via an operator. Conclusions will include Do's and Don't of this emerging technology. Best practices and some advice around ingesting and analyzing terabytes of data efficiently.

Learn to Use Databricks for Data Science

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.

Predicting Flights with Azure Databricks

Sarah Dutkiewicz

Databricks is a popular tool used with large amounts of data, applying to many roles - including data analysts, data engineers, data scientists, and machine learning engineers. It can be found on many cloud platforms - including Azure, AWS, and GCP. In this talk, we will look at a flight-themed end-to-end solution using Azure Databricks, Azure Data Factory, Azure Storage, and Power BI. By the end of this session, you will have a better understanding of Databricks' capabilities and how it integrates with other Azure offerings.

Game Analytics at London Apache Druid Meetup

Jelena Zanko

RedisConf18 - Video Experience Operational Insights in Real Time.

Redis Labs

What's hot

Change Data Feed in Delta

Databricks

Data Discovery at Databricks with Amundsen

Databricks

Microsoft Azure Cost Optimization and improve efficiency

Kushan Lahiru Perera

Iceberg: A modern table format for big data (Strata NY 2018)

Ryan Blue

Apache Flink and what it is used for

Aljoscha Krettek

A Thorough Comparison of Delta Lake, Iceberg and Hudi

Databricks

Free Training: How to Build a Lakehouse

Databricks

Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...

HostedbyConfluent

0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019

confluent

How Uber scaled its Real Time Infrastructure to Trillion events per day

DataWorks Summit

Dataday Texas 2016 - Datadog

Datadog

Delta from a Data Engineer's Perspective

Databricks

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...

DataWorks Summit

Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...

Databricks

Owning Your Own (Data) Lake House

Data Con LA

InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx

InfluxData

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...

Flink Forward

My first 90 days with ClickHouse.pdf

Alkin Tezuysal

Learn to Use Databricks for Data Science

Databricks

Predicting Flights with Azure Databricks

Sarah Dutkiewicz

What's hot (20)

Change Data Feed in Delta

Data Discovery at Databricks with Amundsen

Microsoft Azure Cost Optimization and improve efficiency

Iceberg: A modern table format for big data (Strata NY 2018)

Apache Flink and what it is used for

A Thorough Comparison of Delta Lake, Iceberg and Hudi

Free Training: How to Build a Lakehouse

Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...

0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019

How Uber scaled its Real Time Infrastructure to Trillion events per day

Dataday Texas 2016 - Datadog

Delta from a Data Engineer's Perspective

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...

Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...

Owning Your Own (Data) Lake House

InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...

My first 90 days with ClickHouse.pdf

Learn to Use Databricks for Data Science

Predicting Flights with Azure Databricks

Similar to Building a Real-Time Gaming Analytics Service with Apache Druid

Game Analytics at London Apache Druid Meetup

Jelena Zanko

RedisConf18 - Video Experience Operational Insights in Real Time.

Redis Labs

Webinar slides: Free Monitoring (on Steroids) for MySQL, MariaDB, PostgreSQL ...

Severalnines

Traditional server monitoring tools are not built for modern distributed database architectures. Let’s face it, most production databases today run in some kind of high availability setup - from simpler master-slave replication to multi-master clusters fronted by redundant load balancers. Operations teams deal with dozens, often hundreds of services that make up the database environment. This is why we built ClusterControl - to address modern, highly distributed database setups based on replication or clustering. We wanted something that could provide a systems view of all the components of a distributed cluster, including load balancers. Watch this replay of a webinar on free database monitoring using ClusterControl Community Edition. We show you how to monitor all your MySQL, MariaDB, PostgreSQL and MongoDB systems from a single point of control - whether they are deployed as Galera Clusters, sharded clusters or replication setups across on-prem and cloud data centers. We also see how to use Advisors in order to improve performance. AGENDA - Requirements for monitoring distributed database systems - Cloud-based vs On-prem monitoring solutions - Agent-based vs Agentless monitoring - Deepdive into ClusterControl Community Edition - Architecture - Metrics Collection - Trending - Dashboards - Queries - Performance Advisors - Other features available to Community users SPEAKER Bartlomiej Oles is a MySQL and Oracle DBA, with over 15 years experience in managing highly available production systems at IBM, Nordea Bank, Acxiom, Lufthansa, and other Fortune 500 companies. In the past five years, his focus has been on building and applying automation tools to manage multi-datacenter database environments.

Building a data pipeline to ingest data into Hadoop in minutes using Streamse...

Guglielmo Iozzia

Sprint 45 review

ManageIQ

MongoDB Sharding Webinar 2014

Dylan Tong

#TwitterRealTime - Real time processing @twitter

Twitter Developers

Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Redis Labs

Think you have big data? What about high availability requirements? At DataDog we process billions of data points every day including metrics and events, as we help the world monitor the their applications and infrastructure. Being the world’s monitoring system is a big responsibility, and thanks to Redis we are up to the task. Join us as we discuss how the DataDog team monitors and scales Redis to power our SaaS based monitoring offering. We will discuss our usage and deployment patterns, as well as dive into monitoring best practices for production Redis workloads

Strategies for Context Data Persistence

FIWARE

This training camp teaches you how FIWARE technologies and iSHARE, brought together under the umbrella of the i4Trust initiative, can be combined to provide the means for creation of data spaces in which multiple organizations can exchange digital twin data in a trusted and efficient manner, collaborating in the development of innovative services based on data sharing and creating value out of the data they share. SMEs and Digital Innovation Hubs (DIHs) will be equipped with the necessary know-how to use the i4Trust framework for creating data spaces!

AWS Lambda and Serverless framework: lessons learned while building a serverl...

Luciano Mammino

Planet9energy.com is a new electricity company that is building a sophisticated analytics and energy trading platform for the UK market. Since the earliest days of the company we took the unconventional decision to go serverless and finally we are building the product on top of AWS Lambda and the Serverless framework using Node.js. In this talk we will discuss why we took this radical decision, what are the pros and cons of this approach and what are the main issues we faced as a tech team in our design and development experience. We will discuss how normal things like testing and deployment need to be re-thought to work on a serverless fashion but also the benefits of (almost) infinite auto-scalability and the piece of mind of not having to manage hundreds of servers. Finally we will underline how Node.js seems to fit naturally in this scenario and how it makes developing serverless applications extremely convenient. Thanks to Padraig O'Brien and Luciano Mammino for speaking this month. Speakers Bio: Padraig O'Brien Podge @Podgeypoos79 is a software engineer for over 15 years, most of that was spent developing in .NET and SQL Server, designing and building large scale data intensive applications. Lately he has shifted towards open source technologies and is spending most of his time learning Node.js, Scala and cool data tech like Spark, Cassandra. He is also working on a “super-secret” project called UnicornDB, don’t tell anybody! In his spare time he helps out with organising some meetups like NodeSchool Dublin, NodeSchool Dun Laoghaire and teaching Kanban via Agile Lean Ireland. Luciano Mammino Luciano @loige is a Software Engineer born in 1987, the same year that the Nintendo released “Super Mario Bros” in Europe, which, “by chance” is his favourite game! His primary passion is code and he is extremely fascinated by the web, smart apps and everything that's creative like music, art and design. He started coding at the age of 12 using his father's old i386 provided only with DOS and the qBasic interpreter.He is a senior software developer at Planet9Energy in Dublin and he loves JavaScript (React/Node.js). He is also the co-author of "Node.js design patterns" 2nd edition (Packt, http://amzn.to/1ZF279B). Hosted by Intercom, sponsored by Nearform and organised by Node.js Dublin (https://www.meetup.com/Dublin-Node-js-Meetup/events/236870576/)

Summit2013 eventos onto quadSemantic Technology Institute International

TechEd NZ 2014: Azure and Sharepoint

Intergen

TechEd NZ 2014 - DCIM211 - Aben Samuel This session with take IT Pros, Managers through various aspects of Azure, but with a focus on SharePoint and how organizations should be looking at Azure with regards to: 1. Hybrid Approach 2. Complete Warm SharePoint Platform 3. Disaster Recovery , Business Continuity The session would also look into some of the newer features that have been made available recently and also look into some of the experiences with deploying SharePoint implementations on Azure.

Time Series Analytics Azure ADX

Riccardo Zamana

韓国オンラインゲームから学ぶアドホックなビックデータ分析

Daisuke Masubuchi

db tech showcase 2019 https://www.db-tech-showcase.com/dbts/tokyo にて、登壇させていただいた際の資料です。お話しできなかったAppendix もあるのでぜひチェックしてみてください。スピーカー Jisun Kim・Daisuke Masubuchi 「Azure Data Explorer」（ADX）は、Azure管理用の内部ツールを商用に公開したバージョンで、毎日1兆以上のイベントと数百TBから数千TBを超えるデータを処理します。Preview公開後は、大手の石油会社や韓国のオンラインゲーム企業がこの基盤を利用してデータ分析をはじめました。長年韓国のゲーム会社でデータアナリストをしてきたJisunと、日本のゲーム会社様をお手伝いしている増渕から、ADXを使用したゲームのリアルタイム分析がどのように変わるのか、パイプライン作成、クエリの活用、分析の技術的なヒントを実際のツールとデモを行いました。

Girish Juneja - Intel Big Data & Cloud Summit 2013IntelAPAC

Using Event Streams in Serverless Applications

Jonathan Dee

Android Lollipop: The developer's perspective

Sebastian Vieira

Netflix Playback Data Systems Team and Job Overview

Suudhan Rangarajan

Google BigQuery for Everyday Developer

Márton Kodok

IV. IT&C Innovation Conference - October 2016 - Sovata, Romania A. Every scientist who needs big data analytics to save millions of lives should have that power Legacy systems don’t provide the power. B. The simple fact is that you are brilliant but your brilliant ideas require complex analytics. Traditional solutions are not applicable. The Plan: have oversight over developments as they happen. Goal: Store everything accessible by SQL immediately. What is BigQuery? Analytics-as-a-Service - Data Warehouse in the Cloud Fully-Managed by Google (US or EU zone) Scales into Petabytes Ridiculously fast Decent pricing (queries $5/TB, storage: $20/TB) *October 2016 pricing 100.000 rows / sec Streaming API Open Interfaces (Web UI, BQ command line tool, REST, ODBC) Familiar DB Structure (table, views, record, nested, JSON) Convenience of SQL + Javascript UDF (User Defined Functions) Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors Client libraries available in YFL (your favorite languages) Our benefits no provisioning/deploy no running out of resources no more focus on large scale execution plan no need to re-implement tricky concepts (time windows / join streams) pay only the columns we have in your queries run raw ad-hoc queries (either by analysts/sales or Devs) no more throwing away-, expiring-, aggregating old data.

My past-3 yeas-developer-journey-at-linkedin-by-iantsai

Kim Kao

Similar to Building a Real-Time Gaming Analytics Service with Apache Druid (20)

Game Analytics at London Apache Druid Meetup

RedisConf18 - Video Experience Operational Insights in Real Time.

Webinar slides: Free Monitoring (on Steroids) for MySQL, MariaDB, PostgreSQL ...

Building a data pipeline to ingest data into Hadoop in minutes using Streamse...

Sprint 45 review

MongoDB Sharding Webinar 2014

#TwitterRealTime - Real time processing @twitter

Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Strategies for Context Data Persistence

AWS Lambda and Serverless framework: lessons learned while building a serverl...

Summit2013 eventos onto quad

TechEd NZ 2014: Azure and Sharepoint

Time Series Analytics Azure ADX

韓国オンラインゲームから学ぶアドホックなビックデータ分析

Girish Juneja - Intel Big Data & Cloud Summit 2013

Using Event Streams in Serverless Applications

Android Lollipop: The developer's perspective

Netflix Playback Data Systems Team and Job Overview

Google BigQuery for Everyday Developer

My past-3 yeas-developer-journey-at-linkedin-by-iantsai

More from Imply

Pivot 2.0 - The next generation visualization tool for your streaming data

Imply

We have rearchitected Pivot from the ground up for enhanced dimensional analysis while ensuring that it is even faster, if that was even possible. Pivot 2.0 has plenty of new ways for you to visualize your data so that you can figure out the complex relationships between your data and enhanced the comparative analysis to quickly gain insight. In this webinar, will walk you through the exciting new features that are coming soon to Pivot.

Druid Adoption Tips and Tricks

Imply

Peter Marshall, Technology Evangelist at Imply Abstract: Apache Druid® can revolutionise business decision-making with a view of the freshest of fresh data in web, mobile, desktop, and data science notebooks. In this talk, we look at key activities to integrate into Apache Druid POCs, discussing common hurdles and signposting to important information. Bio: Peter Marshall (https://petermarshall.io) is an Apache Druid Technology Evangelist at Imply (http://imply.io/), a company founded by original developers of Apache Druid. He has 20 years architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.

Druid in Spot Instances

Imply

Nicolas Trésegnie, Chief Architect at SuperAwesome Abstract: SuperAwesome's mission is to make the internet safer for kids. At the core of SuperAwesome's analytics is Druid. In this talk, we walk through how we run Druid on spot instances. We explain the consequences in terms of cost and reliability, how we managed to build a reliable system despite the risks, and how you could do the same. Nicolas works as Chief Architect at SuperAwesome, where is is looking after the overall architecture of the systems and the infrastructure. He is all about automation and how technology can be used to achieve business goals. Nicolas studied Computer Science and Bioinformatics, and he is now pursuing an MBA at Imperial.

Zeotap: Data Modeling in Druid for Non temporal and Nested Data

Imply

Druid has been the production workhorse for the past 2+ years at Zeotap powering the core Audience planning across our Connect and Targeting products. Though Druid is best suited for data having time as a dimension as it partitions data based on time first, we have used Druid to serve ML powered enhanced insights and Estimation of potential dataset sizes, to assist us with our core business case of Audience planning. These are datasets without timestamp a.k.a non-temporal with high scale and having nested dimensions. These have been achieved using nuanced data modelling to store the data sets and achieve millisecond latency retrieval on top of the same. The core of the presentation would be on the data modelling journey to achieve these use cases detailing the query access patterns. We also delve upon the architecture - ingestion into druid sink and processing including ML. In the end we go over the production setup and configurations and provide the performance tunings applied. The presentation would have the following heads: The presentation would have the following heads * Business case in Ad-Tech and Mar-Tech vertical * Audience Planner Usecase 1 - Insights -Lambda Architecture and data flow -Deep dive on data model -Takeaways *Audience Planner Usecase 2 - Estimator -Architecture and data flow -Stratified sampling explained -Data model to solve nested data - deep dive -Takeaways *Audience Planner Usecase 3 - Skew correction -Skew correction model -Query Access -Data model in Druid to accommodate output from ML models -Takeaways *Production setup, config and Tunings *Production Operation experience takeaways

Nielsen: Casting the Spell - Druid in Practice

Imply

At Nielsen Identity, we leverage Druid to provide our customers with real-time analytics tools for various use-cases, including in-flight analytics, reporting and building target audiences. The common challenge of these use-cases is counting distinct elements in real-time at scale. We’ve been using Druid to solve these problems for the past 4 years, and gained a lot of experience with it. In this talk, we will share some of the best practices and tips we’ve gathered over the years, including: *Data modeling *Ingestion *Retention and deletion *Query optimization

Building Data Applications with Apache Druid

Imply

One of the most popular use cases for Apache Druid is building data applications. Data applications exist to deliver data into the hands of everyone on a team in a business, and are used by these teams to make faster, better decisions. To fulfill this role, they need to support granular drill down, because the devil is in the details, but also be extremely fast, because otherwise people won't use them! In this talk, Gian Merlino will cover: *The unique technical challenges of powering data-driven applications *What attributes of Druid make it a good platform for data applications *Some real-world data applications powered by Druid

Maximizing Apache Druid performance: Beyond the basics

Imply

Druid is a powerful real-time database, and part of that power is the level of control you get over cluster configuration, allowing you to get maximum performance for your specific data and query types. In this talk, Gian Merlino, one of the original authors of Druid and CTO and co-founder of Imply, will walk you through some advanced techniques that can provide a multiplier to your Druid performance. Afterwards, he’ll take your questions about performance, or anything else Druid-related.

Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C...

Imply

Target is one of the largest retailers in the United States, with brick-and-mortar stores in all 50 states and one of the most-visited ecommerce sites in the country. In addition to typical merchandising functions like assortment planning, pricing and inventory management, Target also operates a large supply chain, financial/banking operations and property management organizations. As a data-driven organization, we need a data analytics platform that can address the unique needs of each of these various business units, while scaling to hundreds of thousands of users and accommodating an ever-increasing amount of data. In this talk we’ll cover why Target chose to create our own analytics platform and specifically how Druid makes this platform successful. We’ll cover how we utilize key features in Druid, such as union datasources, arbitrary granularities, real-time ingestion, complex aggregation expressions and lightning-fast query response to provide analytics to users at all levels of the organization. We’ll also cover how Druid’s speed and flexibility allow us to provide interactive analytics to front-line, edge-of-business consumers to address hundreds of unique use-cases across several business units.

How TrafficGuard uses Druid to Fight Ad Fraud and Bots

Imply

In this session, TrafficGuard’s Head of Data Science, Raigon Jolly, will discuss how TrafficGuard uses Druid and its partnership with Imply to: - Provide granular reporting to clients in near-real time - Monitor rules and concept drift - Staying ahead of the moving target that is ad fraud - Facilitate performance tuning and right-sizing infrastructure so our team can focus on innovation of our core product

Apache Druid: Lightning Fast Analytics on Real-time and Historical Data (Atla...

Imply

Talk abstract: Users are demanding access to large, multi-petabyte, multi-dimension, real-time datasets to answer business critical questions. Providing a self-service interface that meets the performance expectations of these users can be challenging. Enter Apache Druid: an open source analytics database powering real-time, ad hoc, lightning fast analytics. It is used for clickstream analytics, network telemetry, fraud detection, application monitoring and so much more by companies like Apple, Netflix, Twitter, and AirBnb. Druid can ingest millions of records per second and deliver sub-second response times on OLAP-style slice and dice queries. In this talk, we will start with an overview of Apache Druid followed by a look at several examples of how Druid is being used in the real-world. We'll finish up with Q&A and some virtual networking. Speaker Bio: Mike McLaughlin is a senior field engineer at Imply. He helps customers run and optimize Apache Druid in production. He has 20 years experience developing, architecting, and deploying software.

August meetup - All about Apache Druid

Imply

Benchmarking Apache Druid

Imply

Druid: Under the Covers (Virtual Meetup)

Imply

Apache Druid ingests and enables instant query on many billions of events in real-time. But how? In this talk, each of the components of an Apache Druid cluster is described – along with the data and query optimisations at its core – that unlock fresh, fast data for all. Bio: Peter Marshall (https://linkedin.com/in/amillionbytes/) leads outreach and engineering across Europe for Imply (http://imply.io/), a company founded by the original developers of Apache Druid. He has 20 years architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA (hons) degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.

Why data warehouses cannot support hot analytics

Imply

Check out the full webinar: https://imply.io/videos/why-data-warehouses-cannot-support-hot-analytics Today’s data warehouses - whether traditional, specialized or cloud-based - are good at supporting cold analytics, such as reporting, where query times can take minutes. But they cannot cost-effectively support hot analytics—interactive ad hoc analytics usually performed by larger groups of users against batch or streaming data. Examples of hot analytics include clickstream analytics; service, network and application performance monitoring; and risk analytics. Data warehouses struggle with hot analytics use cases because they are too slow, unable to scale, or too expensive. Learn how a new class of real-time data platforms overcome these limitations, and how companies implement a “temperature-based” approach to analytics.

What’s New in Imply 3.3 & Apache Druid 0.18

Imply

Check out the webinar: https://imply.io/videos/whats-new-imply-3-3-apache-druid-0-18 The most recent Imply 3.3 release, based on Apache 0.18 brings several major new features, including joins, query laning and Clarity Alerts. These new features deliver increased design flexibility during design, and provide improved ingestion performance, and sub-second response times to help accelerate data warehouse and data lake deployments, and add real-time analytics in general.

Apache Druid Vision and Roadmap

Imply

Analytics over Terabytes of Data at Twitter

Imply

MoPub, a Twitter company, provides monetization solutions for mobile app publishers and developers around the globe. MoPub receives over 33 Billion ad requests per day generating over 200TB of raw logs every day. We built MoPub Analytics as the analytics platform, using Druid + Imply for our end users who are Publishers, Demand side partners and Internal users. We will talk about the architecture of the analytics platform, our Druid cluster setup, hardware choices, monitoring, use cases, limiting factors, challenges with lookups and solutions we used. Watch video:https://imply.io/virtual-druid-summit/analytics-over-terabytes-of-data-at-twitter-apache-druid

More from Imply (17)

Pivot 2.0 - The next generation visualization tool for your streaming data

Druid Adoption Tips and Tricks

Druid in Spot Instances

Zeotap: Data Modeling in Druid for Non temporal and Nested Data

Nielsen: Casting the Spell - Druid in Practice

Building Data Applications with Apache Druid

Maximizing Apache Druid performance: Beyond the basics

Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C...

How TrafficGuard uses Druid to Fight Ad Fraud and Bots

Apache Druid: Lightning Fast Analytics on Real-time and Historical Data (Atla...

August meetup - All about Apache Druid

Benchmarking Apache Druid

Druid: Under the Covers (Virtual Meetup)

Why data warehouses cannot support hot analytics

What’s New in Imply 3.3 & Apache Druid 0.18

Apache Druid Vision and Roadmap

Analytics over Terabytes of Data at Twitter

Recently uploaded

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...

Jeffrey Haguewood

Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows. We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases. This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams. Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.

Mission to Decommission: Importance of Decommissioning Products to Increase E...

Product School

Generating a custom Ruby SDK for your web service or Rails API using Smithy

g2nightmarescribd

Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.

Accelerate your Kubernetes clusters with Varnish Caching

Thijs Feryn

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...

Product School

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf

FIDO Alliance

GraphRAG is All You need? LLM & Knowledge Graph

Guy Korland

Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs. 1. Unifying Large Language Models and Knowledge Graphs: A Roadmap. https://arxiv.org/abs/2306.08302 2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs: https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

DevOps and Testing slides at DASA Connect

Kari Kakkonen

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...

DanBrown980551

Do you want to learn how to model and simulate an electrical network from scratch in under an hour? Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)! During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook. PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides: - A fully editable and extendable library for grid component modelling; - Visualization tools to display your network; - Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses; The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well. What you will learn during the webinar: - For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills; - For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.

Securing your Kubernetes cluster_ a step-by-step guide to success !

KatiaHIMEUR1

Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster. However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks. In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.

State of ICS and IoT Cyber Threat Landscape Report 2024 preview

Prayukth K V

The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development. The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers: State of global ICS asset and network exposure Sectoral targets and attacks as well as the cost of ransom Global APT activity, AI usage, actor and tactic profiles, and implications Rise in volumes of AI-powered cyberattacks Major cyber events in 2024 Malware and malicious payload trends Cyberattack types and targets Vulnerability exploit attempts on CVEs Attacks on counties – USA Expansion of bot farms – how, where, and why In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East Why are attacks on smart factories rising? Cyber risk predictions Axis of attacks – Europe Systemic attacks in the Middle East Download the full report from here: https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...

Product School

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf

FIDO Alliance

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...

UiPathCommunity

💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™: See how to accelerate model training and optimize model performance with active learning Learn about the latest enhancements to out-of-the-box document processing – with little to no training required Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath. Speakers: 👨‍🏫 Andras Palfi, Senior Product Manager, UiPath 👩‍🏫 Lenka Dulovicova, Product Program Manager, UiPath

UiPath Test Automation using UiPath Test Suite series, part 4

DianaGray10

Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap. The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies. Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques What will you get from this session? 1. Insights into SAP testing best practices 2. Heatmap utilization for testing 3. Optimization of testing processes 4. Demo Topics covered: Execution from the test manager Orchestrator execution result Defect reporting SAP heatmap example with demo Speaker: Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP

From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...

Product School

Connector Corner: Automate dynamic content and events by pushing a button

DianaGray10

Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to: Create a campaign using Mailchimp with merge tags/fields Send an interactive Slack channel message (using buttons) Have the message received by managers and peers along with a test email for review But there’s more: In a second workflow supporting the same use case, you’ll see: Your campaign sent to target colleagues for approval If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team But—if the “Reject” button is pushed, colleagues will be alerted via Slack message Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors. And... Speakers: Akshay Agnihotri, Product Manager Charlie Greenberg, Host

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...

Ramesh Iyer

In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...

Sri Ambati

To Graph or Not to Graph Knowledge Graph Architectures and LLMs

Paul Groth

Recently uploaded (20)

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...

Mission to Decommission: Importance of Decommissioning Products to Increase E...

Generating a custom Ruby SDK for your web service or Rails API using Smithy

Accelerate your Kubernetes clusters with Varnish Caching

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf

GraphRAG is All You need? LLM & Knowledge Graph

DevOps and Testing slides at DASA Connect

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...

Securing your Kubernetes cluster_ a step-by-step guide to success !

State of ICS and IoT Cyber Threat Landscape Report 2024 preview

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...

UiPath Test Automation using UiPath Test Suite series, part 4

From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...

Connector Corner: Automate dynamic content and events by pushing a button

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...

To Graph or Not to Graph Knowledge Graph Architectures and LLMs

Building a Real-Time Gaming Analytics Service with Apache Druid

1. Building a Real-Time analytics service with Apache Druid Virtual Druid Summit October 2020 Ramón Lastres Guerrero, Director of Engineering, GameAnalytics 1

2. Agenda ➢ ➢ ➢ ➢ ○ ○ ○ ○ ○ ○ ➢

3. Introduction to GameAnalytics user behaviour analytics focused on just gaming SDKs Rest API https://gameanalytics.com/docs/item/rest-api-doc results in real-time and also historical aggregate

4. Introduction to GameAnalytics 150M+ 25,000+ 19B+ JSON1.7B+ 1 TB

5. 25,000 Daily Active Games

6. Analytics for 90,000 Game Developers

7. Key Performance Indicators: Player Retention

8. Interactive Filtering

9. Interactive Filtering

10. Technical Requirements high level technical requirements ● (responsive Frontend) ● real time queries ● Reliability ● infrastructure cost ● ﬂexible querying / ﬁltering ● number of unique users

11. Backend Overview three main components ● ● ●

12. Data Collection

13. Data Annotation System

14. Aggregation and Reporting: Druid s3

15. Druid: Batch Ingestion Coordination

16. Druid: Query Layer build our own query layer ● deﬁne metrics on backend side ● Implement authentication ● caching, query priorities, rate limiting Elixir language Druid client for Elixir

17. Druid: Imply Pivot Imply Pivot

18. A / B testing and Druid What is A / B testing?

19. A / B testing and Druid ● ● ●

20. A / B testing and Druid

21. A / B testing and Druid real-time result metrics in real time probabilistic model variants are just Druid dimensions

22. Druid: Cluster Topology Imply Cloud ● ● ●

23. Druid: Performance numbers multi-tenancy 75k queries per hour rollup DAU 1.4k

24. Druid: Performance numbers

25. Hash partitioning VS single dimension partitioning game_id (our tenant id) dimension unstable EMR ingestion hashed partitioning

26. Druid: Query Layer Caching always implement good caching

27. Annotation System annotation service

28. Annotation System SDK attribution partners

29. Annotation System: Calculating Player Retention retention calculation increases the size by ~30% installation timestamp (truncated to day)

30. Single Datasource VS Multiple Datasources one single Kinesis stream high cardinality low cardinality reduce number of rows processed

31. Single Datasource VS Multiple Datasources daily Datasource Daily Size Daily segments Avg. Segment size Small ~ 8.5GB ~ 10 ~ 550MB Reduced ~ 50GB ~ 75 ~ 550MB Full ~ 290GB ~ 450 ~ 550MB

32. Single Datasource VS Multiple Datasources ● ● ● ● ● ●

33. Single Datasource VS Multiple Datasources

34. Single Datasource VS Multiple Datasources

35. Druid: Tiering leverage tiering use it to lower costs serving more frequently accessed data with more powerful hardware

36. Druid: Lookups joins with data stored outside of Druid using lookups we can query on studio and organization level

37. Time for questions @gameanalytics 37 Thank you! Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org. Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.

38. Dates: November 10, 2020 druidsummit.org 38 Register Now for the Next Druid Virtual Summit

Building a Real-Time Gaming Analytics Service with Apache Druid

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Building a Real-Time Gaming Analytics Service with Apache Druid

Similar to Building a Real-Time Gaming Analytics Service with Apache Druid (20)

More from Imply

More from Imply (17)

Recently uploaded

Recently uploaded (20)

Building a Real-Time Gaming Analytics Service with Apache Druid