Riot Games uses Hadoop to improve the player experience and track game metrics like Teemo deaths. They initially used a single MySQL database that struggled as the player base grew exponentially. Hadoop allowed them to collect vast amounts of structured and unstructured data from multiple sources globally. This data helps optimize matchmaking, improve client performance across varying hardware, and react quickly to game changes. Their goals are to provide self-service analytics, ingest more real-time data, and gain deeper global insights to continuously enhance the player experience.
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp... (StampedeCon)
At the StampedeCon 2013 Big Data conference in St. Louis, Riot Games discussed Using Hadoop to Understand and Improve Player Experience. Riot Games aims to be the most player-focused game company in the world. To fulfill that mission, it’s vital we develop a deep, detailed understanding of players’ experiences. This is particularly challenging since our debut title, League of Legends, is one of the most played video games in the world, with more than 32 million active monthly players across the globe. In this presentation, we’ll discuss several use cases where we sought to understand and improve the player experience, the challenges we faced to solve those use cases, and the big data infrastructure that supports our capability to provide continued insight.
Riot Games - Player Focused Pipeline - Stampedecon 2015 (sean_seannery)
In this talk from StampedeCon 2015 we tell the story of how Riot Games' big data platform has evolved over the last couple of years. We highlight some pain we experienced as we iterated, and provide our top five suggestions for how to avoid that pain in your own ecosystem.
Building A Player Focused Data Pipeline at Riot Games - StampedeCon 2015 (StampedeCon)
At the StampedeCon 2015 Big Data Conference: Riot Games’ mission statement is to become the most player focused company in the world. With over 67 million players battling on the fields of justice every month, League of Legends generates more than 45 terabytes of data on a daily basis. From game events to store transactions, data comes in from thousands of sources around the world. The big data engineering team at Riot Games is responsible for collecting this data and exposing it through a variety of tools to assist in delivering value to the players. This talk will span the past, present, and future of our data ecosystem, covering the reasons behind the decisions we made and the lessons we learned along the way.
(BDT318) How Netflix Handles Up To 8 Million Events Per Second (Amazon Web Services)
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of achieving zero loss while processing over 400 billion events daily. It then details how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events and 17 GB per second at peak.
Batch Processing at Scale with Flink & Iceberg (Flink Forward)
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by Andreas Hailu
Near real-time statistical modeling and anomaly detection using Flink! (Flink Forward)
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet; the most important aspect of our platform is to detect outages and anomalies that have the potential to cause serious impact to customer applications and user experience. Automatic detection of such events at the lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient and low latency data pipelines in production using Flink we decided to take it up a notch; we leveraged Flink to build statistical models in near real-time and apply them to the incoming stream of events to detect anomalies! In this session we will deep dive into the design as well as discuss pitfalls and learnings while developing our real-time platform that leverages Debezium, Kafka, Flink, ElastiCache and DynamoDB to process events at scale!
by Kunal Umrigar & Balint Kurnasz
Apache Iceberg - A Table Format for Huge Analytic Datasets (Alluxio, Inc.)
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker: Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Across the globe energy systems are changing, creating unprecedented challenges for the organisations tasked with ensuring the lights stay on. In the UK, National Grid is facing shrinking margins, looming capacity shortages and unpredictable peaks and troughs in energy supply caused by increasing levels of renewable penetration. Open Energi uses its IoT technology to unlock demand-side capacity - from industrial equipment, co-generation and battery storage systems - creating a smarter grid; one that is cleaner, cheaper, more secure and more efficient.
I'll talk about how we use Apache NiFi to orchestrate and coordinate Machine Learning microservices that operate on streams of data coming from IoT devices, providing a layer of fault-tolerance and traceability. With built-in retry logic, backpressure and clustering, NiFi helps us keep hard problems away from our code. It comes with processors that integrate with our cloud provider of choice (Microsoft Azure), fitting seamlessly into our processing pipeline. Finally, its straightforward graphical interface makes it easy enough to use that any team member can step in and troubleshoot a flow with little training.
Apache Iceberg Presentation for the St. Louis Big Data IDEA (Adam Doyle)
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an open table format that works with Hive and Spark.
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise (DataWorks Summit)
In recent years, big data has moved from batch processing to stream-based processing since no one wants to wait hours or days to gain insights. Dozens of stream processing frameworks exist today and the same trend that occurred in the batch-based big data processing realm has taken place in the streaming world so that nearly every streaming framework now supports higher level relational operations.
On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architecture option for building your next generation ETL data pipeline in near real time. But what does it take to deploy and operationalize this in an enterprise production environment?
The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story?
We discuss the drivers and expected benefits of changing the existing event processing systems. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly when trying to adopt these technologies into the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier.
Speaker: Andrew Psaltis, Principal Solution Engineer, Hortonworks
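As a sense-check of the Structured Streaming leg described in that abstract, here is a minimal PySpark sketch; the broker address, topic, schema, and output paths are invented for illustration and are not from the talk:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("kafka-streaming-etl").getOrCreate()

# Assumed event schema for the example.
schema = StructType([
    StructField("event_type", StringType()),
    StructField("ts", LongType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
    .option("subscribe", "nifi-events")                # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# The checkpoint is what underpins Structured Streaming's end-to-end
# exactly-once guarantee with replayable sources and idempotent sinks.
(events.writeStream.format("parquet")
 .option("path", "/data/events")
 .option("checkpointLocation", "/data/checkpoints/events")
 .start())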
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy... (Databricks)
Building data products requires a Lambda Architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It has proven Spark impactful and useful in production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
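To illustrate the "same computation logic" idea in a hedged way (this is not AirStream's actual API, just a PySpark sketch with invented topic, table, and column names), one function can serve both the streaming and batch paths:

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def bookings_by_user(df: DataFrame) -> DataFrame:
    # Shared business logic, applied identically in both modes.
    return df.filter(col("event_type") == "booking").select("user_id", "ts")

# Streaming path: a Kafka source (value parsing elided for brevity).
stream_raw = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())
# ...parse stream_raw into columns, then apply: bookings_by_user(parsed_stream)

# Batch path: the very same function over a Hive table.
batch_result = bookings_by_user(spark.table("warehouse.events"))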
Iceberg: A modern table format for big data (Strata NY 2018) (Ryan Blue)
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies a portable table format and standardizes many important features, including the following (a brief usage sketch appears after the list):
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
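A minimal sketch of a few of these features through the Spark integration that ships with the Iceberg runtime; the catalog name, warehouse path, and table identifier below are placeholders, not anything from the talk:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.demo.type", "hadoop")
         .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
         .getOrCreate())

# Schema evolution is a metadata-only commit; no data files are rewritten.
spark.sql("CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("ALTER TABLE demo.db.events ADD COLUMN region STRING")

# Reads plan against an immutable snapshot, so concurrent writers cannot
# corrupt a running query (snapshot isolation without locks).
spark.sql("SELECT * FROM demo.db.events").show()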
In this exclusive Premier Inside Out, you will hear from Druid committer Slim Bouguerra, Staff Software Engineer and Product Manager Will Xu. These Hortonworkers will explain the vision of these components, review new features, share some best practices and answer your questions.
View the webinar here: https://hortonworks.com/webinar/hortonworks-premier-apache-druid/
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Alongside the Hive Metastore, these table formats aim to solve problems that have long plagued traditional data lakes, with features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Apache Iceberg: An Architectural Look Under the Covers (ScyllaDB)
Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard, the Hive table format, addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design
Making Data Timelier and More Reliable with Lakehouse Technology (Matei Zaharia)
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
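To ground the "ACID transactional layer over cloud storage" claim, here is a small sketch using the open source delta-spark package; the paths and session configuration are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Each write commits atomically to a transaction log, so readers see either
# the old or the new version of the table, never a partial one.
df = spark.range(100).withColumnRenamed("id", "user_id")
df.write.format("delta").mode("overwrite").save("/tmp/lakehouse/users")

# The same table serves batch and streaming reads, which is what lets a
# lakehouse replace separate lake, warehouse, and streaming copies.
batch = spark.read.format("delta").load("/tmp/lakehouse/users")
stream = spark.readStream.format("delta").load("/tmp/lakehouse/users")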
This workshop will provide a hands-on introduction to simple event data processing and data flow processing using a Sandbox on students’ personal machines.
Format: A short introductory lecture to Apache NiFi and computing used in the lab followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Apache NiFi. In the lab, you will install and use Apache NiFi to collect, conduct and curate data-in-motion and data-at-rest with NiFi. You will learn how to connect and consume streaming sensor data, filter and transform the data and persist to multiple data sources.
Pre-requisites: Registrants must bring a laptop with the latest VirtualBox installed; an image for the Hortonworks DataFlow (HDF) Sandbox will be provided.
Speaker: Andy LoPresto
Airbyte @ Airflow Summit - The new modern data stack (Michel Tricot)
In this talk, I’ll describe how you can leverage 3 open-source standards - workflow management with Airflow, EL with Airbyte, transformation with dbt - to build your next modern data stack. I’ll explain how to configure your Airflow DAG to trigger Airbyte’s data replication jobs and dbt’s transformations, with a concrete use case.
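A minimal sketch of such a DAG, assuming the Airflow Airbyte provider package is installed and a dbt project lives on the worker; the connection IDs and paths are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG("el_then_transform", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # EL: trigger an existing Airbyte connection (a replication job).
    sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",           # Airflow connection (assumed)
        connection_id="REPLACE-WITH-CONNECTION-ID",  # Airbyte connection UUID
    )
    # T: run the dbt models once the raw data has landed.
    transform = BashOperator(task_id="dbt_run",
                             bash_command="dbt run --project-dir /opt/dbt")
    sync >> transform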
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with results that can be joined with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Real-time Analytics with Trino and Apache Pinot (Xiang Fu)
Trino Summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset (HostedbyConfluent)
Streaming data systems have been growing rapidly in importance to the modern data stack. Kafka’s ksqlDB provides an interface for analytic tools that speak SQL. Apache Superset, the most popular modern open-source visualization and analytics solution, plugs into nearly any data source that speaks SQL, including Kafka. Here, we review and compare methods for connecting Kafka to Superset to enable streaming analytics use cases including anomaly detection, operational monitoring, and online data integration.
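One plausible wiring (a sketch, not necessarily the method reviewed in the talk) is to register a stream and a continuously maintained aggregate through ksqlDB's REST API, which a SQL tool like Superset can then query; the host, topic, and column names are assumptions:

import json
import requests

KSQLDB = "http://localhost:8088/ksql"  # assumed ksqlDB server address

statements = """
    CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
        WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
    CREATE TABLE views_per_user AS
        SELECT user_id, COUNT(*) AS views
        FROM pageviews GROUP BY user_id EMIT CHANGES;
"""

# The /ksql endpoint accepts DDL/DML statements as a JSON payload.
resp = requests.post(
    KSQLDB,
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    data=json.dumps({"ksql": statements, "streamsProperties": {}}),
)
resp.raise_for_status()
print(resp.json())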
Frame - Feature Management for Productive Machine Learning (David Stein)
Presented at the ML Platforms Meetup at Pinterest HQ in San Francisco on August 16, 2018.
Abstract: At LinkedIn we observed that much of the complexity in our machine learning applications was in their feature preparation workflows. To address this problem, we built Frame, a shared virtual feature store that provides a unified abstraction layer for accessing features by name. Frame removes the need for feature consumers to deal directly with underlying data sources, which are often different across computing environments. By simplifying feature preparation, Frame has made ML applications at LinkedIn easier to build, modify, and understand.
Netflix’s Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots. (A brief snapshot-read sketch appears after this entry.)
In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg
Speaker: Ryan Blue, Software Engineer, Netflix
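To make the snapshot point concrete, here is a hedged PySpark sketch of pinning a read to a snapshot; the table identifier and snapshot ID are placeholders, and a configured Iceberg catalog is assumed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog is configured

# A normal read: the set of files a job sees is fixed at planning time,
# so concurrent writes cannot change results mid-job.
current = spark.read.format("iceberg").load("db.events")

# Time travel: pin the read to an earlier snapshot for reproducible runs.
pinned = (spark.read.format("iceberg")
          .option("snapshot-id", 1234567890123456789)  # hypothetical snapshot ID
          .load("db.events"))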
Running Apache Spark Jobs Using Kubernetes (Databricks)
Apache Spark has introduced a powerful engine for distributed data processing, providing unmatched capabilities to handle petabytes of data across multiple servers. Its capabilities and performance unseated other technologies in the Hadoop world, but while Spark provides a lot of power, it also comes with a high maintenance cost, which is why we now see innovations to simplify the Spark infrastructure.
Keep Your Cache Always Fresh with Debezium! with Gunnar Morling | Kafka Summi... (HostedbyConfluent)
The saying goes that there are only two hard things in Computer Science: cache invalidation, and naming things. Well, turns out the first one is solved actually ;)
Join us for this session to learn how to keep read views of your data in distributed caches close to your users, always kept in sync with your primary data stores via change data capture. You will learn how to:
- Implement a low-latency data pipeline for cache updates based on Debezium, Apache Kafka, and Infinispan
- Create denormalized views of your data using Kafka Streams and make them accessible via plain key look-ups from a cache cluster close by
- Propagate updates between cache clusters using cross-site replication
We'll also touch on some advanced concepts, such as detecting and rejecting writes to the system of record which are derived from outdated cached state, and show in a demo how all the pieces come together, of course connected via Apache Kafka.
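The session's stack is Kafka Streams and Infinispan; purely as a shape-of-the-pipeline sketch, here is a Python analogue that applies Debezium change events to a cache, using kafka-python, with a dict standing in for a real cache cluster (the topic name and envelope fields follow Debezium's JSON conventions):

import json
from kafka import KafkaConsumer

cache = {}  # stand-in for a distributed cache

consumer = KafkaConsumer(
    "dbserver1.inventory.customers",  # Debezium topic: server.schema.table
    bootstrap_servers="broker:9092",  # assumed address
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for msg in consumer:
    if msg.value is None:  # tombstone that follows a delete
        continue
    change = msg.value["payload"]
    key = json.loads(msg.key)["payload"]["id"]  # primary key from the event key
    if change["op"] == "d":   # delete -> invalidate the entry
        cache.pop(key, None)
    else:                     # create/update/snapshot read -> refresh it
        cache[key] = change["after"]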
Data ingestion and distribution with Apache NiFi (Lev Brailovskiy)
In this session, we will cover our experience working with Apache NiFi, an easy to use, powerful, and reliable system to process and distribute large volumes of data. The first part of the session will be an introduction to Apache NiFi, going over NiFi's main components, building blocks, and functionality.
In the second part of the session, we will show our use case for Apache NiFi and how it's being used inside our Data Processing infrastructure.
This talk will explore how we REBOOTED our Project Design. After a decade of production usage, the RavenDB team addressed a lot of ongoing concerns & changed some of RavenDB's core architecture.
We'll investigate the driving forces behind it, the reasoning process & look at how it all turned out.
Spil Games: outgrowing an internet startup (art-spilgames)
This presentation will explain how Spil Games has grown in a short time from an internet startup to a global online gaming company and how we currently are building a global cross datacenter storage solution with MySQL as its backend.
The first part will contain a short summary of where we started with our database engineering department (look ahead at most one week in time), to a more professionalized department (look ahead and plan three to four months in time) to currently growing out of the startup phase (look ahead and plan more than one year in time). This will be illustrated with some examples of the growing pains we encountered with scaling, replication and high availability and leading up to the conclusion that we need to acknowledge our problems and shortcomings to actually be able to overcome them.
The second part of the presentation will contain a comparison of our old architecture against the new architecture. In this new architecture we take into account that failure of a complete datacenter is certain to occur sometime and strive to give our users the best possible experience, even in worst case when data is inaccessible. We also introduce asynchronous calls which enable us to fire and forget most of our writes. The architecture is being built with MySQL, handler sockets, Erlang and Memcache as its building blocks.
Maximize Your Production Effort (English) (slantsixgames)
Efficient Content Authoring Tools and Pipeline for Inter-Studio Asset Development
With the complexity of today's video games and their associated tight timelines, it is paramount for video game studios to have a highly efficient content authoring process and production workflow. With a trend towards outsourced development of game assets, there are additional considerations that are important for achieving optimal workflow between studios that are co-developing or sharing assets. This lecture gives valuable insight into how to create new content authoring tools and data transformation pipelines that promote efficient work flow for both internal and remote production teams. Specific considerations for outsourcing and worldwide development are made along the way.
Improve your SQL workload with observability (OVHcloud)
Most of OVH's information systems rely on relational databases (PostgreSQL, MySQL, MariaDB). In terms of volume, that represents 400 databases holding more than 20 TB of data, spread across 60 clusters in two geographic zones, all powering 3,000 applications.
How do you see everything across our fleet? Better still, how do you let everyone follow the activity of their own database? That is the challenge we set ourselves; a year later, we can share our experience.
What if observability were not just a buzzword, but had a real impact on production?
The Strathclyde Poker Research Environment (Luke Dicken)
This presentation was given at AISB 2011 and introduces the Strathclyde Poker Research Environment (SPREE) an open tool for Poker research. Available from Sourceforge @ https://sourceforge.net/projects/spree-poker/
Greater consumer demands for performance, features, and graphics have forced design studios to keep pace in a rapidly-changing industry while simultaneously securing all their development assets — source code, binary data, requirements, documents, and more. In this webinar, Sven Erik Knop, Principal Solutions Engineer for Perforce Software, details the game development best practices achieved with Helix versioning engine for optimal game design. See firsthand what the best game studios already know.
Presentation on dogfooding data at Lyft by Mark Grover and Arup Malakar on Oct 25, 2017 at Big Analytics Meetup (https://www.meetup.com/SF-Big-Analytics/events/243896328/)
Wordnik's technical co-founder Tony Tam describes the reason for going NoSQL. During his talk Tony will discuss the selection criteria, testing + evaluation and successful, zero-downtime migration to MongoDB. Additionally details on Wordnik's speed and stability will be covered as well as how NoSQL technologies have changed the way Wordnik scales.
Yes, you read that right. This really is a detective story. Several, in fact. Each has a tangled plot, several acting (or inactive) characters, facts, and evidence. In this talk we will untangle all of these cases and walk the path from receiving information (accurate or not) to fully understanding and solving the problem, which will help you navigate similar situations and work with your database more effectively. I can't reveal more details right now (you understand), but I can promise it will be very interesting.
Exploring the world at scale, the challenges it poses, and how MongoDB can help address them. We will also explore the Black Swan problem at work in various industries and how one can use MongoDB to address IT challenges due to positive Black Swans, negative Black Swans, and standard use cases.
4. INTRO: THIS PRESENTATION IS ABOUT…
• A quick history of our data warehouse
• Our high level architecture
• Player experience use cases
• How Hadoop has enabled these use cases
• Changes we have made to our architecture to facilitate deeper insight at velocity
• Where we’re headed
5. INTRO: WHO IS RIOT GAMES?
• Developer and publisher of League of Legends
• Founded 2006 by gamers for gamers
• Player experience focused
  – Needless to say, data is pretty important to understanding the player experience!
6. INTRO: LEAGUE OF LEGENDS
• 12 MILLION daily active players
• 32 MILLION monthly active players
• 70 MILLION registered players
• 3 MILLION peak concurrent players
Numbers based on Riot Games data published October 2012.
8. HISTORY: INITIAL LAUNCH / SCRAPPY START-UP PHASE
• Had a single, dedicated MySQL instance for the DW
• Data was ETL’d from production slaves into this instance
• Queries were run in MySQL
• Reporting was done in Excel
• All ETLs, queries, and reporting were done by one person!
This worked great!
9. HISTORY: AND THEN – CRAZY GROWTH!!!!
[Chart: total active players (# unique logins) over time, Nov. 2011 through June 2012, with 4.2M marked on the curve]
10. HISTORY: THE BREAKING POINT
• The data warehouse reached a breaking point
  – 24 hours of data took 24.5 hours to ETL
• We couldn’t handle…
  – multiple environments in a vertical MySQL instance
  – a single environment in a vertical MySQL instance
• We needed to change!
11. HISTORY: INTRODUCTION OF HADOOP
• Hadoop has a number of great qualities!
  – Cost effective
  – Scalable
  – Open source
  – We could execute quickly!
12. HISTORY: HIGH LEVEL ARCHITECTURE, JUNE 2012
[Diagram: Audit, Plat, and LoL databases in North America, Europe, and Korea feed a Hive data warehouse via custom ETL + Sqoop; Pentaho loads aggregates into MySQL, which backs Tableau and Pentaho reporting for business analysts, while analysts also query Hive directly.]
13. HISTORY
BUT, THIS WASN'T GOOD ENOUGH
• We needed to improve on many levels:
  – Shorten time to insight
  – Increase depth of insight
  – Enable data analysis for client-side features
  – Flexible auditing framework
  – Log ingestion and analysis
  – International data infrastructure
19. CONTEXT
CLIENT FOOTPRINT
• As a AAA video game, a significant portion of our software runs directly on players' machines
  – High-performance graphics
  – Responsiveness
• There is logic in these components that is ONLY exercised on the client side
• Understanding the performance, reliability, and stability of these features is paramount to improving the player experience!
24. USE CASE #1
CHALLENGE: THE GAME IS ALIVE!
• The game is a living, breathing service that's always in motion
• Updated every 2–3 weeks:
  – New champions
  – New items
  – New effects/particles
  – Changes in environment
  – Changes in design and design balance
26. USE CASE #1
CHALLENGE: PC VARIABILITY
• Hardware and OS profiles are significantly different, even within regions!
  – OS and patch level
  – CPU
  – Memory
  – Video card
  – Video card memory
  – Drivers!
(A sketch of one such hardware-profile record follows below.)
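To make these dimensions concrete, here is a minimal, hypothetical sketch in Python of a single hardware-profile record gathered from a player's machine. The dataclass and its field names are illustrative assumptions, not Riot's actual telemetry schema.

from dataclasses import dataclass

# Hypothetical hardware-profile record; every field name here is an
# illustrative assumption, not Riot's real schema.
@dataclass
class HardwareProfile:
    os_name: str               # OS...
    os_patch_level: str        # ...and patch level
    cpu: str
    memory_mb: int
    video_card: str
    video_card_memory_mb: int
    driver_version: str        # drivers matter as much as the card itself!

profile = HardwareProfile(
    os_name="Windows 7", os_patch_level="SP1",
    cpu="Intel Core i5-2500", memory_mb=8192,
    video_card="GeForce GTX 560", video_card_memory_mb=1024,
    driver_version="301.42",
)
print(profile)

Aggregating millions of records like this one, per region, is what reveals which OS/CPU/GPU/driver combinations players actually run.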
29. USE CASE #1
IMPROVING THE PLAYER EXPERIENCE
• We need to gather information across all of these dimensions in order to UNDERSTAND the player experience
• We use this information to:
  – React quickly to changes
  – Optimize performance
  – Optimize designs
  – Improve our testing
    • Like creating our Compatibility Testing Lab!
34. USE CASE #1
HOW DID WE SOLVE THIS?
• We have an ARMY of TEEMOs watching players' machines through their telescopes?!?!?!
  – Not really, but we DID consider it!
36. USE CASE #1
HONU-CLIENT SDK

GAME_CLIENT_STATS (sample event):
  timestamp    source         app          pingAvg   serverId       system
  1234567890   99.123.456.78  game_client  220.9542  12.345.678.90  Intel64 …

SELECT avg(f['pingAvg']) FROM game_client_stats GROUP BY f['serverId'];
(A small worked illustration of this query follows below.)
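As a rough illustration of what that Hive query computes, here is a small Python sketch that performs the same per-server average over a few hypothetical GAME_CLIENT_STATS events. The string-keyed map "f" mirrors the f['…'] access in the query above; the event values are invented for the example.

from collections import defaultdict

# Hypothetical events; each event's fields live in a string-keyed map "f",
# mirroring f['...'] in the Hive query above. Values are invented.
events = [
    {"f": {"pingAvg": "220.9542", "serverId": "12.345.678.90"}},
    {"f": {"pingAvg": "35.1000",  "serverId": "12.345.678.90"}},
    {"f": {"pingAvg": "90.2500",  "serverId": "98.765.432.10"}},
]

# Equivalent of: SELECT avg(f['pingAvg']) FROM game_client_stats GROUP BY f['serverId'];
totals = defaultdict(lambda: [0.0, 0])  # serverId -> [sum of pings, count]
for event in events:
    server = event["f"]["serverId"]
    totals[server][0] += float(event["f"]["pingAvg"])
    totals[server][1] += 1

for server, (ping_sum, count) in totals.items():
    print(server, ping_sum / count)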
37. USE CASE #1
HONU-TOOLS
• DradisTestUI: web UI to send a message to Dradis directly, without any coding
• EchoService: web UI to easily and immediately visualize the data that has been sent to Honu collectors
38. USE CASE #1
HONU-COLLECTORS
• Each collector:
  – Collects events from multiple clients (Thrift/NIO)
  – Saves all events to one compressed file locally
  – Uploads that file to S3 every XX minutes
  – Sends a message to SQS for Demux
(A rough sketch of this buffer-compress-upload-notify loop follows below.)
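The real Honu collectors are Java, but the loop on this slide is easy to sketch. Below is a minimal, hypothetical Python version using boto3; the bucket name, queue URL, and flush interval are placeholder assumptions, not values from the talk.

import gzip
import json
import time
import uuid

import boto3  # assumes AWS credentials are configured in the environment

BUCKET = "example-honu-events"  # hypothetical bucket name
DEMUX_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-demux"  # hypothetical
FLUSH_INTERVAL_SECS = 300       # the "every XX minutes" from the slide

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

buffer = []  # events received from clients (Thrift/NIO in the real collector)

def flush():
    """Compress buffered events into one file, upload it to S3, notify Demux via SQS."""
    if not buffer:
        return
    key = "events/%d-%s.json.gz" % (int(time.time()), uuid.uuid4())
    body = gzip.compress("\n".join(json.dumps(e) for e in buffer).encode("utf-8"))
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    sqs.send_message(QueueUrl=DEMUX_QUEUE_URL,
                     MessageBody=json.dumps({"bucket": BUCKET, "key": key}))
    buffer.clear()

In a real collector, flush() would run on a timer every FLUSH_INTERVAL_SECS while a Thrift/NIO server keeps appending incoming events to the buffer.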
44. USE CASE #2
MATCHMAKING
• One of the most important features outside of gameplay
• Like a dating service, the objective is to match people up
• A number of different queues that players can line up in, depending on the type of match they're looking for

It is critical that this system is balanced and able to create good matches quickly.
45. USE CASE #2
MATCHING THE RIGHT PLAYERS
[Image-only slide]
46. USE CASE #2
IMPROVING THE EXPERIENCE
• We want to ensure that all players are having the best possible experience getting the matches they want
  – This is VERY challenging!
  – We're obviously always studying this and trying to improve it
• Recently, we've started combining client data with data we have about many other dimensions:
  – Queue times
  – Match quality
  – Player skill / matchmaking rating
• Obviously, we hope this will lead us to improvements we haven't identified as of yet
(A toy illustration of combining these dimensions follows below.)
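As a toy illustration of combining those dimensions, the Python sketch below rolls hypothetical per-match records up into a per-queue summary: average wait, average match quality, and average within-match MMR spread. The record fields and numbers are invented and are not Riot's matchmaking data model.

from statistics import mean, pstdev

# Invented per-match records combining the dimensions named on the slide.
matches = [
    {"queue": "ranked_solo", "queue_secs": 48, "quality": 0.91,
     "mmrs": [1510, 1495, 1502, 1488, 1520]},
    {"queue": "ranked_solo", "queue_secs": 75, "quality": 0.83,
     "mmrs": [1620, 1400, 1580, 1450, 1500]},
    {"queue": "normal_5v5",  "queue_secs": 21, "quality": 0.95,
     "mmrs": [1200, 1210, 1190, 1205, 1198]},
]

# Group matches by queue, then summarize each queue.
by_queue = {}
for m in matches:
    by_queue.setdefault(m["queue"], []).append(m)

for queue, ms in by_queue.items():
    print(queue,
          "avg wait (s):", round(mean(m["queue_secs"] for m in ms), 1),
          "avg quality:", round(mean(m["quality"] for m in ms), 3),
          "avg MMR spread:", round(mean(pstdev(m["mmrs"]) for m in ms), 1))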
47. USE CASE #2
HOW DO WE ACCOMPLISH THAT?
48. [Diagram, June 2012: in each of North America, Europe, and Korea, the Audit, Platform, and LoL databases feed that region's own tools & business process, with a central MySQL warehouse underneath.]
49. [Diagram: the Audit, Platform, and LoL databases in North America, Europe, and Korea now feed one centralized pipeline that drives dashboards and a single tools & business process layer.]
59. THE FUTURE
CONTINUE INCREASING VELOCITY

                        June 2012                  February 2013
  MySQL tables          180                        1,200
  Pipeline events/day   0                          2.5+ billion
  Workflows             Cronjob + Pentaho          Oozie
  Environment           Datacenter                 DC + AWS
  SLA                   1 day                      2 hours
  Event tracking        2+ weeks (DB update)       10 minutes
                        Dependencies: DBA, ETL,    Self-service
                        and Tools teams
                        Downtime (3h min.)         No downtime
60. THE FUTURE
OUR IMMEDIATE GOALS
• Self-service reporting
• Metadata management service
• Real-time aggregation pipeline
• Real-time slicing/dicing for non-critical data
• Log ingestion and analysis
• International data infrastructure
61. THE FUTURE
CHALLENGE: MAKE IT GLOBAL
• Data centers sit across the globe because latency has a huge effect on gameplay, so log data is scattered around the world
• Large presence in Asia; some areas (e.g., PH) have bandwidth challenges, or bandwidth is expensive
62. THE FUTURE
CHALLENGE: WE HAVE BIG DATA
• Structured data: 500 GB daily
• Application and operational logs: 4.5 TB daily
• Riot YouTube channel: 3MM subscribers, 448+MM views
• Plus chat logs, detailed gameplay event tracking, and so on…
63. THE FUTURE
OUR AUDACIOUS GOALS
Build a world-class data and analytics organization:
• Deeply understand players across the globe
• Apply that understanding to improve games for players
• Deeply understand our entire ecosystem, including social media

Have the ability to identify, understand, and react to meaningful trends in real time.

Have a deep, real-time understanding of our systems from player-experience and operational standpoints.
64. THE FUTURE
SHAMELESS HIRING PLUG
• Like most everybody else at this conference… we're hiring!
• The Riot Manifesto:
  – Player experience first
  – Challenge convention
  – Focus on talent and team
  – Take play seriously
  – Stay hungry, stay humble
65. THE FUTURE
SHAMELESS HIRING PLUG
And yes, you can play games at work. It's encouraged!
Speaker notes:
There were times with 20% month-over-month growth in a single environment. We went from 2 environments with ~200K CCU to 16 environments and 1.3 million CCU in the space of 12 months. Resources were focused on getting our operational systems to scale along with demand.
Today, we’re going to focus on a few of these.
Before we talk about our first use case, we need to give you a little bit of context about the game and gameplay (at a super high level), as well as a quick overview of some of the pieces of the LoL architecture. Session-based team play; the basic idea is "kill the other team's nexus (base)" – it's a MOBA! If you die, you re-spawn after a certain amount of time (and that time grows as the game progresses). There's lots of strategy to the game.
Each player "summons" a Champion that they play. Each champion has very different abilities.
"Typical" hardware profiles differ significantly from region to region. We want to make sure that all regions have a good experience. We also want to understand latency from various locations in the world to our installs.
There are lots of customizable settings around resolution and graphics quality that players can change. All of these settings have a potential impact on performance.
So given that we have all of these challenges, what are we doing to improve the player experience?