Riot Games uses Hadoop and Hive to analyze massive amounts of player data from League of Legends. They ingest data daily from MySQL into Hive for analytics and build cubes in MySQL for Tableau visualization. Two key use cases are analyzing champion balance and matchmaking across regions. Regional metrics require joining large tables and IP lookups, which Hive initially struggled with due to lack of indexing. Riot leveraged open-source GeoIP UDFs to enable these regional queries in Hive.
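As a rough illustration of that pattern, the sketch below shows how a regional roll-up might be issued against Hive from Python. It is not Riot's code: the table, columns, jar path, and the geoip_country UDF name are all hypothetical, and it assumes the PyHive client plus a GeoIP UDF jar already available on the cluster.

```python
# Hypothetical sketch: querying Hive with a GeoIP-style UDF via PyHive.
# Host, table, columns, jar, and UDF class are illustrative placeholders.
from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Register a (hypothetical) GeoIP UDF shipped as a jar on the cluster.
cursor.execute("ADD JAR hdfs:///udfs/geoip-udf.jar")
cursor.execute(
    "CREATE TEMPORARY FUNCTION geoip_country AS 'com.example.hive.udf.GeoIpCountry'"
)

# Roll up daily logins by country without needing an index on ip_address.
cursor.execute("""
    SELECT geoip_country(ip_address) AS country, COUNT(*) AS logins
    FROM player_logins
    WHERE ds = '2013-05-01'
    GROUP BY geoip_country(ip_address)
""")
for country, logins in cursor.fetchall():
    print(country, logins)
```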
Big Data at Riot Games – Using Hadoop to Understand Player Experience (StampedeCon)
At the StampedeCon 2013 Big Data conference in St. Louis, Riot Games discussed Using Hadoop to Understand and Improve Player Experience. Riot Games aims to be the most player-focused game company in the world. To fulfill that mission, it’s vital we develop a deep, detailed understanding of players’ experiences. This is particularly challenging since our debut title, League of Legends, is one of the most played video games in the world, with more than 32 million active monthly players across the globe. In this presentation, we’ll discuss several use cases where we sought to understand and improve the player experience, the challenges we faced to solve those use cases, and the big data infrastructure that supports our capability to provide continued insight.
Riot Games - Player Focused Pipeline - StampedeCon 2015 (sean_seannery)
In this talk from StampedeCon 2015, we tell the story of how Riot Games' big data platform has evolved over the last couple of years. We highlight some of the pain we experienced as we iterated, and offer our top five suggestions for avoiding that pain in your own ecosystem.
Building A Player Focused Data Pipeline at Riot Games - StampedeCon 2015 (StampedeCon)
At the StampedeCon 2015 Big Data Conference: Riot Games’ mission statement is to become the most player focused company in the world. With over 67 million players battling on the fields of justice every month, League of Legends generates more than 45 terabytes of data on a daily basis. From game events to store transactions, data comes in from thousands of sources around the world. The big data engineering team at Riot Games is responsible for collecting this data and exposing it through a variety of tools to assist in delivering value to the players. This talk will span the past, present, and future of our data ecosystem, covering the reasons behind the decisions we made and the lessons we learned along the way.
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi (Timothy Spann)
A walkthrough of creating a dataflow that ingests Twitter data and analyzes the stream with NLTK VADER sentiment analysis and Inception v3 (TensorFlow), both invoked via Python in Apache NiFi, with storage in Hadoop HDFS.
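For readers unfamiliar with the VADER piece of that flow, here is a minimal, self-contained sketch of scoring a tweet's text. It is independent of NiFi and assumes only that the nltk package is installed; the sample tweet and thresholds are illustrative.

```python
# Minimal VADER sentiment scoring, as might run inside a NiFi ExecuteScript step.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

tweet = "Apache NiFi makes building dataflows surprisingly pleasant!"
scores = analyzer.polarity_scores(tweet)
# scores is a dict like {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.6}
label = "positive" if scores["compound"] >= 0.05 else (
    "negative" if scores["compound"] <= -0.05 else "neutral")
print(label, scores)
```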
Apache Iceberg - A Table Format for Huge Analytic Datasets (Alluxio, Inc.)
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
This presentation was created as an introduction to the Apache NiFi project; to be followed by “Lab 0” of the “Realtime Event Processing in Hadoop with NiFi, Kafka and Storm” tutorial hosted here: http://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_1
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, GetInData
Did you like it? Check out our E-book: Apache NiFi - A Complete Guide
https://ebook.getindata.com/apache-nifi-complete-guide
Apache NiFi is one of the most popular services for running ETL pipelines, even though it is no longer the youngest technology. The talk describes migrating pipelines from an old Hadoop platform to Kubernetes, managing everything as code, monitoring NiFi's corner cases, and making it a robust solution that is user-friendly even for non-programmers.
Author: Albert Lewandowski
Linkedin: https://www.linkedin.com/in/albert-lewandowski/
___
GetInData is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies, including Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Apache Flink 101 - the rise of stream processing and beyond (Bowen Li)
Apache Flink is the most popular and widely adopted stream processing framework, powering real-time stream computations at extremely large scale in companies like Uber, Lyft, AWS, Alibaba, Pinterest, Splunk, and Yelp, among others.
In this talk, we will go over use cases and basic (yet hard to achieve!) requirements of stream processing, and how Flink fills the gaps and stands out with some of its unique core building blocks, like pipelined execution, native event time support, state support, and fault tolerance.
We will also take a look at how Flink is going beyond stream processing into areas like unified data processing, enterprise integration, AI/machine learning (especially online ML), and serverless computation, and the distinct value Flink brings to each.
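As a tiny, hedged illustration of the "pipelined execution with per-key state" building blocks mentioned above (not the speaker's code; it assumes the apache-flink Python package is installed, and the sample events are made up), a keyed running count in PyFlink looks roughly like this:

```python
# Rough PyFlink sketch: a keyed running count, illustrating pipelined
# execution and per-key state. Assumes the apache-flink package is installed.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

events = env.from_collection(
    [("user_a", 1), ("user_b", 1), ("user_a", 1), ("user_a", 1)]
)

# key_by partitions the stream; reduce keeps per-key state across events.
counts = events.key_by(lambda e: e[0]) \
               .reduce(lambda a, b: (a[0], a[1] + b[1]))

counts.print()
env.execute("keyed-running-count")
```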
SPEAKER: Bowen Li
SPEAKER BIO: Bowen is a committer of Apache Flink, senior engineer at Alibaba, and host of Seattle Flink Meetup.
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache Spark™, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session:
How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse
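To make the "late-arriving data" point concrete, here is a hedged PySpark sketch of upserting late records into a Delta table with MERGE. The path, schema, and session configuration are illustrative, not the course materials; it assumes the delta-spark package and its jars are available to the Spark session.

```python
# Hedged sketch: upserting late-arriving records into a Delta Lake table.
# Paths, schema, and session config are illustrative; assumes delta-spark.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/lakehouse/events"

# Initial load of the table.
spark.createDataFrame([(1, "2024-01-01", 10.0)], ["id", "ds", "amount"]) \
     .write.format("delta").mode("overwrite").save(path)

# Late-arriving and corrected records are merged in instead of rewritten.
late = spark.createDataFrame([(1, "2024-01-01", 12.5), (2, "2024-01-01", 7.0)],
                             ["id", "ds", "amount"])
(DeltaTable.forPath(spark, path).alias("t")
    .merge(late.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

spark.read.format("delta").load(path).show()
```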
Best practices and lessons learnt from Running Apache NiFi at Renault (DataWorks Summit)
No real-time insight without real-time data ingestion. No real-time data ingestion without NiFi! Apache NiFi is an integrated platform for data flow management at the enterprise level, enabling companies to securely acquire, process and analyze disparate sources of information (sensors, logs, files, etc.) in real time. NiFi helps data engineers accelerate the development of data flows thanks to its UI and a large number of powerful off-the-shelf processors. However, with great power comes great responsibility. Behind the simplicity of NiFi, best practices must absolutely be respected in order to scale data flows in production and prevent sneaky situations. In this joint presentation, Hortonworks and Renault, a French car manufacturer, will present lessons learnt from real-world projects using Apache NiFi. We will present NiFi design patterns for achieving high performance and reliability at scale, as well as the processes to put in place around the technology for data flow governance. We will also show how these best practices can be implemented in practical use cases and scenarios.
Speakers
Kamelia Benchekroun, Data Lake Squad Lead, Renault Group
Abdelkrim Hadjidj, Solution Engineer, Hortonworks
Introduction: This workshop will provide a hands-on introduction to simple event data processing and data flow processing using a Sandbox on students’ personal machines.
Format: A short introductory lecture on Apache NiFi and the computing used in the lab, followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Apache NiFi. In the lab, you will install and use Apache NiFi to collect, conduct and curate data-in-motion and data-at-rest with NiFi. You will learn how to connect and consume streaming sensor data, filter and transform the data and persist to multiple data sources.
Iceberg: A modern table format for big data (Strata NY 2018) - Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including the following (a short illustrative sketch follows the list):
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
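A hedged PySpark sketch of a few of those properties might look like the following. The catalog name, warehouse path, and table are illustrative, and it assumes the iceberg-spark-runtime jar is on the Spark classpath; consult the Iceberg docs for your exact versions.

```python
# Hedged sketch: creating and evolving an Iceberg table from Spark.
# Catalog, warehouse path, and table names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.demo.type", "hadoop")
         .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
         .getOrCreate())

# Hidden partitioning: queries never need to know about the days(ts) transform.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP, level STRING)
    USING iceberg PARTITIONED BY (days(ts))
""")
spark.sql("INSERT INTO demo.db.events VALUES (1, TIMESTAMP '2019-11-07 10:00:00', 'INFO')")

# Full schema evolution: adding a column is a metadata-only change.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (message STRING)")

# Each commit produces a snapshot; reads are isolated to a single snapshot.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
```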
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy... (Databricks)
Building data products requires a Lambda Architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It has proven that Spark is impactful and useful in production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
Dataflow Management From Edge to Core with Apache NiFi (DataWorks Summit)
What is “dataflow?” — the process and tooling around gathering necessary information and getting it into a useful form to make insights available. Dataflow needs change rapidly — what was noise yesterday may be crucial data today, an API endpoint changes, or a service switches from producing CSV to JSON or Avro. In addition, developers may need to design a flow in a sandbox and deploy to QA or production — and those database passwords aren’t the same (hopefully). Learn about Apache NiFi — a robust and secure framework for dataflow development and monitoring.
Abstract: Identifying, collecting, securing, filtering, prioritizing, transforming, and transporting abstract data is a challenge faced by every organization. Apache NiFi and MiNiFi allow developers to create and refine dataflows with ease and ensure that their critical content is routed, transformed, validated, and delivered across global networks. Learn how the framework enables rapid development of flows, live monitoring and auditing, data protection and sharing. From IoT and machine interaction to log collection, NiFi can scale to meet the needs of your organization. Able to handle both small event messages and “big data” on the scale of terabytes per day, NiFi will provide a platform which lets both engineers and non-technical domain experts collaborate to solve the ingest and storage problems that have plagued enterprises.
Expected prior knowledge / intended audience: developers and data flow managers should be interested in learning about and improving their dataflow problems. The intended audience does not need experience in designing and modifying data flows.
Takeaways: Attendees will gain an understanding of dataflow concepts, data management processes, and flow management (including versioning, rollbacks, promotion between deployment environments, and various backing implementations).
Current uses: I am a committer and PMC member for the Apache NiFi, MiNiFi, and NiFi Registry projects and help numerous users deploy these tools to collect data from an incredibly diverse array of endpoints, aggregate, prioritize, filter, transform, and secure this data, and generate actionable insight from it. Current users of these platforms include many Fortune 100 companies, governments, startups, and individual users across fields like telecommunications, finance, healthcare, automotive, aerospace, and oil & gas, with use cases like fraud detection, logistics management, supply chain management, machine learning, IoT gateway, connected vehicles, smart grids, etc.
High-speed Database Throughput Using Apache Arrow Flight SQL (ScyllaDB)
Flight SQL is a revolutionary new open database protocol designed for modern architectures. Key features in Flight SQL include a columnar-oriented design and native support for parallel processing of data partitions. This talk will go over how these new features can push SQL query throughput beyond existing standards such as ODBC.
Agenda:
1. Data Flow Challenges in an Enterprise
2. Introduction to Apache NiFi
3. Core Features
4. Architecture
5. Demo – Simple Lambda Architecture
6. Use Cases
7. Q & A
Ozone is an object store for Hadoop. Ozone solves the small-file problem of HDFS, allowing users to store trillions of files in Ozone and access them as if they were on HDFS. Ozone plugs into existing Hadoop deployments seamlessly, and programs like Hive, LLAP, and Spark work without any modifications. This talk looks at the architecture, reliability, and performance of Ozone.
In this talk, we will also explore the Hadoop distributed storage layer, a block storage layer that makes this scaling possible, and how we plan to use it for scaling HDFS.
We will demonstrate how to install an Ozone cluster, how to create volumes, buckets, and keys, how to run Hive and Spark against HDFS and Ozone file systems using federation, so that users don’t have to worry about where the data is stored. In other words, a full user primer on Ozone will be part of this talk.
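As a small, hedged companion to that primer, keys can also be written and read with an ordinary S3 client once Ozone's S3-compatible gateway is running. The endpoint, credentials, bucket, and key names below are illustrative, and this is not the talk's demo, only a sketch of the access pattern.

```python
# Hedged sketch: talking to Apache Ozone through its S3-compatible gateway.
# Endpoint, credentials, bucket, and key names are illustrative placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://ozone-s3g.example.com:9878",
    aws_access_key_id="testuser",
    aws_secret_access_key="testsecret",
)

s3.create_bucket(Bucket="demo-bucket")          # bucket inside Ozone
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello ozone")

body = s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read()
print(body.decode())                            # -> hello ozone
```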
Speakers
Anu Engineer, Software Engineer, Hortonworks
Xiaoyu Yao, Software Engineer, Hortonworks
Stardog is a fast, scalable, lightweight RDF database for complex SPARQL queries. It features OWL 2 reasoning, transactions, a robust security layer, integrity constraint validation via Pellet 3, and world-class support.
Scale by the Bay 2019: Reprogramming the Programmer (Paul Cleary)
Over the last few years I have built a DNS management system. Initially started as an Event Sourcing application built in Akka, the system had to be re-architected multiple times to address unforeseen issues stemming from new requirements, operational issues, and developer pitfalls (mistakes). This talk will introduce concepts in the DNS domain and different architecture styles including Event Sourcing in Akka and Stream processing in FS2. The talk will describe the journey from inception through to the current system design, highlighting the key challenges encountered along the way and the evolution of the design to account for those challenges. I plan on using real code to demonstrate each architecture along the journey.
What's new with Scala 2.10.0? A brief look at the past, and a detailed look at what's coming down the pipeline.
Note: at the time this presentation was created, Scala 2.10.0 had not been released yet. The final version will probably differ in some ways.
Last update: September 20th
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua... (Databricks)
Spark SQL is one of the most popular components in the big data warehouse space for SQL queries in batch mode, and it allows users to process data from various data sources in a highly efficient way. However, Spark SQL is a general-purpose SQL engine and is not well designed for ad hoc queries. Intel invented an Apache Spark data source plugin called Spinach to fulfill such requirements, by leveraging user-customized indices and fine-grained data cache mechanisms.
To be more specific, Spinach defines a new Parquet-like data storage format, offering a fine-grained hierarchical cache mechanism in the unit of a “Fiber” in memory. Even existing Parquet or ORC data files can be loaded using corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What’s more, Spinach has extended the Spark SQL DDL to allow users to define customized indices on a relation. Currently, B+ tree and bloom filter are the first two types of indices supported. Last but not least, since Spinach resides in the process of the Spark executor, there is no extra effort in deployment. All you need to do is pick Spinach from Spark Packages when launching Spark SQL.
Spinach has been in Baidu’s production environment since Q4 2016. It has helped several teams migrate their regular data analysis tasks from Hive or MR jobs to ad-hoc queries. In Baidu’s search ads system, FengChao, data engineers analyze advertising effectiveness based on several terabytes of display and click logs every day. Spinach brings a 5x boost compared to original Spark SQL (version 2.1), especially in scenarios with complex searches and large data volumes. It reduces the average search cost from minutes to seconds, while adding only a 3% increase in data size for a single index.
Wordnik's technical co-founder Tony Tam describes the reason for going NoSQL. During his talk Tony will discuss the selection criteria, testing + evaluation and successful, zero-downtime migration to MongoDB. Additionally details on Wordnik's speed and stability will be covered as well as how NoSQL technologies have changed the way Wordnik scales.
Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that doesn't scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe will talk about what features and qualities are important for a workflow system.
Fatih Degirmenci, Ericsson, Jack Morgan, Intel
The OPNFV community relies on our community labs, CI and testing projects to ensure we release quality code. The current strategies for using hardware resources in OPNFV community labs will not be able to sustain the community's current growth, so new strategies need to be implemented to make room for new OPNFV projects. The presenters will look at the current lab usage model and discuss work already under way: in OPNFV community labs through the POD descriptor file; in our CI process through Dynamic CI, Cross Community CI and other initiatives; and in our testing projects' use of hardware resources and its importance in the release process. The presenters will also show current tools used to track usage, such as the Bitergia dashboard.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
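The kind of scikit-learn sample used in such a lab looks roughly like the following; the dataset and model choice here are illustrative, not the workshop's exact materials.

```python
# Illustrative scikit-learn sample of the train/evaluate loop covered in the lab.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```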
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL) and HDFS providing that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
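For readers who want to poke at a Phoenix table like the ones described here, below is a hedged sketch using the phoenixdb client against the Phoenix Query Server. The URL, table name, and columns are illustrative, not the article's actual schema.

```python
# Hedged sketch: upserting and querying a Phoenix table via the Query Server.
# URL, table, and columns are illustrative placeholders.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver.example.com:8765/", autocommit=True)
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS crime_events (
        event_id VARCHAR PRIMARY KEY,
        district VARCHAR,
        offense  VARCHAR)
""")

# NiFi's record-oriented processors issue equivalent UPSERTs for each record.
cursor.execute(
    "UPSERT INTO crime_events (event_id, district, offense) VALUES (?, ?, ?)",
    ("evt-001", "22", "THEFT"),
)

cursor.execute("SELECT district, COUNT(*) FROM crime_events GROUP BY district")
print(cursor.fetchall())
```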
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
While HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor the simplest system to operate. Since it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables have to be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where that data should be written. This component is called Global Indexing. Without it, all records get treated as inserts and get re-written to HDFS instead of being updated, which duplicates data and breaks data correctness and user queries. This component is key to scaling our jobs, where we are now handling greater than 500 billion writes a day in our current ingestion systems; it needs strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, and it is critical in allowing us to scale our jobs to the more than 500 billion writes a day our current ingestion systems handle. In this talk, we will discuss data at Uber and expound on why we built the global index using Apache HBase and how it helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to load HFiles directly into the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other lessons learned bringing this system up in production at the scale of data Uber encounters daily.
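A stripped-down, hedged sketch of the record-key to file-location lookup pattern described above could look like the following. It uses the generic happybase Thrift client rather than Uber's actual component, and the table, column family, and key names are hypothetical.

```python
# Hedged sketch of a record-key -> file-location "global index" on HBase,
# using the generic happybase Thrift client. Names are hypothetical; this is
# not Uber's implementation, only the lookup/update pattern it describes.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
index = connection.table("trips_global_index")

def route_record(record_key: str, default_file: str) -> str:
    """Return the HDFS file this record belongs to; register new keys."""
    row = index.row(record_key.encode())
    existing = row.get(b"loc:file")
    if existing:
        # Key seen before: this change is an update to an existing file.
        return existing.decode()
    # First time we see the key: remember where the insert will be written.
    index.put(record_key.encode(), {b"loc:file": default_file.encode()})
    return default_file

target = route_record("trip-8f3a", "hdfs:///warehouse/trips/part-00017.parquet")
print("write record to:", target)
```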
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor considering the variety of data sources that need to be collected and analyzed: everything from application logs, network events, authentication systems, IoT devices, business events, and cloud service logs, and more. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, discuss best use cases for Presto across several industries, and present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those lines of code, even if the person doing the training run makes no special effort to record them. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
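Those "few lines of code" look roughly like the sketch below; the model, parameter, and metric are illustrative placeholders for whatever the training script actually does, and it assumes mlflow and scikit-learn are installed.

```python
# Illustrative MLflow tracking snippet; model, params, and metric are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=200).fit(X_train, y_train)

    mlflow.log_param("C", C)                                    # parameter
    mlflow.log_metric("accuracy", model.score(X_test, y_test))  # metric
    mlflow.sklearn.log_model(model, "model")                    # deployable artifact
```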
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, plus various tools and libraries that help users with both batch and real-time analytics. Our data platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our data platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process, discuss the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we dive into deeply in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as item stock on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring of JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we will discuss what cloud/on-premises strategy we may need to apply AI to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
4. INTRO: THIS PRESENTATION IS ABOUT…
• The history of Riot's data warehouse
• Why we incorporated Hadoop
• Our high level architecture
• Use cases Hadoop has enabled
• Lessons learned
• Where we're headed
5. INTRO: WHO?
• Developer and publisher of League of Legends
• Founded 2006 by gamers for gamers
• Player experience focused – requires data
6. INTRO
• 32.5 million registered players
• 11.5 million monthly players
• 4.2 million daily players
• 1.3 million concurrent players
8. HISTORY: MEET ANDY HO
"With enough data, even simple questions become difficult questions"
9. HISTORY: SCRAPPY START-UP PHASE
• One initial beta environment for North America
• Queries done directly off production MySQL slaves
• This is obviously not a good practice
10. HISTORY: AROUND OUR INITIAL LAUNCH
• Moved to a dedicated, single MySQL instance for the DW
• Data ETL'd from production slaves into this instance (by Andy)
• Queries run in MySQL (by Andy)
• Reporting was done in Excel (by Andy)
This worked great!
11. HISTORY: THEN WE STARTED GROWING
• Resources were focused elsewhere
– We had competition
– Focused on producing features and scaling our systems
• Opened EU environment June 2010
• Needed something speedy – created a parallel installation
– This was bad
– But we could still get the answers we wanted
12. HISTORY: AND THEN – CRAZY GROWTH!
[Chart: total active players (# unique logins) over time, from 1.5MM in July 2011 to 4.2M in Nov. 2011]
13. HISTORY: THE BREAKING POINT
• NA Data Warehouse reached a breaking point 9 months ago
– 24 hours of data took 24.5 hours to ETL
• We couldn't handle…
– multiple environments in a vertical MySQL instance
– a single environment in a vertical MySQL instance
• We needed to change!
15. SOLUTION: WHY HADOOP?
• COST EFFECTIVE – Expanding rapidly, so CAPEX was a concern
• SCALABLE – Handles massive data sets and diverse data sets (both structured and unstructured)
• OPEN SOURCE – Our engineers can dive into problems
• SPEED OF EXECUTION – We needed to move fast!
16. SOLUTION: HIGH LEVEL ARCHITECTURE – CURRENT
[Diagram: Audit, Plat and LoL databases in North America, Europe and Korea feed a Pentaho + custom ETL + Sqoop pipeline into the Hive data warehouse; Pentaho then loads cubes into MySQL for Tableau and business analysts, while analysts also query Hive directly]
17. SOLUTION: WHAT MAKES UP OUR ETL
[Same architecture diagram, highlighting the ETL layer: Pentaho + custom ETL + Sqoop between the regional sources and the Hive data warehouse]
18. SOLUTION: WHAT MAKES UP OUR ETL
• All of these orchestrated by Pentaho
• We use Sqoop for staging data only
• Then dynamically partition data into Hive tables
19. SOLUTION: WHAT MAKES UP OUR ETL
[Same architecture diagram, highlighting the Hive data warehouse]
20. SOLUTION: WHAT MAKES UP OUR ETL
[Diagram: Hive data warehouse with a temp staging area]
1. Data written into temp staging area
• Prevents analysts from running queries out of partially written tables
• Helps us leverage Hive's merging and compression settings
21. SOLUTION: WHAT MAKES UP OUR ETL
[Diagram: temp staging area feeding partitions A–E in the Hive data warehouse]
2. Hive dynamically inserts data into appropriate partitions
• According to the value generated for the partition key in the target table
• Non-existent partitions will be created by Hive
22. SOLUTION: WHAT MAKES UP OUR ETL
[Diagram: each partition A–E further split into sub-partitions, e.g. A1–A3]
3. Layered partitioning = very helpful for region-based partitioning
• Helps maintain one table definition across regions (see the HiveQL sketch below)
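The staging-then-dynamic-partition flow on these slides maps onto a short piece of HiveQL. A minimal sketch, using the standard Hive dynamic partition settings but with hypothetical table and column names (game_stats_staging, game_stats, region, game_date), since the deck does not show the real schema:

-- Let Hive create partitions on the fly; non-existent partitions are created automatically
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Move data from the temp staging area into layered (region, date) partitions
INSERT OVERWRITE TABLE game_stats PARTITION (region, game_date)
SELECT
  s.game_id,
  s.player_id,
  s.champion_id,
  s.win,
  s.region,      -- first partition level: region-based partitioning
  s.game_date    -- second partition level: one table definition across regions
FROM game_stats_staging s;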
23. SOLUTION: WHAT MAKES UP OUR ETL
TO OPTIMIZE DISK IO FOR USER QUERIES, WE ENABLED COMPRESSION
24. SOLUTION: WHY COMPRESSION?
• We have 24 cores and disk IO is always the bottleneck, so compression is essential
WHY SNAPPY COMPRESSED SEQUENCEFILE BLOCKS?
• Lots of "why Snappy" discussion on the interwebs already
• SequenceFile can be split by Hadoop and can run multiple maps in parallel
• Block compression yields a better compression ratio while keeping the file splittable; this block size is configurable (example settings sketched below)
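The Snappy block-compressed SequenceFile setup described on this slide corresponds to a handful of Hive session settings plus a table definition. A sketch under those assumptions (property names are from the Hadoop 1.x era the deck covers; the table and its columns are hypothetical):

-- Compress query output written into warehouse tables
SET hive.exec.compress.output=true;
-- Snappy codec, with SequenceFile block-level compression so files stay splittable
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

-- Warehouse tables stored as SequenceFiles (hypothetical example table)
CREATE TABLE game_stats (
  game_id     BIGINT,
  player_id   BIGINT,
  champion_id INT,
  win         BOOLEAN
)
PARTITIONED BY (region STRING, game_date STRING)
STORED AS SEQUENCEFILE;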
25. SOLUTION: WHAT WE DO IN HIVE
• We ETL data from OLTP MySQL slaves into the Hive data warehouse daily
26. SOLUTION: WHAT WE DO IN HIVE
• Our analysts shoot Hive queries at the warehouse every day
• Translating to 1000s of MR jobs daily
27. SOLUTION: WHAT WE DO IN HIVE
• We have some pretty large tables: e.g., one with 50,795,997,734 rows
• We use metrics derived from Hive queries to improve our matchmaking system and player behavior
28. SOLUTION: WHAT DID WE LEARN FROM ETL?
• If you use custom ETL, keep an eye out for block distribution
• DRY: re-inventing the wheel is not a good idea
– Invest time in researching proper tools that suit your needs
– Tons of options for ETL and workflow management
– Just because company X is using a particular ETL or workflow management tool, it may or may not work effectively for you
29. SOLUTION: WHY TABLEAU?
[Same architecture diagram, highlighting the MySQL cubes and Tableau on the business-analyst side]
30. SOLUTION: WHY TABLEAU?
• We needed to democratize access for non-technical folks
– Design
– Execs
– Player Support
• Great visualization capability
• Easy to work with
• Has a Hive connector*
31. SOLUTION: LEAGUE OF LEGENDS GAMEPLAY BASICS
35. WAIT, SO WHAT'S A YORDLE?
• Yordles = very cute race of champions in League of Legends
• We track Yordles (and the rest of our champions) because game balance is exceptionally important
36. USECASE #1: DESIGN BALANCE IS IMPORTANT
• Highly competitive game
• Updated every 2-3 weeks
– New champions
– New items
• Game is a living, breathing service that's always in motion
• Have to maintain a level playing field
37. USECASE #1: QUICKLY REACTING TO CHANGES
[Chart: champion wins vs. total plays over time]
38. USECASE #1: HOW DID WE CREATE THAT?
[Diagram of the tools involved]
*All logos are trademarks of respective owners
39. USECASE #1: WHY NOT JUST HIVE?
*All logos are trademarks of respective owners
40. USECASE #1: WHY NOT JUST HIVE?
HIVE IS FOR MASSIVE JOBS
41. USECASE #1: HIVE TO MYSQL TRANSFORMATION
• Many of our stakeholders use Tableau
• Transformed required data into cubes for direct Tableau consumption using Pentaho
• Initially experimented with the Hive-to-Tableau connector
– Had issues, e.g., triggering MR jobs for every change and a non-persistent Hive server
42. USECASE #1: WE WANTED TO KNOW MORE ABOUT…
• Which champions and skins are popular across all regions?
• What are the win-rates of champions across all regions?
• Are better players choosing different champions?
43. USECASE #1: WE CREATED CUBES OF AGGREGATED DATA
[Chart: win rates by champion] (the kind of aggregate behind it is sketched below)
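The cube behind a chart like this boils down to an aggregate over the fact table in Hive before Pentaho moves the result into MySQL. A minimal sketch, reusing the hypothetical game_stats table from the earlier examples:

-- Play counts and win rate per champion and region; the result is small enough
-- to export into a MySQL cube for Tableau
SELECT
  region,
  champion_id,
  COUNT(*)                      AS games_played,
  SUM(IF(win, 1, 0)) / COUNT(*) AS win_rate
FROM game_stats
GROUP BY region, champion_id;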
44. USECASE #1: HOW WE DID IT – TRANSFORMATION++
[Diagram: massive tables reside in Hive → a Hive transformation creates dimension tables → data is transformed into cubes in MySQL for Tableau consumption; some dimension tables are moved back to join with other fact tables in Hive]
45. USECASE #1: WHY DID WE GO THIS ROUTE?
• Hive: not good for slowly changing dimensions
– No automatic primary key generation
– Can't regenerate a dimension table quickly enough, since it requires a full-table scan
• MySQL: not awesome for joining massive tables
• Decided to use the best of both worlds
• Also leveraged map-side joins and the distributed cache (sketched below)
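Joining the massive Hive fact table against a small dimension table is where map-side joins and the distributed cache come in: the small table is shipped to every mapper, so only the big table is streamed through the join. A minimal sketch with hypothetical names (champion_dim is assumed):

-- Hint Hive to broadcast the small dimension table to all mappers
SELECT /*+ MAPJOIN(c) */
  c.champion_name,
  g.region,
  COUNT(*) AS games_played
FROM game_stats g
JOIN champion_dim c ON (g.champion_id = c.champion_id)
GROUP BY c.champion_name, g.region;

-- Hive can also convert such joins automatically when the small-table side fits in memory:
SET hive.auto.convert.join=true;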
47. USECASE #2: FIRST, SOME CONTEXT
• League of Legends is global in scale, with players logging in from >145 countries in a typical day
• No-fee play means a very low barrier to play
• Players often play on multiple environments regularly (e.g. EU players on NA environments and vice versa)
• Same features and mechanics deployed in all territories
• It's vitally important that we understand game performance metrics by geography and region
48. USECASE #2: MATCHMAKING
• One of the most important features outside of gameplay
• Like a dating service, the objective is to match people up
• Number of different queues that players can line up in, depending on the type of match they're looking for
• Critical that this system is balanced and able to create good matches quickly
49. USECASE #2: MATCHMAKING – IS IT WORKING?
• Matchmaking algorithm based on a modified Elo system
• Inspecting the "curve" of these scores:
– Should show a similar distribution in all regions
– May show interesting trends, such as win/lose ratios
50. USECASE #2: MATCHMAKING – IS IT WORKING?
[Chart: Elo distribution graph, % of players vs. Elo score] (a query to produce this kind of distribution is sketched below)
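A distribution like this can be computed by bucketing Elo scores and counting players per bucket; normalizing to a percentage of players can then happen downstream in the cube or in Tableau. A minimal sketch, assuming a hypothetical player_ratings table with region and elo columns:

-- Count players per 50-point Elo bucket, per region
SELECT
  region,
  FLOOR(elo / 50) * 50 AS elo_bucket,
  COUNT(*)             AS players
FROM player_ratings
GROUP BY region, FLOOR(elo / 50) * 50;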
51. USECASE #2: WHAT WAS NEEDED TO GENERATE IT?
1. Had to join massive tables with session, player and game data
2. Needed to look up and range-query IP addresses in the same join
• Required for many region-based metrics
52. USECASE #2: LIMITATIONS OF HIVE
• No good indexing mechanism in our version
• Not efficient for lookup and range queries
• This made region-based queries computationally difficult
53. USECASE #2: SOLUTION
• Leveraged open-source GeoIP UDF libraries found online (usage sketched below)
• UDFs = user-defined functions that one can add to the Hive interpreter
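Wiring a GeoIP UDF into Hive uses the standard ADD JAR / CREATE TEMPORARY FUNCTION mechanism. The jar path, class name, function and session_data table below are hypothetical, since the deck only says an open-source GeoIP UDF library was used:

-- Register the UDF for this Hive session
ADD JAR /path/to/geoip-udf.jar;                                                   -- hypothetical path
CREATE TEMPORARY FUNCTION geoip_country AS 'com.example.hive.udf.GeoIPCountry';  -- hypothetical class

-- Resolve each session's IP to a country inside the query, avoiding an explicit range join
SELECT
  geoip_country(s.ip_address)  AS country,
  COUNT(DISTINCT s.player_id)  AS players
FROM session_data s
GROUP BY geoip_country(s.ip_address);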
55. LESSONS: LESSON #1
Analysts are greedy in a mid-sized cluster
• Enable a scheduler with cap-limited resources for users
56. LESSONS: LESSON #2
Configuration should follow the usecase
• Hardware profiles and file structure should match the workload
• Enabling compression to trade disk IO for CPU helped us
• Larger block sizes on beefy machines are performant; 256MB worked for us – YMMV
• Leverage all the spindles; stripe directory access with appropriate parameters
57. LESSONS: LESSON #3
Cover your downside (and your backside)
• RAID the Namenode, Jobtracker and Secondary Namenode (Checkpoint Node)
• Back up the namespace through HTTP
• Backups and Trash configs are crucial – remember "rmr" doesn't give you a warning
58. LESSONS: LESSON #4
Plan capacity for at least one year ahead
• Instead of using the production cluster, have a dark launch cluster for experimenting with new usecases
• In-place upgrades are trickier than with other enterprise software, since wire-compatibility is not available in current distros
59. LESSONS: LESSON #5
Automate for reality!
• We wrote chef recipes and open-sourced them at https://github.com/RiotGames/cloudera-cookbook
• Copyright 2012 Riot Games – licensed under the Apache License, Version 2.0
61. THE FUTURE: OUR IMMEDIATE GOALS
• Shorten time to insight
• Increase depth of insight
• Enable data analysis for client-side features
• Log ingestion and analysis
• Flexible auditing framework
• International data infrastructure
62. THE FUTURE: CHALLENGE – MAKE IT GLOBAL
• Data centers across the globe since latency has a huge effect on gameplay → log data scattered around the world
• Large presence in Asia – some areas (e.g., PH) have bandwidth challenges or bandwidth is expensive
63. THE FUTURE: CHALLENGE – WE HAVE BIG DATA
• Structured data: 500GB daily
• Application and operational logs: 4.5TB daily
• Official LoL site traffic: 6MM hits daily
• Riot YouTube channel: 1.7MM subscribers, 270+MM views
• Plus chat logs, detailed gameplay event tracking, and so on…
64. THE FUTURE: OUR AUDACIOUS GOALS
• Build a world-class data and analytics organization
– Deeply understand players across the globe
– Apply that understanding to improve games for players
– Deeply understand our entire ecosystem, including social media
• Have the ability to identify, understand and react to meaningful trends in real time
• Have a deep, real-time understanding of our systems from player experience and operational standpoints
65. THE FUTURE: SHAMELESS HIRING PLUG
• Like most everybody else at this conference… we're hiring!
• The Riot Manifesto
– Player experience first
– Challenge convention
– Focus on talent and team
– Take play seriously
– Stay hungry, stay humble
66. THE FUTURE: SHAMELESS HIRING PLUG
And yes, you can play games at work. It's encouraged!