This is a talk given for the Scalable Internet Services Masters-level Computer Science class at UCLA and UCSB. It briefly discusses the server architecture for the game League of Legends before going into depth about how the data warehouse can hold petabytes of player data. Discussion of message queue architecture and scalability occurs along the way.
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp... - StampedeCon
At the StampedeCon 2013 Big Data conference in St. Louis, Riot Games discussed Using Hadoop to Understand and Improve Player Experience. Riot Games aims to be the most player-focused game company in the world. To fulfill that mission, it’s vital we develop a deep, detailed understanding of players’ experiences. This is particularly challenging since our debut title, League of Legends, is one of the most played video games in the world, with more than 32 million active monthly players across the globe. In this presentation, we’ll discuss several use cases where we sought to understand and improve the player experience, the challenges we faced to solve those use cases, and the big data infrastructure that supports our capability to provide continued insight.
Achieving Continuous Delivery: An Automation Story - jimi-c
Continuous Deployment is the act of deploying software constantly. The idea is that if "release early, release often" is good, releasing very often is better. It's not trivial: automation is part of the battle, and testing is another. Learn to use tools like Jenkins and Ansible to move from deploying software once a month to 15 times every hour, and why you'll want to.
Presented at PyCon 2015 in Montreal
Would you ever play an online game if you were not able to communicate with your teammates? Isn’t it fun if you can make new friends, arrange pre-made games and celebrate your victories with people you like to play with?
Riot Games' League of Legends handles millions of online players at any given time. Each chat server is responsible for routing over 1 billion real-time events a day. In order to support the overwhelming user base, be prepared for future growth, and pave the road for upcoming features, the chat infrastructure had to be designed and built with the utmost care, so that it would never fail the players.
In this talk I would like to present how we achieved linear scalability, improved the overall fault tolerance, created a framework for real time code upgrades and got ready for the new features we want to ship. I will also discuss in detail why we chose to use Erlang as a foundation for the system, and why we migrated our data from MySQL to Riak.
Mobile Library Development - stuck between a pod and a jar file - Zan Markan... - Codemotion
Isaac Newton, the father of modern software engineering, called it "standing on the shoulders of giants". Modern development is exciting as it gets easier and easier, partly because of the wealth of resources available at our fingertips. One category of these resources is libraries, SDKs, and frameworks. This talk will be a guide to the considerations that go into building a library for both iOS/Swift and Java/Android, taking cues from my personal experience as well as from studying how the leaders in the field do it.
Scaling Your First 1000 Containers with Docker - Atlassian
Deploying large numbers of containers to production can be a difficult proposition if you don’t approach the problem with the right strategy – one that's appropriate for both your developers and the size of your operations team. Choosing a strategy lets you codify your deployment patterns in a repeatable manner and reuse them over hundreds of deployments without incurring unnecessary cost and complexity.
Using Atlassian’s PaaS as a model, we will discuss important milestones as you scale from a single container to tens, hundreds, and eventually to a thousand containers. At what points should you begin to embrace log aggregation? How about monitoring and metrics collection? Orchestration and clustering solutions? Learn how to incorporate ever more sophisticated third-party solutions as you go, to achieve cost-effective and stable management of your containers in production.
Webinar: Queues with RabbitMQ - Lorna Mitchell - Codemotion
Queues are a great addition to any application that has some tasks that need processing asynchronously. This could be sending a confirmation email, resizing an avatar, or recalculating a running total of some kind; in all those cases it would be cool to send the response back to the user and then sort out that task later. This session looks at how to use a RabbitMQ job queue in your application. It also looks at how to design elegant and robust long-running workers that will consume the jobs from the queue and process them. This session is ideal for technical leads, developers and architects alike.
Cracking the nut, solving edge AI with Apache tools and frameworks - Timothy Spann
Using the FLaNK stack for Edge AI and Streaming AI.
Apache Flink, Apache Kafka, Apache NiFi, Apache Kudu, DJL, Apache MXNet, Apache OpenNLP, Apache Tika, Apache Hue, Apache Hadoop, Apache HDFS
Presented at AI DevWorld 2020 virtual
Autoscaling Best Practices - WebPerf Barcelona Oct 2014 - Marc Cluet
This talk is an evolution of the one presented at FOSDEM'14. We talk about common practices and methodologies for autoscaling, and also cover some best practices and the global scope of autoscaling inside your infrastructure.
Learn how, in less than 6 months and with a 1-person team, BigPanda (http://bigpanda.io) went from no infrastructure automation to having all of its infrastructure automated with Ansible. Learn how BigPanda handles zero-downtime infrastructure updates and connects Ansible with its chat infrastructure, plus some strategies for managing automation projects with very small teams.
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -... - Chris Fregly
Traditional machine learning pipelines end with lifeless models sitting on disk in the research lab. These traditional models are typically trained on stale, offline, historical batch data. Static models and stale data are not sufficient to power today's modern, AI-first enterprises that require continuous model training, continuous model optimizations, and lightning-fast model experiments directly in production. Through a series of open source, hands-on demos and exercises, we will use PipelineAI to breathe life into these models using 4 new techniques that we've pioneered:
* Continuous Validation (V)
* Continuous Optimizing (O)
* Continuous Training (T)
* Continuous Explainability (E).
The Continuous "VOTE" techniques have proven to maximize pipeline efficiency, minimize pipeline costs, and increase pipeline insight at every stage, from continuous model training (offline) to live model serving (online).
Attendees will learn to create continuous machine learning pipelines in production with PipelineAI, TensorFlow, and Kafka.
From Code to the Monkeys: Continuous Delivery at Netflix - Dianne Marsh
At Netflix, we continue to improve upon our continuous delivery process. We thrive in a hybrid environment, where every developer is able to deploy code, and with that freedom comes the responsibility for ensuring that our customers are not negatively impacted. We have constructed Open Source tools toward a Continuous Delivery solution. In this presentation, from QConSF 2013, you will learn about our tool chain so that you can determine which tools make sense in your environment.
Using Apache MXNet in production deep learning streaming pipelines - Timothy Spann
As a Data Engineer I am often tasked with taking Machine Learning and Deep Learning models into production, sometimes in the cloud and sometimes at the edge. I have developed Java code that allows us to run these models at the edge and as part of a sensor/webcam/images/data stream. I have developed custom interfaces in Apache NiFi to enable real-time classification against MXNet models directly through the Java API or through DJL.AI's Java interface. I will demo running models on NVIDIA Jetson Nanos and NVIDIA Xavier NX devices as well as in the cloud.
Technologies utilized: Apache MXNet, DJL.AI, NVIDIA Jetson Nano, NVIDIA Jetson Xavier, Apache NiFi, MiNiFi, Java, Python.
Serverless in Production, an experience report (AWS UG South Wales) - Yan Cui
AWS Lambda has changed the way we deploy and run software, but this new serverless paradigm has created new challenges to old problems - how do you test a cloud-hosted function locally? How do you monitor them? What about logging and config management? And how do we start migrating from existing architectures?
In this talk Yan and Scott will discuss solutions to these challenges by drawing from real-world experience running Lambda in production and migrating from an existing monolithic architecture.
How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent... - Amazon Web Services
Parse is a BaaS for mobile developers that is built entirely on AWS. With over 150,000 mobile apps hosted on Parse, the stability of the platform is our primary concern, but it coexists with rapid growth and a demanding release schedule. This session is a technical discussion of the current architecture and the design decisions that went into scaling the platform rapidly and robustly over the past year and a half. We talk about some of the lessons learned managing and scaling MongoDB, Cassandra, Redis, and MySQL in the cloud. We also discuss how Parse went from launching individual instances using Chef to managing clusters of hosts with Auto Scaling groups, with instance discovery and registry handled by ZooKeeper, thus enabling us to manage vastly larger sets of services with fewer human resources. This session is useful to anyone who is trying to scale up from startup to established platform without sacrificing agility.
API World Apache NiFi 101 (2021)
https://github.com/tspannhw/EverythingApacheNiFi
Apache NiFi 101 with Apache Pulsar: integration, basics, no spaghetti flows
https://emamo.com/event/api-world-2021/s/pro-talk-api-apache-nifi-101-introduction-and-best-practices-orYbeW
Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at the edge before we start our real-time streaming flows. Fortunately, using the all-Apache FLiP stack we can do this with ease! Streaming AI-powered analytics from the edge to the data center is now a simple use case. With MiNiFi we can ingest the data, run data checks and cleansing, run machine learning and deep learning models, and route our data in real time to Apache NiFi and/or Apache Pulsar for further transformations and processing. Apache Flink will provide our advanced streaming capabilities, fed in real time via Apache Pulsar topics. Apache MXNet models will run both at the edge and in our data centers via Apache NiFi and MiNiFi.
Timothy Spann
Developer Advocate @ StreamNative
ex-Principal Field Engineer @ Cloudera
ex-Senior Sales Engineer @ Hortonworks
ex-Senior Field Engineer at Pivotal
Serverless in production, an experience report (FullStack 2018) - Yan Cui
AWS Lambda has changed the way we deploy and run software, but this new serverless paradigm has created new challenges to old problems - how do you test a cloud-hosted function locally? How do you monitor them? What about logging and config management? And how do we start migrating from existing architectures?
In this talk Yan and Scott will discuss solutions to these challenges by drawing from real-world experience running Lambda in production and migrating from an existing monolithic architecture.
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer... - Chris Fregly
Perform Online Predictions using Slack
A/B and multi-armed bandit model comparison
Train Online Models with Kafka Streams
Create new models quickly
Deploy to production safely
Mirror traffic to validate online performance
Any Framework, Any Hardware, Any Cloud
Dashboard to manage the lifecycle of models from local development to live production
Generates optimized runtimes for the models
Custom targeting rules, shadow mode, and percentage-based rollouts to safely test features in live production
Continuous model training, model validation, and pipeline optimization
https://youtu.be/zpkH9oiIovU
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/258276286/
Related Links
PipelineAI Home: https://pipeline.ai
PipelineAI Community Edition: https://community.pipeline.ai
PipelineAI GitHub: https://github.com/PipelineAI/pipeline
PipelineAI Quick Start: https://quickstart.pipeline.ai
Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
YouTube Videos: https://youtube.pipeline.ai
SlideShare Presentations: https://slideshare.pipeline.ai
Slack Support: https://joinslack.pipeline.ai
Web Support and Knowledge Base: https://support.pipeline.ai
Email Support: help@pipeline.ai
Red Hat Nordics 2020 - Apache Camel 3 the next generation of enterprise integ... - Claus Ibsen
In this session, we'll focus on:
Camel 3: Demos of how Camel 3, Camel K and Camel Quarkus all work together, and will provide insights into Camel’s role in the next major release of Red Hat Integration products.
Camel K: This serverless integration platform provides low-code/no-code capabilities, where integrations can be snapped together quickly using the powers from integration patterns and Camel’s extensive set of connectors.
Camel Quarkus: The fast runtime of Quarkus, used together with Camel K and Knative, brings awesome serverless features, such as auto-scaling, scaling to zero, and event-based communication, with great integration capabilities from Apache Camel.
You will also hear about the latest Camel sub-project, Camel Kafka Connector, which makes it possible to use all the Camel components as Kafka Connect connectors.
Finally, we bring details of the roadmap for what is coming up in the Camel projects.
Cowboy dating with big data - TechDays at Lohika 2020 - b0ris_1
A story about the things that happen when data platforms are developed by people who are not data engineers, and the pitfalls and mistakes that can be made.
This will help you understand what data engineering is about.
The post release technologies of Crysis 3 (Slides Only) - Stewart Needham
For AAA games there is now a consumer expectation that the developer has a post-release strategy. This strategy goes beyond just DLC content: users expect to receive bug fixes, balancing updates, game-mode variations and constant tuning of the game experience. So how can you architect your game technology to facilitate all of this? Stewart explains the unique patching system developed for Crysis 3 multiplayer, which allowed the team to hot-patch pretty much any asset or data used by the game. He also details the supporting telemetry, server and testing infrastructure required to support this, along with some interesting lessons learned.
Leveraging Open Source to Manage SAN Performance - brettallison
Scope: the primary focus of this presentation is how to leverage open source software to help manage shared storage performance. The storage server will be the focus, with particular emphasis on ESS. This is a small, one-off solution.
Netflix - Pig with Lipstick by Jeff Magnusson - Hakka Labs
In this talk, Jeff Magnusson, Manager of Data Platform Architecture at Netflix, discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. This talk was recorded at Samsung R&D.
While Pig provides a great level of abstraction between MapReduce and dataflow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. The recently open sourced Lipstick solves this problem. Jeff emphasizes the architecture, implementation, and future of Lipstick, as well as various use cases around using Lipstick at Netflix (e.g. examples of using Lipstick to improve speed of development and efficiency of new and existing scripts).
Jeff manages the Data Platform Architecture group at Netflix where he is helping to build a service oriented architecture that enables easy access to large scale cloud based analytical processing and analysis of data across the organization. Prior to Netflix, he received his PhD from the University of Florida focusing on database system implementation.
Slides from the Big Data Gurus meetup at Samsung R&D, August 14, 2013
This presentation covers the high level architecture of the Netflix Data Platform with a deep dive into the architecture, implementation, use cases, and future of Lipstick (https://github.com/Netflix/Lipstick) - our open source tool for graphically analyzing and monitoring the execution of Apache Pig scripts.
Netflix uses Apache Pig to express many complex data manipulation and analytics workflows. While Pig provides a great level of abstraction between MapReduce and data flow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. To address this problem, we created (and open sourced) a tool named Lipstick that visualizes and monitors the progress and performance of Pig scripts.
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms - Anant Corporation
During this lunch, we’ll review open-source reverse ETL tools to uncover how to send data back to SaaS systems.
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the... - Amazon Web Services
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology, and Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. The combination of the two can power advanced analytics, not only for what has happened in the past but also to make intelligent predictions about the future. Please join this webinar to learn how to get the most value from your data for your data-driven business.
Learning Objectives:
How to scale your Redshift queries with user-defined functions (UDFs)
How to apply Machine learning to historical data in Amazon Redshift
How to visualize your data with Amazon QuickSight
Present a reference architecture for advanced analytics
Who Should Attend:
Application developers looking to add UDFs or predictive analytics to their applications; database administrators that need to meet the demands of data-driven organizations; decision makers looking to derive more insight from their data
This presentation was given to the Dublin Node (JS) Community on May 29th 2014.
Presented by: Chris Lawless, Kevin Yu Wei Xia, Fergal Carroll @phergalkarl, Ciarán Ó hUallacháin, and Aman Kohli @akohli
View IT operations as a flow of data (sources of truth) through work-cells (automation processes) to deliver value to the customer.
There should be only one source of truth for every piece of configuration data.
Device configurations are a poor source of truth.
Metadata and Provenance for ML Pipelines with Hopsworks - Jim Dowling
This talk describes the scale-out, consistent metadata architecture of Hopsworks and how we use it to support custom metadata and provenance for ML pipelines with the Hopsworks Feature Store, NDB, and ePipe. The talk is here: https://www.youtube.com/watch?v=oPp8PJ9QBnU&feature=emb_logo
Similar to Riot Games Scalable Data Warehouse Lecture at UCSB / UCLA (20)
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group ("MCG") expects demand to grow and supply to evolve, facilitated through institutional investment rotation out of offices and into work from home ("WFH"), alongside the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. SEAN MALONEY
BIG DATA ENGINEER
WHO IS THIS GUY?
- Lead developer on Riot's ETL tools
- Intern at Appfolio
FUN FACT: Was a student in this class 4 years ago
3. MOVING MOUNTAINS OF DATA
1. INTRODUCTION
2. THE GAME PLATFORM: OUR MAIN DATA SOURCE
3. HOW WE INGEST AND QUERY DATA
4. HOW WE SCALE IN AWS
5. CONCLUSION - SEAN'S PRO TIPS
14. [Diagram] The game platform's databases: services (Chat, Store, Audit, Game, etc.) sit on Oracle Coherence (an in-memory DB); the same schemas are replicated across a primary DB, a hot backup DB, and a 2nd backup DB that feeds ETL.
18. [Pipeline diagram] Ingestion -> storage -> query / views -> visualization tools:
- Pull-based / ETL ingestion: FuETL (OLTP game data, external data sources)
- Push-based ingestion: Honu (anything pushed to it, e.g. server logs)
- Storage: the master warehouse
- Query / views: batch queries, single-row queries, aggregate queries, plus data auditing
19. [The same pipeline diagram, repeated.]
20. Distributed ETL software written in Ruby.
- Scales horizontally
- Same ETL applied to multiple regions / datacenters
- Self-service UI with SQL query templating
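To make the templating idea concrete, here is a minimal sketch in Java of how a per-region SQL template might be expanded. The placeholder syntax, region list, and helper names are illustrative assumptions, not the deck's actual implementation:

import java.util.List;
import java.util.Map;

public class TemplatedEtl {
    // One query template, expanded once per region / datacenter.
    static final String TEMPLATE =
        "INSERT INTO games_played "
        + "(SELECT * FROM games_played_${region} WHERE date >= '${start_date}')";

    public static void main(String[] args) {
        for (String region : List.of("na", "euw", "kr")) {
            String sql = render(TEMPLATE, Map.of("region", region, "start_date", "2015-10-25"));
            System.out.println(sql); // a real worker would submit this to the warehouse
        }
    }

    static String render(String template, Map<String, String> params) {
        String out = template;
        for (Map.Entry<String, String> e : params.entrySet()) {
            out = out.replace("${" + e.getKey() + "}", e.getValue());
        }
        return out;
    }
}

The appeal of this design is that one vetted query definition can be stamped out across every region and datacenter, rather than maintaining a separate hand-written job per region.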
28. [Diagram] ETL webapp architecture: a webapp (view layer: backbone.js, Bootstrap CSS) and a command-line tool sit on top of core libraries consisting of a Task Service (Tasks, Task DAO), a Helper Service (Helpers, Helper DAO), and an Environment Service (Environment DAO, Env. Task DAO, Env. Helper DAO); a scheduler process and worker processes drive the tasks and helpers through task / helper controllers.
29. [The same webapp architecture diagram, repeated.]
30. [The same webapp architecture diagram, repeated.]
31. [The same webapp architecture diagram, repeated.]
35. Idempotency
Idempotent: an operation that will produce the same result whether executed once or multiple times.
EXAMPLE:
Non-idempotent: x = x * 5; submitting a purchase
Idempotent: abs( abs(x) ) = abs(x); cancelling a purchase
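The same distinction, as a runnable sketch (the values are arbitrary):

public class IdempotencyDemo {
    public static void main(String[] args) {
        int x = -3;
        // Non-idempotent: applying the operation again changes the result.
        x = x * 5;  // -15
        x = x * 5;  // -75, so running twice is not the same as running once
        // Idempotent: applying the operation again changes nothing.
        int once  = Math.abs(-3);            // 3
        int twice = Math.abs(Math.abs(-3));  // still 3: abs(abs(x)) == abs(x)
        System.out.println(x + " " + once + " " + twice);
    }
}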
36. Idempotent?
In the transactional OLTP world….
INSERT INTO games_played
(SELECT * FROM games_played_na
WHERE date >= ‘2015-10-25’)
37. Idempotent?
In the big data / OLAP world….
INSERT INTO games_played
(SELECT * FROM games_played_na
WHERE date >= ‘2015-10-25’)
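The point the two slides are driving at: in an OLTP database, a primary key or unique constraint can reject the duplicate rows from a re-run, but big data / OLAP tables usually have no such constraints, so running the INSERT above twice simply doubles the data. One common way to make such a load idempotent in a Hive-style warehouse is to overwrite the affected partitions instead of appending. A sketch over JDBC (the HiveServer2 URL is a placeholder, and this assumes games_played is partitioned by date, with the partition column last in the select list as dynamic partitioning requires):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IdempotentLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://warehouse:10000/default");
             Statement stmt = conn.createStatement()) {
            stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict");
            // INSERT OVERWRITE replaces each touched partition's contents,
            // so re-running the load yields the same rows, not duplicates.
            stmt.execute("INSERT OVERWRITE TABLE games_played PARTITION (date) "
                + "SELECT * FROM games_played_na WHERE date >= '2015-10-25'");
        }
    }
}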
41. Message Queues
● AMAZON SIMPLE QUEUE SERVICE
● APACHE ACTIVEMQ
● RABBITMQ
● HORNETQ
● MICROSOFT MQ (MSMQ)
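To show the shape of the pattern with the first option above, here is a minimal producer/consumer sketch against Amazon SQS using the AWS SDK for Java v2; the queue URL and message body are placeholders, not anything from Riot's system:

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class QueueDemo {
    public static void main(String[] args) {
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/game-events"; // placeholder

        try (SqsClient sqs = SqsClient.create()) {
            // Producer: the game platform pushes an event.
            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .messageBody("{\"game_id\": 42, \"event\": \"game_finished\"}")
                    .build());

            // Consumer: an ETL worker pulls events and acknowledges each one
            // by deleting it; unacknowledged messages are redelivered later.
            for (Message m : sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .maxNumberOfMessages(10)
                    .build()).messages()) {
                System.out.println("processing " + m.body());
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .receiptHandle(m.receiptHandle())
                        .build());
            }
        }
    }
}

Redelivery after a crash is exactly why the idempotency discussion matters: the consumer may see the same message more than once.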
42. [The same pipeline diagram, repeated.]
43. Honu
- Self-service, custom HTTP edge service (Java)
- Fronted by an ELB in front of ~40 autoscaled m1.xlarge instances
- Forwards JSON data indirectly to S3
- The batches then need to be unpacked and converted into Hive tables
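Honu's actual code isn't shown in the deck, but the shape of such an edge service is easy to sketch with the JDK's built-in HTTP server. The path, in-memory batching, and deferred S3 hand-off below are illustrative assumptions only:

import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class EdgeService {
    // Events accumulate here; a background job would flush batches toward S3.
    static final List<String> batch = new CopyOnWriteArrayList<>();

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/v1/events", exchange -> {
            try (InputStream in = exchange.getRequestBody()) {
                // Accept raw JSON as-is: self-service means no schema enforcement here.
                batch.add(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
            exchange.sendResponseHeaders(202, -1); // accepted; processing happens later
            exchange.close();
        });
        server.start();
    }
}

Accepting and acknowledging quickly, then unpacking into Hive tables later, is what makes the "indirectly to S3" step cheap for producers.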
44. Honu
- Self-service, custom HTTP edge service (Java API)
- Custom collector infrastructure (Java), derived from Netflix Suro
- Deployed in every data center worldwide and also in AWS
50. Idempotency
Use application logic to make processing idempotent:
msg = queue.pop();
if (processed_games.contains(msg.game_id)) {
  return; // already processed, do nothing
} else {
  process_game(msg);
}
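For this check to work across many workers and survive restarts, processed_games needs to live in a shared store. The deck doesn't name one; purely as an illustrative assumption, a DynamoDB conditional write (AWS SDK for Java v2) lets exactly one delivery claim each game_id:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class IdempotentConsumer {
    static final DynamoDbClient ddb = DynamoDbClient.create();

    // Returns true the first time a game_id is seen; false on redeliveries.
    static boolean markProcessed(String gameId) {
        try {
            ddb.putItem(PutItemRequest.builder()
                    .tableName("processed_games") // hypothetical table keyed on game_id
                    .item(Map.of("game_id", AttributeValue.builder().s(gameId).build()))
                    .conditionExpression("attribute_not_exists(game_id)")
                    .build());
            return true;
        } catch (ConditionalCheckFailedException alreadySeen) {
            return false; // a previous delivery already claimed this game
        }
    }
}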
51. THE DOWNSIDE
What's in there?
- The data team doesn't know everything that is submitted
Compliance
- Are we violating international data laws?
Inconsistent data structure
- It's formatted however the developer submits it
52. SELF SERVICE: HOW?
- User documentation: no one likes writing it, but it helps a lot
- Onboarding training: get new coworkers in the know
- Familiar protocols: use REST or RPC so developers are on the same page
- Focus on UX: your tools need to be easy for non-technical people to use
53. [The same pipeline diagram, repeated.]
57. [The same pipeline diagram, repeated, introducing the data auditing stage.]
58. Warehouse Auditing Service
- REST micro-service built with Java and Docker
- Reports and visualizations we can use to find problems
- Compares the source (the game platform) against the target (the warehouse)
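As a minimal illustration of source-vs-target comparison, the sketch below counts rows on both sides for one day over JDBC. The connection URLs and table names are placeholders, and a real audit would compare far more than row counts:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CountAudit {
    static long count(String jdbcUrl, String sql) throws Exception {
        try (Connection c = DriverManager.getConnection(jdbcUrl);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(sql)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws Exception {
        String day = "2015-10-25";
        long source = count("jdbc:mysql://platform-db/games",       // placeholder source (platform)
                "SELECT COUNT(*) FROM games_played WHERE date = '" + day + "'");
        long target = count("jdbc:hive2://warehouse:10000/default", // placeholder target (warehouse)
                "SELECT COUNT(*) FROM games_played WHERE date = '" + day + "'");
        System.out.println(source == target
                ? "OK: " + source + " rows on both sides"
                : "MISMATCH: source=" + source + " target=" + target);
    }
}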
68. AWS Infrastructure Today
[Diagram] Components include RDS, EMR, EC2, DynamoDB (point data store), and Platfora. Workloads include ETL, loading, auditing ETL, telemetry collectors, a data dictionary, a metastore, the ETL app DB, analytics / Hue, data science, fraud, Rocana (real-time dashboard), Solr (real time), and a Point Data Service. S3 is the source of "truth"; networking runs through a VPC over multiple AWS Direct Connect links.
70. SEAN'S PRO TIPS OF THE DAY (DO and DON'T)
- Don't wait: create S3 permissions and naming standards early
- Get an auditing solution for DW accuracy
- Allocate time for tuning AWS infrastructure
- Don't forget to track cost: AWS bills can surprise you
- Don't underestimate simple problems in big data
- Prepare for multiple data access patterns
- Keep idempotency in mind and use MQ architecture
- Don't stop. Believing
71. CHAMPION MASTERY
- Custom rewards for mastering different champions
- An intensive query that spans every game that every player has played
- Improves player engagement
72. PLAYER SUPPORT
- Full copy of our data warehouse in DynamoDB
- Hive -> DynamoDB dynamic partition support
- Player Support can answer questions faster than ever
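The deck doesn't spell out the copy mechanism, but one public route for a Hive-to-DynamoDB copy is the EMR DynamoDB connector's Hive storage handler. A sketch over JDBC, staying consistent with the earlier snippets; the endpoint, table, and column names are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class WarehouseToDynamo {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:hive2://warehouse:10000/default");
             Statement s = c.createStatement()) {
            // External Hive table backed by a DynamoDB table (EMR DynamoDB connector).
            s.execute("CREATE EXTERNAL TABLE IF NOT EXISTS games_played_ddb "
                + "(game_id string, player_id string, stats string) "
                + "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' "
                + "TBLPROPERTIES ('dynamodb.table.name' = 'games_played', "
                + "'dynamodb.column.mapping' = 'game_id:game_id,player_id:player_id,stats:stats')");
            // Copy warehouse rows into DynamoDB via plain HiveQL.
            s.execute("INSERT OVERWRITE TABLE games_played_ddb "
                + "SELECT game_id, player_id, stats FROM games_played");
        }
    }
}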
73. OFFENSIVE CHAT DETECTION
- The data science team queries all chat messages in game
- Sentiment analysis and classification
- Identifies negative, offensive players and mutes them automatically