Slides from a talk I gave at Scandinavian Developers Conference 2012 on the architecture of Spotify. The slides follow the story of playing a track and the steps it takes to get there.
In this talk Emil Fredriksson and David Poblador i Garcia explain how Spotify builds its infrastructure in order to deliver millions of songs to millions of users.
We explain how we manage to support our development teams to build features by developing a highly scalable infrastructure.
The Evolution of Hadoop at Spotify - Through Failures and Pain (Rafał Wojdyła)
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
Discover how the world of big data is evolving and becoming faster, more reliable, and better organized, powering many of the cooler new features that you see in the client today!
Algorithmic Music Recommendations at Spotify (Chris Johnson)
In this presentation I introduce various machine learning methods that we use for music recommendation and discovery at Spotify. Specifically, I focus on implicit matrix factorization for collaborative filtering: how to implement a small-scale version using Python, NumPy, and SciPy, and how to scale up to 20 million users and 24 million songs using Hadoop and Spark.
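The abstract above describes implicit matrix factorization with alternating least squares. A toy, single-machine sketch of that loop in NumPy might look like this (the function name, hyperparameters, and toy data are illustrative, not Spotify's actual code):

```python
import numpy as np

def implicit_als(R, factors=2, alpha=40.0, reg=0.1, iters=10, seed=0):
    """Tiny implicit-feedback matrix factorization (Hu/Koren/Volinsky-style ALS).

    R is a (users x items) play-count matrix; returns user and item factors."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = 0.1 * rng.standard_normal((n_users, factors))  # user factors
    Y = 0.1 * rng.standard_normal((n_items, factors))  # item factors
    C = 1.0 + alpha * R           # confidence in each observation
    P = (R > 0).astype(float)     # binary preference (played vs. not)
    I = reg * np.eye(factors)
    for _ in range(iters):
        # Solve for each user's factors with item factors fixed, then vice versa.
        for u in range(n_users):
            Cu = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T @ Cu @ Y + I, Y.T @ Cu @ P[u])
        for i in range(n_items):
            Ci = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T @ Ci @ X + I, X.T @ Ci @ P[:, i])
    return X, Y

# Toy play counts: 4 listeners x 5 tracks.
R = np.array([[5, 3, 0, 0, 1],
              [4, 0, 0, 1, 0],
              [0, 0, 5, 4, 0],
              [0, 1, 4, 5, 0]], dtype=float)
X, Y = implicit_als(R, factors=2)
scores = X @ Y.T  # predicted preference for every (listener, track) pair
```

Scaling this to Spotify's sizes is exactly what the Hadoop/Spark part of the talk is about: the two inner loops parallelize over users and items.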
What’s New in the Upcoming Apache Spark 3.0 (Databricks)
Learn about the latest developments in the open-source community with Apache Spark 3.0 and DBR 7.0. The upcoming Apache Spark™ 3.0 release brings new capabilities and features to the Spark ecosystem. In this online tech talk from Databricks, we will walk through updates in the Apache Spark 3.0.0-preview2 release as part of our new Databricks Runtime 7.0 Beta, which is now available.
How Apache Drives Music Recommendations At Spotify (Josh Baer)
The slides go through the high-level process of generating personalized playlists for all Spotify's users, using Apache big data products extensively.
Presentation given at Apache: Big Data Europe conference on September 29th, 2015 in Budapest.
How Spotify Builds Products (Organization, Architecture, Autonomy, Accountabi...) (Kevin Goldsmith)
This was an extended version of the talk that I gave at InfoShare 2016 in Gdańsk. This version was presented at ao.com and Think Money in Manchester, UK, in May 2016. It is a remix of several earlier talks and some new content that ties Spotify's autonomy and continuous-improvement culture to its data-driven product development approach, to show the complete picture. As usual, I tend to talk over the slides instead of putting a lot of the content into the slides themselves, so sorry if these don't have all the info.
Terraform is the infrastructure-as-code tool that has dominated the market in recent years. However, one of its main shortcomings is the lack of proactive drift management. In this talk, we will see how to reconcile the actual state of the infrastructure with the Terraform code that describes it, using the GitOps method.
You have built an event-driven system leveraging Apache Kafka. Now you face the challenge of integrating traditional synchronous request-response capabilities, such as user interaction, through an HTTP web service.
There are various techniques, each with advantages and disadvantages. This talk discusses multiple options on how to do a request-response over Kafka — showcasing producers and consumers using single and multiple topics, and more advanced considerations using the interactive queries of ksqlDB and Kafka Streams.
Advanced considerations discussed:
What a consumer rebalance means to your active request-responses.
Options for blocking for the async response in the web service.
How can the CQRS (Command Query Responsibility Segregation) be leveraged with the interactive state stores of Kafka Streams and ksqlDB?
Interactive queries of the ksqlDB and Kafka Streams state stores are not available during a rebalance. What is the active Kafka development happening that will make interactive queries a more feasible option?
Would a custom state store help with rebalancing limitations?
Can custom partitioning be used for proper routing, and what impacts could that have to the other services in your ecosystem?
We will explore the above considerations with an interactive quiz application built using Apache Kafka, Kafka Streams, and ksqlDB. With a proper implementation in place, your request-response application can scale and be performant along with handling all of the requests.
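The correlation-ID pattern the abstract describes can be sketched in a few lines. This is a minimal in-memory stand-in, not a real deployment: the `topics` dict simulates Kafka topics, and the topic names and helper functions are illustrative. A real system would use Kafka producers and consumer groups, and the blocking wait would need a timeout and rebalance handling as the talk discusses.

```python
import uuid
from collections import defaultdict, deque

# In-memory stand-ins for Kafka topics (illustrative only; a real system
# would produce to and consume from actual 'requests'/'responses' topics).
topics = defaultdict(deque)

def send(topic, key, value):
    topics[topic].append((key, value))

def poll(topic):
    return topics[topic].popleft() if topics[topic] else None

def handle_request(question):
    """Web service side: produce a request tagged with a correlation ID."""
    corr_id = str(uuid.uuid4())
    send("quiz-requests", corr_id, question)
    return corr_id

def worker_step():
    """Consumer side: process one request, produce the matching response."""
    msg = poll("quiz-requests")
    if msg:
        corr_id, question = msg
        send("quiz-responses", corr_id, f"answer to: {question}")

def await_response(corr_id):
    """Web service blocks until the response carrying its correlation ID
    arrives (a real implementation needs a timeout and must survive
    consumer rebalances, which may reassign the response partition)."""
    while True:
        msg = poll("quiz-responses")
        if msg and msg[0] == corr_id:
            return msg[1]

cid = handle_request("capital of Sweden?")
worker_step()
print(await_response(cid))  # prints: answer to: capital of Sweden?
```

Custom partitioning, mentioned above, is what lets the response land on the same web-service instance that is blocking for it.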
Zillow's favorite big data & machine learning tools (njstevens)
This talk covers Zillow's favorite tools for keeping track of research, cluster computing, open-source machine learning, workflow management, logging, deep learning, and data storage.
The Five Stages of Enterprise Jupyter Deployment (Frederick Reiss)
Meetup talk from May 30, 2018.
Jupyter notebooks are an important tool for data science. For a single user on a laptop, these notebooks are a simple, straightforward tool. But Jupyter in the enterprise is a much more complex affair. Enterprises have large teams of data scientists who need to run their notebooks atop scalable compute infrastructure with secure, audited access to massive, proprietary data sets; all while keeping hardware costs down.
Here at IBM’s Center for Open-Source Data and AI Technologies, we’ve seen multiple enterprise rollouts of Jupyter notebooks, both first-hand, in IBM products and services; and second-hand, in our discussions with other members of the Jupyter community.
In this talk, we merge together the stories of these projects and walk through the process of deploying high-performance, secure, multitenant Jupyter notebooks in an enterprise setting. Our goal here is to inform others who may be at the beginning of this journey about what is coming and how to navigate the challenges ahead.
Along the way, we answer five important questions: What are Jupyter notebooks? What makes Jupyter so attractive to data scientists? Why is deploying Jupyter in the enterprise difficult? What are your deployment options today? And, what are the tradeoffs of those approaches?
We’ll finish with a description of how IBM and other members of the Jupyter community are working to reduce those tradeoffs with the Jupyter Enterprise Gateway project. Finally, we’ll give a demonstration of multitenant Jupyter notebooks in action.
This talk is aimed at enterprise architects who need to support growing data science teams with multi-user deployments of Jupyter. No knowledge of data science is required.
Unlock Value from Big Data with Apache NiFi and Streaming CDC (Hortonworks)
Apache NiFi is an easy-to-use, powerful, and reliable system for processing and distributing data. It provides an end-to-end platform that can collect, curate, analyze, and act on data in real time, on premises or in the cloud, with a drag-and-drop visual interface. It is being used across industries on large amounts of data that had been stored in isolation, which made collaboration and analysis difficult.
Join industry experts from Hortonworks and Attunity as they explain how Apache NiFi and streaming CDC technology provide a distributed, resilient platform for unlocking the value of data in new ways.
Approximate nearest neighbor methods and vector models – NYC ML meetup (Erik Bernhardsson)
Nearest neighbors refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest k neighbors quickly.
This presentation covers Annoy, a library I built that helps you do (approximate) nearest neighbor queries in high-dimensional spaces. We'll go through vector models, how to measure similarity, and why nearest neighbor queries are useful.
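As a point of reference for what Annoy approximates, here is the exact brute-force version of the problem in NumPy. This baseline is my illustration, not from the slides; Annoy's contribution is avoiding the full distance scan below by searching a forest of random-projection trees, trading a little accuracy for large speedups in high dimensions.

```python
import numpy as np

def nearest_neighbors(points, query, k):
    """Exact k-nearest neighbors by brute force: compute the distance to
    every point, then take the k smallest. O(n*d) per query, which is
    what approximate methods like Annoy are built to avoid."""
    d = np.linalg.norm(points - query, axis=1)  # Euclidean distance to each point
    return np.argsort(d)[:k]                    # indices of the k closest points

rng = np.random.default_rng(42)
points = rng.standard_normal((1000, 40))        # 1000 vectors in 40 dimensions
query = points[0] + 0.01 * rng.standard_normal(40)  # a slightly perturbed copy
print(nearest_neighbors(points, query, k=3))    # index 0 comes first
```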
A Day in the Life of a Cloud Network Engineer at Netflix - NET303 - re:Invent... (Amazon Web Services)
Netflix is big and dynamic. At Netflix, IP addresses mean nothing in the cloud. This is a big challenge with Amazon VPC Flow Logs. VPC Flow Log entries only present network-level information (L3 and L4), which is virtually meaningless. Our goal is to map each IP address back to an application, at scale, to derive true network-level insight within Amazon VPC. In this session, the Cloud Network Engineering team discusses the temporal nature of IP address utilization in AWS and the problem with looking at OSI Layer 3 and Layer 4 information in the cloud.
(BDT318) How Netflix Handles Up To 8 Million Events Per Second (Amazon Web Services)
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of zero loss while processing over 400 billion events daily. The session covers in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second during peak.
Data Streaming with Apache Kafka & MongoDB (confluent)
Explore the use-cases and architecture for Apache Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
Building Data Pipelines for Music Recommendations at Spotify (Vidhya Murali)
In this talk, we will get into the architectural and functional details as to how we build scalable and robust data pipelines for music recommendations at Spotify. We will also discuss some of the challenges and an overview of work to address these challenges.
How Spotify uses large scale Machine Learning running on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/
Real-Time Market Data Analytics Using Kafka Streams (confluent)
(Lei Chen, Bloomberg, L.P.) Kafka Summit SF 2018
At Bloomberg, we are building a streaming platform with Apache Kafka, Kafka Streams and Spark Streaming to handle high volume, real-time processing with rapid derivative market data. In this talk, we’ll share the experience of how we utilize Kafka Streams Processor API to build pipelines that are capable of handling millions of market movements per second with ultra-low latency, as well as performing complex analytics like outlier detection, source confidence evaluation (scoring), arbitrage detection and other financial-related processing.
We’ll cover:
-Our system architecture
-Best practices of using the Processor API and State Store API
-Dynamic gap session implementation
-Historical data re-processing practice in KStreams app
-Chaining multiple KStreams apps with Spark Streaming job
Architecting a modern financial institution - Lucas Cavalcanti (iMasters)
Lucas Cavalcanti - Principal Engineer, Nubank
Software systems tend to become harder and harder to evolve and maintain as time passes and more features are added.
At Nubank, after 4 years of evolution, adding new features is as easy as, or easier than, it was 3 years ago. In this talk we will explore the main characteristics that enabled this rapid and continuous evolution of features, for example: microservices with well-defined scope, asynchronous integration between services using Kafka, schema verification at integration points, and Clojure with functional programming.
Presented at InterCon 2018.
Why does Spotify use a microservices architecture? What are the benefits and challenges we've encountered? How does our organizational model support our architecture?
Video of the talk is posted on YouTube: https://youtu.be/7LGPeBgNFuU
Ads Personalization at Spotify - NYC Data Engineering 10/23 (Kinshuk Mishra)
Spotify engineers Kinshuk Mishra and Noel Cody share their experiences building personalized ad experiences for users through iterative engineering and product development. The slides explain their process of continuous problem discovery, hypothesis generation, product development, and experimentation. They deep-dive into the specific ad personalization problems Spotify is solving and explain their data infrastructure technology stack in detail. They also explain how they've tested various product hypotheses and iteratively evolved their infrastructure to keep up with product requirements.
Spotify strives for team autonomy and independence. This means that no team should be blocked by others, and each should be able to move as fast as it can. This autonomy is a challenge for managing centralised, coordinated experimentation infrastructure and analysis. This is a talk about how we approach experimentation alignment in a fast-moving company.
An experiment in connecting Internet Exchanges between 3 different countries (APNIC)
An experiment in connecting Internet Exchanges between 3 different countries, by Johar Alam Rangkuti.
A presentation given at the APNIC 40 Opening Ceremony and Keynotes session on Tue, 8 Sep 2015.
Insights driven design at Spotify - meetup talk (Oscar Carlsson)
When designing for everyone you are designing for no one.
In this design-focused presentation, Björn and I tell you how we do insights-driven development at Spotify, focusing on the navigation example.
Spotify: Playing for millions, tuning for more (Nick Barkas)
Barcelona Developers Conference presentation by Nick Barkas and David Poblador i Garcia, 18 November 2011. How we manage a huge collection of servers and some of the technologies we use for building a scalable, high performance music streaming service.
This talk is an updated version of my earlier talk "Failing Up" that I presented at Tom Tom and at App Builders Switzerland, 2016. It's a talk about how to create a fail-safe environment for software companies and teams. It's critical to acknowledge that failure is necessary for innovation. So, if failure is a given, how do you fail well?
This version of the talk was first presented at Seattle Code Camp 2016