From Idea to Execution: Spotify's Discover Weekly - Chris Johnson
Discover Weekly is a mixtape of 30 highly personalized songs, curated and delivered to Spotify's 75M active users every Monday. It has received high acclaim in the press and reached 1B streams within its first 10 weeks. In this slide deck we dive into the narrative of how Discover Weekly came to be, highlighting technical challenges, data-driven development, and the Machine Learning models used to power our recommendations engine.
Presented at the Machine Learning class at Chalmers, Gothenburg.
http://www.cse.chalmers.se/research/lab/courses.php?coid=9
Trying to connect their theoretical machine learning class with industry examples.
Building Data Pipelines for Music Recommendations at Spotify - Vidhya Murali
In this talk, we will get into the architectural and functional details as to how we build scalable and robust data pipelines for music recommendations at Spotify. We will also discuss some of the challenges and an overview of work to address these challenges.
Algorithmic Music Recommendations at Spotify - Chris Johnson
In this presentation I introduce various Machine Learning methods that we utilize for music recommendations and discovery at Spotify. Specifically, I focus on Implicit Matrix Factorization for Collaborative Filtering, how to implement a small scale version using python, numpy, and scipy, as well as how to scale up to 20 Million users and 24 Million songs using Hadoop and Spark.
From the NYC Machine Learning meetup on Jan 17, 2013: http://www.meetup.com/NYC-Machine-Learning/events/97871782/
Video is available here: http://vimeo.com/57900625
Spotify Discover Weekly: The machine learning behind your music recommendations - Sophia Ciocca
In this presentation, I give an overview of the machine learning algorithms behind Spotify’s extraordinarily popular Discover Weekly playlist. I provide a brief introduction to what the playlist is, explain how music recommendation engines have evolved over time, then break down the three main algorithm types powering Spotify’s recommendations: (1) collaborative filtering, (2) Natural Language Processing (NLP), and (3) Raw audio analysis.
Video of the presentation can be found here: https://www.youtube.com/watch?v=PUtYNjInopA
Scala Data Pipelines for Music Recommendations - Chris Johnson
Are you still building data pipelines with Java and Python? Are you curious about the current buzz in the Big Data community surrounding Scala as a data processing environment? In this talk I'll discuss how Spotify migrated its music recommendations pipeline from Python to Scala. I'll dive into the language specific features that make Scala the ideal candidate for big data processing as well as highlight the rich set of tools and APIs that we take advantage of to process music recommendations for our 50 Million active users including Scalding, Breeze, Kafka, Spark, Parquet, Driven and Zeppelin.
Machine Learning and Big Data for Music Discovery at Spotify - Ching-Wei Chen
Spotify is the world’s largest on-demand music streaming company, with over 100 million active users who generate around 2TB of interaction data every day. With over 30 million songs to choose from, discovery and personalization play an essential role in helping users discover the best music for them. In this talk, given at the newly opened Galvanize space in NYC in March 2017, we’ll explain how Spotify uses Latent Space Models and Deep Learning to power features such as Discover Weekly and Release Radar.
How Spotify uses large scale Machine Learning running on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/
Music Recommendations at Scale with Spark - Chris Johnson
Spotify uses a range of Machine Learning models to power its music recommendation features including the Discover page, Radio, and Related Artists. Due to the iterative nature of these models they are a natural fit to the Spark computation paradigm and suffer from the IO overhead incurred by Hadoop. In this talk, I review the ALS algorithm for Matrix Factorization with implicit feedback data and how we’ve scaled it up to handle 100s of Billions of data points using Scala, Breeze, and Spark.
- Music streaming insights and numbers from Deezer perspective
- Working with B2B2C partnerships
- Deezer content strategy
- Music industry in numbers
- Future projections
Part of my guest lecture on Data Driven Business Models at Stockholm School of Entrepreneurship. I spoke about how Data is core to the Spotify business and it drives Spotify forward.
These are the slides of a talk about some of our research at Spotify, as part of the celebration kickoff of Chalmers AI Research Centre in Gothenburg. I always like to make a story in my talk, and this time I wanted to reflect on the "push" (think recommender system) and "pull" (think search) paradigms. I am using this quote from Nicholas Belkin and Bruce Croft from their Communications of the ACM article published in 1992 to frame my story: "We conclude that information retrieval and information filtering are indeed two sides of the same coin. They work together to help people get the information needed to perform their tasks."
These are the slides of my talk at the 2019 Netflix Workshop on Personalization, Recommendation and Search (PRS). This talk is based on previous talks on research we are doing at Spotify, but here I focus on the work we do on personalizing Spotify Home, with respect to success, intent & diversity. The link to the workshop is https://prs2019.splashthat.com/. This is research from various people at Spotify, and has been published at RecSys 2018, CIKM 2018 and WWW (The Web Conference) 2019.
Today, I had the great honor of giving the opening keynote at the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2020), held virtually. HCOMP is the home of the human computation and crowdsourcing community working on frameworks, methods and systems that bring together people and machine intelligence to achieve better results. I decided to totally revamp a previous talk to focus on the so-called "human in the loop" and showed how we incorporate humans in the loop to personalise at scale, drawing on some of the research at Spotify. Sharing the slides for general interest.
These are the slides I used for my talk at the BIG Track at the Web Conference 2019. This is a very similar talk to the one I gave at the celebration kickoff of Chalmers AI Research Centre in Gothenburg in March 2019. It has a bit more material and reflects some of the most recent work we are doing at Spotify Research. I am posting these again as people are asking for the slides. Thank you.
The Evolution of Hadoop at Spotify - Through Failures and Pain - Rafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
How Apache Drives Music Recommendations At Spotify - Josh Baer
The slides go through the high-level process of generating personalized playlists for all Spotify's users, using Apache big data products extensively.
Presentation given at Apache: Big Data Europe conference on September 29th, 2015 in Budapest.
These are the slides of my invited talk at the REVEAL workshop at RecSys 2019. The workshop focuses on the offline evaluation for recommender systems, and this year’s focus was on Reinforcement Learning. Although not directly related to reinforcement learning, it is clear that there are connections to what research in reinforcement learning is attempting to achieve (defining the rewards) and metrics that are optimized by recommender systems. I presented various works and personal thoughts on how to develop metrics of user engagement, which recommender systems can optimize for. An important message was that, for recommender systems to work both in the short and the long-term, it is important to consider the heterogeneity of both user and content to formalise the notion of engagement, and in turn design the appropriate metrics to capture these and optimize for. One way to achieve this is to follow these four steps: 1) Understanding intents; 2) Optimizing for the right metric; 3) Acting on segmentation; and 4) Thinking about diversity.
A previous version of this talk was given at UMAP 2019. See https://www.slideshare.net/mounialalmas/metrics-engagement-personalization
Recommendation algorithms are intensively used to improve user experience and maximise revenue of several websites: ads targeting, recommendation of products on ecommerce websites, job offers you may be interested in, contact suggestion on social networks, ...
To do so, collaborative filtering algorithms are one of the most popular methods but several other approaches may be used.
In this presentation, we introduce the main strategies (model-based vs memory-based algorithms) for building a recommendation engine, and we describe how to implement a collaborative filtering algorithm with only 10 lines of code.
Some highlights from Recsys 2018 presented to my team at Schibsted. Note this is a "biased" summary based on personal interest and work related to my team.
This presentation focuses on Deep Learning (DL) concepts, such as neural networks, backprop, activation functions, and Convolutional Neural Networks, followed by a TypeScript-based code sample that replicates the Tensorflow playground. Basic knowledge of matrices is helpful for this session.
Spotify uses a range of Machine Learning models to power its music recommendation features including the Discover page and Radio. Due to the iterative nature of training these models they suffer from IO overhead of Hadoop and are a natural fit to the Spark programming paradigm. In this talk I will present both the right way as well as the wrong way to implement collaborative filtering models with Spark. Additionally, I will deep dive into how Matrix Factorization is implemented in the MLlib library.
This presentation focuses on Deep Learning (DL) concepts, such as neural networks, backprop, activation functions, and Convolutional Neural Networks, with a short introduction to D3, followed by a TypeScript-based code sample that replicates the TensorFlow playground. Basic knowledge of matrices is helpful.
MS CS - Selecting Machine Learning Algorithm - Kaniska Mandal
ML Algorithms usually solve an optimization problem such that we need to find parameters for a given model that minimizes
— Loss function (prediction error)
— Model simplicity (regularization)
Cite References.Classification in Discriminant Analysis Discussi.docxclarebernice
Cite References.
Classification in Discriminant Analysis Discussion
· Discuss the concept of classification as regards to discriminant analysis.
· How do the results of the classification table tell us if we have a good or poor model for our data?
from media import *

#useful functions to manipulate sounds

#Changing a sound volume by changing the amplitude
def changeVolume (sound, factor):
    for sample in getSamples(sound):
        value = getSampleValue (sample)
        setSampleValue (sample, value * factor)

#Normalize a sound to a maximum amplitude
def normalize (sound):
    largest = 0
    for s in getSamples(sound):
        largest = max (largest, getSampleValue(s))
    multiplier = 32767.0 / largest
    print "Largest sample value in original sound was " , largest
    print "Multiplier is " , multiplier
    for s in getSamples(sound):
        louder = multiplier * getSampleValue(s)
        setSampleValue (s, louder)

#clipping a sound: we will clip a sound and copy the samples from end to start
def clip (source, start, end):
    target = makeEmptySound (end - start + 1) #create a new empty sound of this size
    targetIndex = 0
    for sourceIndex in range (start, end + 1):
        sourceValue = getSampleValueAt(source, sourceIndex)
        setSampleValueAt (target, targetIndex, sourceValue)
        targetIndex = targetIndex + 1
    return target

#Copying a sound: we will copy a source sound into a target sound starting at index start
def copy (source, target, start):
    targetIndex = start
    for sourceIndex in range (0, getLength(source)):
        sourceValue = getSampleValueAt (source, sourceIndex)
        setSampleValueAt(target, targetIndex, sourceValue)
        targetIndex = targetIndex + 1

#reversing sounds: take a source sound and produce a sound with all the samples in reverse order
def reverse(source):
    target = makeEmptySound (getLength(source))
    sourceIndex = getLength (source) - 1 #we start at the end of the source sound, with the last sample
    for targetIndex in range (0, getLength(target)):
        sourceValue = getSampleValueAt(source, sourceIndex)
        setSampleValueAt (target, targetIndex, sourceValue)
        sourceIndex = sourceIndex - 1
    return target

#mirroring sounds
def mirrorSound (sound):
    len = getLength(sound)
    mirrorPoint = len / 2
    for index in range (0, mirrorPoint):
        left = getSampleObjectAt(sound, index)
        right = getSampleObjectAt(sound, len - index - 1)
        value = getSampleValue(left)
        setSampleValue(right, value)
    play(sound)
Laboratory Exercise 6: Manipulating Sounds
Overview
The objectives of this lab are:
To review the for loop
To review the index system for arrays
To work with ranges
To review functions with parameters and return values
I. Changing Volume
Open JES and start up a new program. Save your file with a suitable name such as Lab6.py or
something similar. Add comments to the top that describe the file as a whole, such as:
# Lab 6: Manipulating Sounds
# Your Name and your partne ...
In this project I use a stack of denoising autoencoders to learn low-dimensional representations of images. These encodings are used as input to a locality sensitive hashing algorithm to find images similar to a given query image. The results clearly show that this approach outperforms basic LSH by far.
Monads and Monoids: from daily java to Big Data analytics in Scala
Finally, after two decades of evolution, Java 8 made a step towards functional programming. What can Java learn from other mature functional languages? How to leverage obscure mathematical abstractions such as Monad or Monoid in practice? Usually people find it scary and difficult to understand. Oleksiy will explain these concepts in simple words to give a feeling of powerful tool applicable in many domains, from daily Java and Scala routines to Big Data analytics with Storm or Hadoop.
Recommendation Engine Powered by Hadoop - Pranab Ghosh - BigDataCloud
Personalized recommendations are ubiquitous in social network and shopping sites these days. How do they do it? As long as enough user interaction data is available for items, e.g., products in shopping sites, a recommendation engine based on what's known as "Collaborative Filtering" is not that difficult to build. Since the solution causes a combinatorial explosion, Hadoop can play a critical role in processing massive amounts of data in collaborative filtering based solutions. In this presentation, I will cover a Hadoop based recommendation engine implementation using collaborative filtering.
A fast-paced introduction to Deep Learning (DL) concepts, such as neural networks, back propagation, activation functions, and CNNs. We'll also look at JavaScript-based toolkits (such as TensorFire and deeplearning.js) that leverage the power of WebGL. Basic knowledge of elementary calculus (e.g., derivatives) is recommended in order to derive the maximum benefit from this session.
A Gentle Introduction to Coding ... with Python - Tariq Rashid
A gentle introduction to coding (programming) for complete beginners. Starting from the basics - electrical wires - proceeding through variables, data structures, loops, functions, and exploring libraries for visualisation and specialist tools. Finally we use flask to make a very simple twitter clone web application.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Machine learning @ Spotify - Madison Big Data Meetup
1. Machine Learning & Big Data @ Spotify
Andy Sloane
@a1k0n
http://a1k0n.net
Madison Big Data Meetup
Jan 27, 2015
2. Big data?
60M Monthly Active Users (MAU)
50M tracks in our catalog
...But many are identical copies from different
releases (e.g. US and UK releases of the same
album)
...and only 4M unique songs have been listened to
>500 times
3. Big data?
Raw material: application logs, delivered via Apache
Kafka
Wake Me Up by Avicii has been played 330M times, by
~6M different users
"EndSong": 500GB / day
...But aggregated per-user play counts for a whole
year fit in ~60GB ("medium data")
4. Hadoop @ Spotify
900 nodes (all in London datacenter)
34 TB RAM total
~16000 typical concurrent tasks (mappers/reducers)
2GB RAM per mapper/reducer slot
5. What do we need ML for?
Recommendations
Related Artists
Radio
10. Collaborative filtering
Great, but how does that actually work?
Each time a user plays something, add it to a matrix
Compute similarity, somehow, between items based on
who played what
11. Collaborative filtering
So compute some distance between every pair of rows
and columns
That's just $O(60\mathrm{M}^2 / 2) = O(1.8 \times 10^{15})$ operations... O_O
We need a better way...
(BTW: Twitter has a decent approximation that can actually make this work, called DIMSUM:
https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum)
I've tried it but don't have results to report here yet :(
12. Collaborative filtering
Latent factor models
Instead, we use a "small" representation for each user &
item: $f$-dimensional vectors (here, $f = 2$)
and approximate the big matrix with it.
13. Why vectors?
Very compact representation of musical style or user's
taste
Only like 40-200 elements (2 shown above for
illustration)
14. Why vectors?
Dot product between items = similarity between items
Dot product between vectors = good/bad
recommendation
user x item
 2 x  4 =  8
-4 x  0 =  0
 2 x -2 = -4
-1 x  5 = -5
    sum = -1
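The worked example above can be reproduced in a few lines. This is a minimal sketch in plain Python; the helper name `dot` is only illustrative, and the two vectors are the ones from the slide.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

user = [2, -4, 2, -1]   # user vector from the slide
item = [4, 0, -2, 5]    # item vector from the slide

score = dot(user, item)  # 8 + 0 - 4 - 5
print(score)             # -1
```

A negative score like this one suggests a poor match between this user and this item.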
17. Implicit Matrix Factorization
Hu, Koren, Volinsky - Collaborative Filtering for Implicit
Feedback Datasets
Tries to predict whether user $u$ listens to item $i$:
$$P = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix} \approx X Y^T$$
$Y$ is all item vectors, $X$ is all user vectors
"implicit" because users don't tell us what they like, we
only observe what they do/don't listen to
18. Implicit Matrix Factorization
Goal: make $x_u \cdot y_i$ close to 1 for things each user has
listened to, 0 for everything else.
$x_u$: user $u$'s vector
$y_i$: item $i$'s vector
$p_{ui}$: 1 if user $u$ played item $i$, 0 otherwise
$c_{ui}$: "confidence", ad-hoc weight based on number of
times user $u$ played item $i$; e.g., $1 + \alpha \cdot \mathrm{count}$
$\lambda$: regularization penalty to avoid overfitting
Minimize:
$$\sum_{u,i} c_{ui} \left(p_{ui} - x_u^T y_i\right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$$
19. Alternating Least Squares
Solution: alternate solving for all users $x_u$:
$$x_u = \left(Y^T Y + Y^T (C^u - I) Y + \lambda I\right)^{-1} Y^T C^u p_{u\cdot}$$
and all items $y_i$:
$$y_i = \left(X^T X + X^T (C^i - I) X + \lambda I\right)^{-1} X^T C^i p_{\cdot i}$$
$Y^T Y$ = $f \times f$ matrix, sum of outer products of all items
$Y^T (C^u - I) Y$ = same, except only items the user played
$Y^T C^u p_{u\cdot}$ = weighted $f$-dimensional sum of items the
user played
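The user-side update can be sketched with numpy as below. This is an illustrative sketch, not the production job: the function name `solve_user`, the toy data, and the default values of `alpha` and `lam` are assumptions made for the example. It accumulates the three terms listed above by looping only over the items the user played.

```python
import numpy as np

def solve_user(Y, play_counts, alpha=1.0, lam=0.1):
    """Solve for one user's vector while the item vectors Y stay fixed.

    Y           : (n_items, f) array of item vectors
    play_counts : dict mapping item index -> play count (played items only)
    alpha, lam  : confidence scale and L2 penalty (illustrative values)
    """
    f = Y.shape[1]
    # Y^T Y + lambda*I is the same for every user, so a real job would
    # compute it once per iteration and share it.
    A = Y.T @ Y + lam * np.eye(f)
    b = np.zeros(f)
    for i, count in play_counts.items():
        c_ui = 1.0 + alpha * count               # confidence weight
        y_i = Y[i]
        A += (c_ui - 1.0) * np.outer(y_i, y_i)   # Y^T (C^u - I) Y term
        b += c_ui * y_i                          # Y^T C^u p_u term (p_ui = 1)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))      # 6 toy items, f = 2
play_counts = {1: 3, 4: 1}       # this user played items 1 and 4
x_u = solve_user(Y, play_counts)
print(x_u)
```

Only the played items enter the loop, which is why the per-user work scales with the number of plays rather than with the size of the catalog.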
20. Alternating Least Squares
Key point: each iteration is linear in size of input, even
though we are solving for all users x all items, and needs
only $f^2$ memory to solve
No learning rates, just a few tunable parameters ($f$, $\lambda$, $\alpha$)
All you do is add stuff up, solve an $f \times f$ matrix problem,
and repeat!
We use $f = 40$ dimensional vectors for
recommendations
Matrix/vector math using numpy in Python, breeze in
Scala
21. Alternating Least Squares
Adding lots of stuff up
Problem: any user (60M) can play any item (4M)
thus we may need to add any user's vector to any
item's vector
If we put user vectors in memory, it takes a lot of RAM!
Worst case: 60M users * 40 dimensions * sizeof(float) =
9.6GB of user vectors
...too big to fit in a mapper slot on our cluster
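The worst-case figure above is plain arithmetic and can be checked as follows. This sketch assumes 4-byte (32-bit) floats and decimal gigabytes; the 2GB slot size is the one quoted on the Hadoop slide.

```python
users = 60_000_000
dims = 40
bytes_per_float = 4          # assumption: 32-bit floats

total_bytes = users * dims * bytes_per_float
total_gb = total_bytes / 1e9

slot_gb = 2                  # RAM per mapper/reducer slot
fits_in_slot = total_gb <= slot_gb

print(total_bytes, fits_in_slot)
```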
22. Adding lots of stuff up
Solution: Split the data into a K x L matrix
Most recent run made a 14 x 112 grid
23. One map shard
Input is a bunch of (user, item, count) tuples
user is the same modulo K for all users
item is the same modulo L for all items
e.g., if K = 4, mapper #1 gets users 1, 5, 9, 13, ...
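The modulo sharding described above can be sketched as follows, assuming integer user and item ids. The helper names `shard_of` and `partition` and the toy play data are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

def shard_of(user, item, K, L):
    """Grid cell for a play: users are split modulo K, items modulo L."""
    return (user % K, item % L)

def partition(tuples, K, L):
    """Group (user, item, count) tuples by their grid cell."""
    shards = defaultdict(list)
    for user, item, count in tuples:
        shards[shard_of(user, item, K, L)].append((user, item, count))
    return dict(shards)

plays = [(1, 10, 3), (5, 14, 1), (9, 10, 7), (2, 11, 2)]
shards = partition(plays, K=4, L=4)
for cell, rows in sorted(shards.items()):
    print(cell, rows)
```

With this layout, a mapper working on one cell only needs the user vectors whose id matches its row modulo K and the item vectors whose id matches its column modulo L, rather than the full set.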
24. Adding stuff up
Add up vectors from every (user, item, count) data point
Then flip users ↔ items and repeat!

    import numpy as np

    def mapper(self, input):  # Luigi-style python job
        user, item, count = parse(input)
        conf = AdHocConfidenceFunction(count)  # e.g. 1 + alpha*count
        # add up user vectors from previous iteration
        x = self.user_vectors[user]
        term1 = conf * x                       # part of X^T C^i p_i
        term2 = np.outer(x, x) * (conf - 1)    # part of X^T (C^i - I) X
        # term1 is a vector and term2 a matrix, so emit them as a pair
        yield item, (term1, term2)

    def reducer(self, item, terms):
        terms = list(terms)
        term1 = sum(t1 for t1, _ in terms)
        term2 = sum(t2 for _, t2 in terms)
        # self.YTY is the f x f Gram matrix of the vectors held fixed
        item_vector = np.linalg.solve(
            self.YTY + term2 + self.l2penalty * np.identity(self.dim),
            term1)
        yield item, item_vector
25. Alternating Least Squares
Implemented in Java Map-Reduce framework which
runs other models, too
After about 20 iterations, we converge
Each iteration takes about 20 minutes, so about 7-8
hours total
Recomputed from scratch weekly
User vectors recomputed daily, keeping items fixed
So we have vectors, now what?
26. Finding Recommendations
60M users x 4M recommendable items
For each user, how do we find the best items given
their vector?
Brute force is O(60M x 4M x 40) = O(9 peta-operations)!
Instead, use an approximation based on locality
sensitive hashing (LSH)
28. Annoy - github.com/spotify/annoy
Pre-built read-only database of item vectors
Internally, recursively splits the space with random hyperplanes
Nearby points are likely to be on the same side of a random split
Builds several random trees (a forest) for better
approximation
Given an $f$-dimensional query vector, finds similar items
in database
Index loads via mmap, so all processes on the same
machine share RAM
Queries are very, very fast, but approximate
Python implementation available, Java forthcoming
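The random-hyperplane idea can be illustrated with a toy forest in plain Python. This is a sketch of the general technique only, not Annoy's actual implementation or API: it assumes hyperplanes through the origin with Gaussian random normals, and every name here (`build_tree`, `leaf_for`, `query_forest`) is hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(ids, vectors, rng, leaf_size=4):
    """Recursively split item ids by which side of a random hyperplane they fall on."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    dim = len(vectors[ids[0]])
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    left = [i for i in ids if dot(vectors[i], normal) < 0]
    right = [i for i in ids if dot(vectors[i], normal) >= 0]
    if not left or not right:        # degenerate split: stop here
        return ("leaf", ids)
    return ("split", normal,
            build_tree(left, vectors, rng, leaf_size),
            build_tree(right, vectors, rng, leaf_size))

def leaf_for(tree, query):
    """Walk down one tree to the leaf on the query's side of each split."""
    while tree[0] == "split":
        _, normal, left, right = tree
        tree = left if dot(query, normal) < 0 else right
    return tree[1]

def query_forest(forest, vectors, query, n):
    """Union the candidate leaves from every tree, then rank by dot product."""
    found = set()
    for tree in forest:
        found.update(leaf_for(tree, query))
    return sorted(found, key=lambda i: dot(vectors[i], query), reverse=True)[:n]

rng = random.Random(0)
vectors = {i: [rng.gauss(0.0, 1.0) for _ in range(4)] for i in range(50)}
forest = [build_tree(list(vectors), vectors, rng) for _ in range(5)]
neighbors = query_forest(forest, vectors, vectors[3], n=5)
print(neighbors)
```

Each tree only inspects one leaf, so a query touches a small fraction of the items; using several trees reduces the chance that a true neighbor is missed because it fell on the other side of one unlucky split.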
29. Generating recommendations
Annoy index for all items is only 1.2GB
I have one on my laptop... Live demo!
Could serve up nearest neighbors at load time, but we
precompute Discover on Hadoop
30. Generating recommendations in parallel
Send annoy index in distributed cache, load it via mmap
in map-reduce process
Reducer loads vectors + user stats, looks up ANN,
generates recommendations.
32. Related Artists
Great for music discovery
Essential for finding believable reasons for latent
factor-based recommendations
When generating recommendations, run through a list
of related artists to find potential reasons
33. Similar items use cosine distance
Cosine is similar to dot product; just add a
normalization step
Helps "factor out" popularity from similarity
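The normalization step can be sketched in plain Python as below. The vectors are made up for illustration, and treating vector length as a stand-in for popularity is an assumption of this example, not something the slides state.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)

niche = [3.0, 4.0]       # hypothetical item vector
popular = [30.0, 40.0]   # same direction, ten times the length

print(dot(niche, niche), dot(niche, popular))       # raw scores differ by scale
print(cosine(niche, niche), cosine(niche, popular)) # scale no longer matters
```

The raw dot product rewards the longer vector, while cosine treats the two items as pointing in the same direction regardless of length.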
34. Related Artists
How we build it
Similar to user recommendations, but with more
models, not necessarily collaborative filtering based
Implicit Matrix Factorization (shown previously)
"Vector-Exp", similar model but probabilistic in
nature, trained with gradient descent
Google word2vec on playlists
Echo Nest "cultural similarity" — based on scraping
web pages about music!
Query ANNs to generate candidates
Score candidates from all models, combine and rank
Pre-build table of 20 nearest artists to each artist
36. Radio
ML-wise, exactly the same as Related Artists!
For each track, generate candidates with ANN from
each model
Score w/ all models, rank with ensemble
Store top 250 nearest neighbors in a database
(Cassandra)
User plays radio → load 250 tracks and shuffle
Thumbs up → load more tracks from the thumbed-up
song
Thumbs down → remove that song / re-weight tracks
38. Upcoming work
Audio fingerprint based content deduplication
~1500 Echo Nest Musical Fingerprints per track
Min-Hash based matching to accelerate all-pairs
similarity
Fast connected components using Hash-to-Min
algorithm - $O(\log d)$ mapreduce steps
http://arxiv.org/pdf/1203.5387.pdf
39. Thanks!
I can be reached here:
Andy Sloane
Email: andy@a1k0n.net
Twitter: @a1k0n
http://a1k0n.net
Special thanks to Erik Bernhardsson, whose slides I
plagiarized mercilessly