Are you still building data pipelines with Java and Python? Are you curious about the current buzz in the Big Data community surrounding Scala as a data processing environment? In this talk I'll discuss how Spotify migrated its music recommendations pipeline from Python to Scala. I'll dive into the language specific features that make Scala the ideal candidate for big data processing as well as highlight the rich set of tools and APIs that we take advantage of to process music recommendations for our 50 Million active users including Scalding, Breeze, Kafka, Spark, Parquet, Driven and Zeppelin.
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
In this talk, we will get into the architectural and functional details as to how we build scalable and robust data pipelines for music recommendations at Spotify. We will also discuss some of the challenges and an overview of work to address these challenges.
Music Recommendations at Scale with SparkChris Johnson
Spotify uses a range of Machine Learning models to power its music recommendation features including the Discover page, Radio, and Related Artists. Due to the iterative nature of these models they are a natural fit to the Spark computation paradigm and suffer from the IO overhead incurred by Hadoop. In this talk, I review the ALS algorithm for Matrix Factorization with implicit feedback data and how we’ve scaled it up to handle 100s of Billions of data points using Scala, Breeze, and Spark.
From Idea to Execution: Spotify's Discover WeeklyChris Johnson
Discover Weekly is a personalized mixtape of 30 highly personalized songs that's curated and delivered to Spotify's 75M active users every Monday. It's received high acclaim in the press and reached 1B streams within its first 10 weeks. In this slide deck we dive into the narrative of how Discover Weekly came to be, highlighting technical challenges, data driven development, and the Machine Learning models used to power our recommendations engine.
From the NYC Machine Learning meetup on Jan 17, 2013: http://www.meetup.com/NYC-Machine-Learning/events/97871782/
Video is available here: http://vimeo.com/57900625
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
In this talk, we will get into the architectural and functional details as to how we build scalable and robust data pipelines for music recommendations at Spotify. We will also discuss some of the challenges and an overview of work to address these challenges.
Music Recommendations at Scale with SparkChris Johnson
Spotify uses a range of Machine Learning models to power its music recommendation features including the Discover page, Radio, and Related Artists. Due to the iterative nature of these models they are a natural fit to the Spark computation paradigm and suffer from the IO overhead incurred by Hadoop. In this talk, I review the ALS algorithm for Matrix Factorization with implicit feedback data and how we’ve scaled it up to handle 100s of Billions of data points using Scala, Breeze, and Spark.
From Idea to Execution: Spotify's Discover WeeklyChris Johnson
Discover Weekly is a personalized mixtape of 30 highly personalized songs that's curated and delivered to Spotify's 75M active users every Monday. It's received high acclaim in the press and reached 1B streams within its first 10 weeks. In this slide deck we dive into the narrative of how Discover Weekly came to be, highlighting technical challenges, data driven development, and the Machine Learning models used to power our recommendations engine.
From the NYC Machine Learning meetup on Jan 17, 2013: http://www.meetup.com/NYC-Machine-Learning/events/97871782/
Video is available here: http://vimeo.com/57900625
Algorithmic Music Recommendations at SpotifyChris Johnson
In this presentation I introduce various Machine Learning methods that we utilize for music recommendations and discovery at Spotify. Specifically, I focus on Implicit Matrix Factorization for Collaborative Filtering, how to implement a small scale version using python, numpy, and scipy, as well as how to scale up to 20 Million users and 24 Million songs using Hadoop and Spark.
Machine Learning and Big Data for Music Discovery at SpotifyChing-Wei Chen
Spotify is the world’s largest on-demand music streaming company, with over 100 million active users who generate around 2TB of interaction data every day. With over 30 million songs to choose from, discovery and personalization play an essential role in helping users discover the best music for them. In this talk, given at the newly opened Galvanize space in NYC in March 2017, we’ll explain how Spotify uses Latent Space Models and Deep Learning to power features such as Discover Weekly and Release Radar.
How Spotify uses large scale Machine Learning running on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/
These are the slides of my talk at the 2019 Netflix Workshop on Personalization, Recommendation and Search (PRS). This talk is based on previous talks on research we are doing at Spotify, but here I focus on the work we do on personalizing Spotify Home, with respect to success, intent & diversity. The link to the workshop is https://prs2019.splashthat.com/. This is research from various people at Spotify, and has been published at RecSys 2018, CIKM 2018 and WWW (The Web Conference) 2019.
Spotify Discover Weekly: The machine learning behind your music recommendationsSophia Ciocca
In this presentation, I give an overview of the machine learning algorithms behind Spotify’s extraordinarily popular Discover Weekly playlist. I provide a brief introduction to what the playlist is, explain how music recommendation engines have evolved over time, then break down the three main algorithm types powering Spotify’s recommendations: (1) collaborative filtering, (2) Natural Language Processing (NLP), and (3) Raw audio analysis.
Video of the presentation can be found here: https://www.youtube.com/watch?v=PUtYNjInopA
Spotify uses a range of Machine Learning models to power its music recommendation features including the Discover page and Radio. Due to the iterative nature of training these models they suffer from IO overhead of Hadoop and are a natural fit to the Spark programming paradigm. In this talk I will present both the right way as well as the wrong way to implement collaborative filtering models with Spark. Additionally, I will deep dive into how Matrix Factorization is implemented in the MLlib library.
These are the slides of a talk about some of our research at Spotify, as part of the celebration kickoff of Chalmers AI Research Centre in Gothenburg. I always like to make a story in my talk, and this time I wanted to reflect on the "push" (think recommender system) and "pull" (think search) paradigms. I am using this quote from Nicholas Belkin and Bruce Croft from their Communications of the ACM article published in 1992 to frame my story: "We conclude that information retrieval and information filtering are indeed two sides of the same coin. They work together to help people get the information needed to perform their tasks."
Presented at the Machine Learning class at Chalmers, Gothenburg.
http://www.cse.chalmers.se/research/lab/courses.php?coid=9
Trying to connect their theoretical machine learning class with industry examples.
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
Interactive recommender systems enable the user to steer the received recommendations in the desired direction through explicit interaction with the system. In the larger ecosystem of recommender systems used on a website, it is positioned between a lean-back recommendation experience and an active search for a specific piece of content. Besides this aspect, we will discuss several parts that are especially important for interactive recommender systems, including the following: design of the user interface and its tight integration with the algorithm in the back-end; computational efficiency of the recommender algorithm; as well as choosing the right balance between exploiting the feedback from the user as to provide relevant recommendations, and enabling the user to explore the catalog and steer the recommendations in the desired direction.
In particular, we will explore the field of interactive video and music recommendations and their application at Netflix and Spotify. We outline some of the user-experiences built, and discuss the approaches followed to tackle the various aspects of interactive recommendations. We present our insights from user studies and A/B tests.
The tutorial targets researchers and practitioners in the field of recommender systems, and will give the participants a unique opportunity to learn about the various aspects of interactive recommender systems in the video and music domain. The tutorial assumes familiarity with the common methods of recommender systems.
These are the slides I used for my talk at the BIG Track at the Web Conference 2019. This is a very similar talk to what I gave at the celebration kickoff of Chalmers AI Research Centre in Gothenburg in March 2019. It has a bit more and reflect some of the most recent work we are doing at Spotify Research. I am posted these again as people are asking for the slides. Thank you.
Today, I had the big honor to give the opening keynote at the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2020), being held virtually. HCOMP is the home of the human computation and crowdsourcing community working on frameworks, methods and systems that bring together people and machine intelligence to achieve better results. I decided to totally revamp a previous talk to focus on so-called "human in the loop" and showed how we incorporate human in the loop to personalise at scale, with some of the research at Spotify. Sharing the slides for general interests.
Introduction to Factorization Machines model with an example. Motivations - why you should have it in your toolbox, model and it expressiveness, use case for context-aware recommendations and Field-Aware Factorization Machines.
by Harald Steck (Netflix Inc., US), Roelof van Zwol (Netflix Inc., US) and Chris Johnson (Spotify Inc., US)
Slides of the tutorial on interactive recommender systems at the 2015 conference on Recommender Systems (RecSys).
Interactive recommender systems enable the user to steer the received recommendations in the desired direction through explicit interaction with the system. In the larger ecosystem of recommender systems used on a website, it is positioned between a lean-back recommendation experience and an active search for a specific piece of content. Besides this aspect, we will discuss several parts that are especially important for interactive recommender systems, including the following: design of the user interface and its tight integration with the algorithm in the back-end; computational efficiency of the recommender algorithm; as well as choosing the right balance between exploiting the feedback from the user as to provide relevant recommendations, and enabling the user to explore the catalog and steer the recommendations in the desired direction.
In particular, we will explore the field of interactive video and music recommendations and their application at Netflix and Spotify. We outline some of the user-experiences built, and discuss the approaches followed to tackle the various aspects of interactive recommendations. We present our insights from user studies and A/B tests.
The tutorial targets researchers and practitioners in the field of recommender systems, and will give the participants a unique opportunity to learn about the various aspects of interactive recommender systems in the video and music domain. The tutorial assumes familiarity with the common methods of recommender systems.
DATE: Wednesday, Sept 16, 2015, 11:00-12:30
The Evolution of Hadoop at Spotify - Through Failures and PainRafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
Proud to have graduated from the University of Amsterdam MSc Information Studies program in Data Science, this was my final msc thesis presentation on my defense last July. The thesis answers to the contemporary complex problem of popularity prediction, which is highly discussed but not yet solved. An effort for optimizing feature engineering, using several regression models and answering to the question of "what makes a post popular" according to its content e.g. action, scene, people, animals etc., taking advantage in the meantime both computer vision techniques and text processing, is the summary of this work. My MSc Thesis was marked unanimously with 8/10 - one of the highest grades in UvA Informatics department - and was submitted for a publication in MMM2018 24th International Conference on Multimedia Modeling (Thailand, Bangkok).
Algorithmic Music Recommendations at SpotifyChris Johnson
In this presentation I introduce various Machine Learning methods that we utilize for music recommendations and discovery at Spotify. Specifically, I focus on Implicit Matrix Factorization for Collaborative Filtering, how to implement a small scale version using python, numpy, and scipy, as well as how to scale up to 20 Million users and 24 Million songs using Hadoop and Spark.
Machine Learning and Big Data for Music Discovery at SpotifyChing-Wei Chen
Spotify is the world’s largest on-demand music streaming company, with over 100 million active users who generate around 2TB of interaction data every day. With over 30 million songs to choose from, discovery and personalization play an essential role in helping users discover the best music for them. In this talk, given at the newly opened Galvanize space in NYC in March 2017, we’ll explain how Spotify uses Latent Space Models and Deep Learning to power features such as Discover Weekly and Release Radar.
How Spotify uses large scale Machine Learning running on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/
These are the slides of my talk at the 2019 Netflix Workshop on Personalization, Recommendation and Search (PRS). This talk is based on previous talks on research we are doing at Spotify, but here I focus on the work we do on personalizing Spotify Home, with respect to success, intent & diversity. The link to the workshop is https://prs2019.splashthat.com/. This is research from various people at Spotify, and has been published at RecSys 2018, CIKM 2018 and WWW (The Web Conference) 2019.
Spotify Discover Weekly: The machine learning behind your music recommendationsSophia Ciocca
In this presentation, I give an overview of the machine learning algorithms behind Spotify’s extraordinarily popular Discover Weekly playlist. I provide a brief introduction to what the playlist is, explain how music recommendation engines have evolved over time, then break down the three main algorithm types powering Spotify’s recommendations: (1) collaborative filtering, (2) Natural Language Processing (NLP), and (3) Raw audio analysis.
Video of the presentation can be found here: https://www.youtube.com/watch?v=PUtYNjInopA
Spotify uses a range of Machine Learning models to power its music recommendation features including the Discover page and Radio. Due to the iterative nature of training these models they suffer from IO overhead of Hadoop and are a natural fit to the Spark programming paradigm. In this talk I will present both the right way as well as the wrong way to implement collaborative filtering models with Spark. Additionally, I will deep dive into how Matrix Factorization is implemented in the MLlib library.
These are the slides of a talk about some of our research at Spotify, as part of the celebration kickoff of Chalmers AI Research Centre in Gothenburg. I always like to make a story in my talk, and this time I wanted to reflect on the "push" (think recommender system) and "pull" (think search) paradigms. I am using this quote from Nicholas Belkin and Bruce Croft from their Communications of the ACM article published in 1992 to frame my story: "We conclude that information retrieval and information filtering are indeed two sides of the same coin. They work together to help people get the information needed to perform their tasks."
Presented at the Machine Learning class at Chalmers, Gothenburg.
http://www.cse.chalmers.se/research/lab/courses.php?coid=9
Trying to connect their theoretical machine learning class with industry examples.
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
Interactive recommender systems enable the user to steer the received recommendations in the desired direction through explicit interaction with the system. In the larger ecosystem of recommender systems used on a website, it is positioned between a lean-back recommendation experience and an active search for a specific piece of content. Besides this aspect, we will discuss several parts that are especially important for interactive recommender systems, including the following: design of the user interface and its tight integration with the algorithm in the back-end; computational efficiency of the recommender algorithm; as well as choosing the right balance between exploiting the feedback from the user as to provide relevant recommendations, and enabling the user to explore the catalog and steer the recommendations in the desired direction.
In particular, we will explore the field of interactive video and music recommendations and their application at Netflix and Spotify. We outline some of the user-experiences built, and discuss the approaches followed to tackle the various aspects of interactive recommendations. We present our insights from user studies and A/B tests.
The tutorial targets researchers and practitioners in the field of recommender systems, and will give the participants a unique opportunity to learn about the various aspects of interactive recommender systems in the video and music domain. The tutorial assumes familiarity with the common methods of recommender systems.
These are the slides I used for my talk at the BIG Track at the Web Conference 2019. This is a very similar talk to what I gave at the celebration kickoff of Chalmers AI Research Centre in Gothenburg in March 2019. It has a bit more and reflect some of the most recent work we are doing at Spotify Research. I am posted these again as people are asking for the slides. Thank you.
Today, I had the big honor to give the opening keynote at the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2020), being held virtually. HCOMP is the home of the human computation and crowdsourcing community working on frameworks, methods and systems that bring together people and machine intelligence to achieve better results. I decided to totally revamp a previous talk to focus on so-called "human in the loop" and showed how we incorporate human in the loop to personalise at scale, with some of the research at Spotify. Sharing the slides for general interests.
Introduction to Factorization Machines model with an example. Motivations - why you should have it in your toolbox, model and it expressiveness, use case for context-aware recommendations and Field-Aware Factorization Machines.
by Harald Steck (Netflix Inc., US), Roelof van Zwol (Netflix Inc., US) and Chris Johnson (Spotify Inc., US)
Slides of the tutorial on interactive recommender systems at the 2015 conference on Recommender Systems (RecSys).
Interactive recommender systems enable the user to steer the received recommendations in the desired direction through explicit interaction with the system. In the larger ecosystem of recommender systems used on a website, it is positioned between a lean-back recommendation experience and an active search for a specific piece of content. Besides this aspect, we will discuss several parts that are especially important for interactive recommender systems, including the following: design of the user interface and its tight integration with the algorithm in the back-end; computational efficiency of the recommender algorithm; as well as choosing the right balance between exploiting the feedback from the user as to provide relevant recommendations, and enabling the user to explore the catalog and steer the recommendations in the desired direction.
In particular, we will explore the field of interactive video and music recommendations and their application at Netflix and Spotify. We outline some of the user-experiences built, and discuss the approaches followed to tackle the various aspects of interactive recommendations. We present our insights from user studies and A/B tests.
The tutorial targets researchers and practitioners in the field of recommender systems, and will give the participants a unique opportunity to learn about the various aspects of interactive recommender systems in the video and music domain. The tutorial assumes familiarity with the common methods of recommender systems.
DATE: Wednesday, Sept 16, 2015, 11:00-12:30
The Evolution of Hadoop at Spotify - Through Failures and PainRafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
Proud to have graduated from the University of Amsterdam MSc Information Studies program in Data Science, this was my final msc thesis presentation on my defense last July. The thesis answers to the contemporary complex problem of popularity prediction, which is highly discussed but not yet solved. An effort for optimizing feature engineering, using several regression models and answering to the question of "what makes a post popular" according to its content e.g. action, scene, people, animals etc., taking advantage in the meantime both computer vision techniques and text processing, is the summary of this work. My MSc Thesis was marked unanimously with 8/10 - one of the highest grades in UvA Informatics department - and was submitted for a publication in MMM2018 24th International Conference on Multimedia Modeling (Thailand, Bangkok).
Approximate nearest neighbor methods and vector models – NYC ML meetupErik Bernhardsson
Nearest neighbors refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest k neighbors quickly.
This presentation covers a library called Annoy built my me that that helps you do (approximate) nearest neighbor queries in high dimensional spaces. We're going through vector models, how to measure similarity, and why nearest neighbor queries are useful.
Approximate Nearest Neighbors and Vector Models by Erik BernhardssonHakka Labs
full talk video: https://www.hakkalabs.co/articles/approximate-nearest-neighbors-vector-models
Vector models are being used in a lot of different fields: natural language processing, recommender systems, computer vision, and other things. They are fast and convenient and are often state of the art in terms of accuracy. One of the challenges with vector models is that as the number of dimensions increase, finding similar items gets challenging. Erik developed a library called "Annoy" that uses a forest of random tree to do fast approximate nearest neighbor queries in high dimensional spaces. We will cover some specific applications of vector models with and how Annoy works.
Apache Hadoop has emerged as the storage and processing platform of choice for Big Data. In this tutorial, I will give an overview of Apache Hadoop and its ecosystem, with specific use cases. I will explain the MapReduce programming framework in detail, and outline how it interacts with Hadoop Distributed File System (HDFS). While Hadoop is written in Java, MapReduce applications can be written using a variety of languages using a framework called Hadoop Streaming. I will give several examples of MapReduce applications using Hadoop Streaming.
Lecture 9: Dimensionality Reduction, Singular Value Decomposition (SVD), Principal Component Analysis (PCA). (ppt,pdf)
Appendices A, B from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.
Modeling and Aggregation of Complex Annotations via Annotation Distances
- Alex Braylan and Matt Lease
- The University of Texas at Austin
ABSTRACT
Modeling annotators and their labels is valuable for ensuring col- lected data quality. Though many models have been proposed for binary or categorical labels, prior methods do not generalize to complex annotations (e.g., open-ended text, multivariate, or struc- tured responses) without devising new models for each specific task. To obviate the need for task-specific modeling, we propose to model distances between labels, rather than the labels them- selves. Our models are largely agnostic to the distance function; we leave it to the requesters to specify an appropriate distance func- tion for their given annotation task. We propose three models of annotation quality, including a Bayesian hierarchical extension of multidimensional scaling which can be trained in an unsupervised or semi-supervised manner. Results show the generality and effec- tiveness of our models across diverse complex annotation tasks: sequence labeling, translation, syntactic parsing, and ranking.
When building mobile applications performance is really crucial. If we compare React for web development with React-Native, the latter is much more restrictive in the sense that performance mistakes are sometimes far more reaching, even taking into account the golden quote of Donald Knuth “premature optimization is the root of all evil”. This talk is going to provide a deeper dive into principles and practices of making sure your app behaves as fast as possible. We will look at the common mistakes of component design, which lead to performance degradation. How to make lists more performant and animations more fluid. I will also introduce some of the tools, which might help you discover bottlenecks in your application.
A Production Quality Sketching Library for the Analysis of Big DataDatabricks
In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources to generate exact results, or don’t parallelize well.
Some highlights from Recsys 2018 presented to my team at Schibsted. Note this is a "biased" summary based on personal interest and work related to my team.
Scalable Recommendation Algorithms with LSHMaruf Aytekin
- Scalable recommendation algorithm based on Locality Sensitive Hashing (LSH) and Collaborative Filtering.
- Distributed implementation of LSH with Apache Spark.
Similar to Scala Data Pipelines for Music Recommendations (20)
Strategies for Successful Data Migration Tools.pptxvarshanayak241
Data migration is a complex but essential task for organizations aiming to modernize their IT infrastructure and leverage new technologies. By understanding common challenges and implementing these strategies, businesses can achieve a successful migration with minimal disruption. Data Migration Tool like Ask On Data play a pivotal role in this journey, offering features that streamline the process, ensure data integrity, and maintain security. With the right approach and tools, organizations can turn the challenge of data migration into an opportunity for growth and innovation.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Your Digital Assistant.
Making complex approach simple. Straightforward process saves time. No more waiting to connect with people that matter to you. Safety first is not a cliché - Securely protect information in cloud storage to prevent any third party from accessing data.
Would you rather make your visitors feel burdened by making them wait? Or choose VizMan for a stress-free experience? VizMan is an automated visitor management system that works for any industries not limited to factories, societies, government institutes, and warehouses. A new age contactless way of logging information of visitors, employees, packages, and vehicles. VizMan is a digital logbook so it deters unnecessary use of paper or space since there is no requirement of bundles of registers that is left to collect dust in a corner of a room. Visitor’s essential details, helps in scheduling meetings for visitors and employees, and assists in supervising the attendance of the employees. With VizMan, visitors don’t need to wait for hours in long queues. VizMan handles visitors with the value they deserve because we know time is important to you.
Feasible Features
One Subscription, Four Modules – Admin, Employee, Receptionist, and Gatekeeper ensures confidentiality and prevents data from being manipulated
User Friendly – can be easily used on Android, iOS, and Web Interface
Multiple Accessibility – Log in through any device from any place at any time
One app for all industries – a Visitor Management System that works for any organisation.
Stress-free Sign-up
Visitor is registered and checked-in by the Receptionist
Host gets a notification, where they opt to Approve the meeting
Host notifies the Receptionist of the end of the meeting
Visitor is checked-out by the Receptionist
Host enters notes and remarks of the meeting
Customizable Components
Scheduling Meetings – Host can invite visitors for meetings and also approve, reject and reschedule meetings
Single/Bulk invites – Invitations can be sent individually to a visitor or collectively to many visitors
VIP Visitors – Additional security of data for VIP visitors to avoid misuse of information
Courier Management – Keeps a check on deliveries like commodities being delivered in and out of establishments
Alerts & Notifications – Get notified on SMS, email, and application
Parking Management – Manage availability of parking space
Individual log-in – Every user has their own log-in id
Visitor/Meeting Analytics – Evaluate notes and remarks of the meeting stored in the system
Visitor Management System is a secure and user friendly database manager that records, filters, tracks the visitors to your organization.
"Secure Your Premises with VizMan (VMS) – Get It Now"
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Modern design is crucial in today's digital environment, and this is especially true for SharePoint intranets. The design of these digital hubs is critical to user engagement and productivity enhancement. They are the cornerstone of internal collaboration and interaction within enterprises.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
1. January 6, 2015
Scala Data Pipelines for
Music Recommendations
Chris Johnson
@MrChrisJohnson
2. Who am I??
•Chris Johnson
– Machine Learning guy from NYC
– Focused on music recommendations
– Formerly a PhD student at UTAustin
3. Spotify in Numbers 3
•Started in 2006, now available in 58 markets
•50+ million active users, 15 million paying subscribers
•30+ million songs, 20,000 new songs added per day
•1.5 billion playlists
•1 TB user data logged per day
•900 node Hadoop cluster
•10,000+ Hadoop jobs run every day
5. How can we find good recommendations? 5
•Manual Curation
•Manually Tag Attributes
•Audio Content
•News, Blogs, Text analysis
•Collaborative Filtering
9. The Genre Toplist Problem 9
•Assume we have access to daily log data for all plays on Spotify.
•Goal: Calculate the top 1k artists on for each genre based on total daily plays
{"User": “userA”, "Date": “2015-01-10", "Artist": “Beyonce", "Track": "Halo", "Genres": ["Pop", "R&B", "Soul"]}
{"User": “userB”, "Date": “2015-01-10”, "Artist": "Led Zeppelin”, "Track": "Achilles Last Stand", "Genres": ["Rock",
"Blues Rock", "Hard Rock"]}
……….
11. 11
Scalding is a Scala library that makes it easy to specify Hadoop
MapReduce jobs. Scalding is built on top of Cascading, a Java
library that abstracts away low-level Hadoop details. Scalding is
comparable to Pig, but offers tight integration with Scala, bringing
advantages of Scala to your MapReduce jobs.
-Twitter
16. sortWithTake doesn’t fully sort 16
•Uses PriorityQueueMonoid from Algebird library
•What is a Monoid??
-Definition: A Set S and a binary operation • : S x S —> S such that
1. Associativity: For all a, b, and c in S the equation
(a • b) • c = a • (b • c) holds
2. Identity Element: There exists an element e in S such that for every
element a in S, the equations e • a = a • e = a hold
•Example: The natural numbers N under the addition operation.
(1 + 2) + 3 = 1 + (2 + 3)
0 + 1 = 1 + 0 = 1
class PriorityQueueMonoid[K](max : Int)(implicit ord :
Ordering[K]) extends Monoid[PriorityQueue[K]]
18. sortWithTake 18
•Uses PriorityQueueMonoid from Algebird
•Ok, great observation… but what’s the point of all this!??
-All monoid aggregations and reduces can begin on the Mapper side
and finish on the Reducer side since the order doesn’t matter!
-Scalding implicitly takes care of Mapper side combining and custom
combiner
-Reduces network traffic to reducers
class PriorityQueueMonoid[K](max : Int)(implicit ord :
Ordering[K]) extends Monoid[PriorityQueue[K]]
reduced traffic
20. How do we store track metadata? 20
•Lots of metadata associated with tracks (100+ columns!)
-artist, album, record label, genres, audio features, …
•Options:
1. Store each track as one long row with many columns
-Sending lots of data over network when you only need 1 or 2 columns
2. Store each column as a separate data source
-Jobs require costly joins, especially when requiring many columns
•Can we do better?..
21. Apache Parquet to the rescue! 21
•Apache Parquet is a columnar storage format available to any project in the Hadoop
ecosystem, regardless of the choice of data processing framework, data model or
programming language.
•Efficiently read a subset of columns without scanning the entire dataset
•Row group: A logical horizontal partitioning of the data into rows. There is no
physical structure that is guaranteed for a row group. A row group consists of a
column chunk for each column in the dataset.
•Column chunk: A chunk of the data for a particular column. These live in a particular
row group and is guaranteed to be contiguous in the file.
•Predicate push-down: Define predicates (<, >, <=, …) to filter out column chunks or
even full row groups, evaluated at Hadoop InputFormat layer before Avro conversion
23. Driven - job visualization and performance analytics 23
24. Luigi - data plumbing since 2012 24
•Workflow management framework developed by Spotify
•Python luigi configuration takes care of dependency resolution, job
scheduling, fault tolerance, etc.
•Support for Hive queries, MapReduce jobs, python snippets, Scalding,
Crunch, Spark, and more!
•Like Oozie but without all of the messy XML
https://github.com/spotify/luigi
27. So…. back to music recommendations! 27
•Manual Curation
•Manually Tag Attributes
•Audio Content
•News, Blogs, Text analysis
•Collaborative Filtering
28. Collaborative Filtering
28
Hey,
I like tracks P, Q, R, S!
Well,
I like tracks Q, R, S, T!
Then you should check out
track P!
Nice! Btw try track T!
Image via Erik Bernhardsson
29. Implicit Matrix Factorization 29
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
•Aggregate all (user, track) streams into a large matrix
•Goal: Approximate binary preference matrix by the inner product of 2 smaller matrices by
minimizing the weighted RMSE (root mean squared error) using a function of total plays as weight
•Why?: Once learned, the top recommendations for a user are the top inner products between
their latent factor vector in X and the track latent factor vectors in Y.
X YUsers
Songs
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• = item latent factor vector
30. Alternating Least Squares 30
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
X YUsers
Songs
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• = item latent factor vector
Fix tracks
•Aggregate all (user, track) streams into a large matrix
•Goal: Approximate binary preference matrix by the inner product of 2 smaller matrices by
minimizing the weighted RMSE (root mean squared error) using a function of total plays as weight
•Why?: Once learned, the top recommendations for a user are the top inner products between
their latent factor vector in X and the track latent factor vectors in Y.
31. 31
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
X YUsers
Songs
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• = item latent factor vector
Fix tracks
Solve for users
•Aggregate all (user, track) streams into a large matrix
•Goal: Approximate binary preference matrix by the inner product of 2 smaller matrices by
minimizing the weighted RMSE (root mean squared error) using a function of total plays as weight
•Why?: Once learned, the top recommendations for a user are the top inner products between
their latent factor vector in X and the track latent factor vectors in Y.
Alternating Least Squares
32. 32
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
X YUsers
Songs
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• = item latent factor vector
Fix users
•Aggregate all (user, track) streams into a large matrix
•Goal: Approximate binary preference matrix by the inner product of 2 smaller matrices by
minimizing the weighted RMSE (root mean squared error) using a function of total plays as weight
•Why?: Once learned, the top recommendations for a user are the top inner products between
their latent factor vector in X and the track latent factor vectors in Y.
Alternating Least Squares
33. 33
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
X YUsers
Songs
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• = item latent factor vector
Fix users
Solve for tracks
•Aggregate all (user, track) streams into a large matrix
•Goal: Approximate binary preference matrix by the inner product of 2 smaller matrices by
minimizing the weighted RMSE (root mean squared error) using a function of total plays as weight
•Why?: Once learned, the top recommendations for a user are the top inner products between
their latent factor vector in X and the track latent factor vectors in Y.
Alternating Least Squares
34. 34
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
X YUsers
Songs
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• = item latent factor vector
Fix users
Solve for tracks
Repeat until convergence…
•Aggregate all (user, track) streams into a large matrix
•Goal: Approximate binary preference matrix by the inner product of 2 smaller matrices by
minimizing the weighted RMSE (root mean squared error) using a function of total plays as weight
•Why?: Once learned, the top recommendations for a user are the top inner products between
their latent factor vector in X and the track latent factor vectors in Y.
Alternating Least Squares
35. 35
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
X YUsers
Songs
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• = item latent factor vector
Fix users
Solve for tracks
Repeat until convergence…
•Aggregate all (user, track) streams into a large matrix
•Goal: Approximate binary preference matrix by the inner product of 2 smaller matrices by
minimizing the weighted RMSE (root mean squared error) using a function of total plays as weight
•Why?: Once learned, the top recommendations for a user are the top inner products between
their latent factor vector in X and the track latent factor vectors in Y.
Alternating Least Squares
36. Matrix Factorization with MapReduce
36
Reduce stepMap step
u % K = 0
i % L = 0
u % K = 0
i % L = 1
...
u % K = 0
i % L = L-1
u % K = 1
i % L = 0
u % K = 1
i % L = 1
... ...
... ... ... ...
u % K = K-1
i % L = 0
... ...
u % K = K-1
i % L = L-1
item vectors
item%L=0
item vectors
item%L=1
item vectors
i % L = L-1
user vectors
u % K = 0
user vectors
u % K = 1
user vectors
u % K = K-1
all log entries
u % K = 1
i % L = 1
u % K = 0
u % K = 1
u % K = K-1
Figure via Erik Bernhardsson
37. Matrix Factorization with MapReduce
37
One map task
Distributed
cache:
All user vectors
where u % K = x
Distributed
cache:
All item vectors
where i % L = y
Mapper Emit contributions
Map input:
tuples (u, i, count)
where
u % K = x
and
i % L = y
Reducer New vector!
Figure via Erik Bernhardsson
38. 38
•Fast and general purpose cluster computing system
•Provides high-level apis in Java, Scala, and Python
•Takes advantage of in-memory caching to reduce I/O bottleneck of
Hadoop MapReduce
•MLlib: Scalable Machine Learning library packaged with Spark
-Collaborative Filtering and Matrix Factorization
-Classification and Regression
-Clustering
-Optimization Primitives
•Spark Streaming: Real time, scalable, fault-tolerant stream processing
•Spark SQL: allows relational queries expressed in SQL, HiveQL, or
Scala to be executed using Spark
39. Matrix Factorization with Spark
39
streams user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
•Partition streams matrix into user (row) and item (column) blocks, partition, and cache
-Unlike with the MapReduce implementation, ratings are never shuffled across the network!
•For each iteration:
1. Compute YtY over item vectors and broadcast
2. For each item vector, send a copy to each user rating partition that requires it (potentially all partitions)
3. Each partition aggregates intermediate terms and solves for optimal user vectors
40. Matrix Factorization with Spark
40
user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
•Partition streams matrix into user (row) and item (column) blocks, partition, and cache
-Unlike with the MapReduce implementation, ratings are never shuffled across the network!
•For each iteration:
1. Compute YtY over item vectors and broadcast
2. For each item vector, send a copy to each user rating partition that requires it (potentially all partitions)
3. Each partition aggregates intermediate terms and solves for optimal user vectors
streams
41. Matrix Factorization with Spark
41
user vectors item vectors
•Partition streams matrix into user (row) and item (column) blocks, partition, and cache
-Unlike with the MapReduce implementation, ratings are never shuffled across the network!
•For each iteration:
1. Compute YtY over item vectors and broadcast
2. For each item vector, send a copy to each user rating partition that requires it (potentially all partitions)
3. Each partition aggregates intermediate terms and solves for optimal user vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
streams
42. Matrix Factorization with Spark
42
user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
•Partition streams matrix into user (row) and item (column) blocks, partition, and cache
-Unlike with the MapReduce implementation, ratings are never shuffled across the network!
•For each iteration:
1. Compute YtY over item vectors and broadcast
2. For each item vector, send a copy to each user rating partition that requires it (potentially all partitions)
3. Each partition aggregates intermediate terms and solves for optimal user vectors
streams
43. Matrix Factorization with Spark
43
user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
•Partition streams matrix into user (row) and item (column) blocks, partition, and cache
-Unlike with the MapReduce implementation, ratings are never shuffled across the network!
•For each iteration:
1. Compute YtY over item vectors and broadcast
2. For each item vector, send a copy to each user rating partition that requires it (potentially all partitions)
3. Each partition aggregates intermediate terms and solves for optimal user vectors
streams
44. Matrix Factorization with Spark
44
user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
•Partition streams matrix into user (row) and item (column) blocks, partition, and cache
-Unlike with the MapReduce implementation, ratings are never shuffled across the network!
•For each iteration:
1. Compute YtY over item vectors and broadcast
2. For each item vector, send a copy to each user rating partition that requires it (potentially all partitions)
3. Each partition aggregates intermediate terms and solves for optimal user vectors
streams
49. What should I be worried about? 49
•Multiple “right” ways to do the same thing
•Implicits can make code difficult to navigate
•Learning curve can be tough
•Avoid flattening before a join
•Be aware that Scala default collections are immutable (though mutable
versions are also available)
•Use monoid reduces and aggregations where possible and avoid folds
•Be patient with the compiler