Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
Taptech - 6/6/2014
Overview
● Apache Spark
○ Dataflow model
○ Spark vs Hadoop MapReduce
● Recommender Systems
○ Similarity-based collaborative filtering
○ Distributed implementation on Apache Spark
○ Lessons learned
Apache Spark
● Distributed data-processing framework built on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph algorithms
○ Stream processing
○ Scalable ML
○ Recommendation engines!
Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading
Programming Model
● Resilient Distributed Dataset (RDD)
○ textFile, parallelize
● Parallel Operations
○ map, groupBy, filter, join, etc.
● Optimizations
○ Caching, shared variables
● Demo
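Below is a minimal sketch of the RDD API listed on this slide (the live demo itself is not in the deck). The app name, input path, and data are placeholders, not taken from the original demo.

```scala
// Minimal RDD sketch: creating RDDs, running parallel operations, and
// using the caching/broadcast optimizations named on the slide.
// The app name, path, and data are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Creating RDDs: from a file on HDFS or from a local collection
    val lines   = sc.textFile("hdfs:///data/ratings.csv")   // placeholder path
    val numbers = sc.parallelize(1 to 1000)

    // Parallel operations: map, filter, groupBy, join, ...
    val evens   = numbers.filter(_ % 2 == 0)
    val keyed   = evens.map(n => (n % 10, n * n))
    val grouped = keyed.groupByKey()

    // Optimizations: cache reused RDDs, broadcast shared read-only data
    grouped.cache()
    val lookup = sc.broadcast(Map(0 -> "zero", 2 -> "two"))

    println(grouped.count())
    println(lookup.value.getOrElse(2, "?"))

    sc.stop()
  }
}
```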
What are recommendation algorithms?
● Problem:
○ “Information overload”
○ Diverse user interests
● User-Item Recommendation
○ Recommend content for each user based on a larger training set of user interaction histories
Motivation
● Large-scale recommender systems
○ Millions of users and items (100m+ ratings)
● Problems:
○ Memory-based approach
○ Scalability/Efficiency
○ User interaction sparsity
Collaborative Filtering
● Similarity-based approach
● Two main variants:
○ User-based
○ Item-based
[Figure: example user-item rating matrix for Shawn, Billy, and Mary, with unknown ratings shown as "?"]
User-based Collaborative Filtering
● Step 1:
Obtain the user-item rating matrix, denoted M(i, j)
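A shell-style sketch of Step 1, assuming the ratings arrive as "userId,itemId,rating" lines; the path, schema, and the helper name buildUserVectors are illustrative, not from the talk.

```scala
// Build a sparse representation of the user-item matrix M(i, j):
// one (userId -> item ratings) entry per user.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def buildUserVectors(sc: SparkContext, path: String): RDD[(Int, Map[Int, Double])] =
  sc.textFile(path)
    .map { line =>
      val Array(user, item, rating) = line.split(",")   // assumed CSV layout
      (user.toInt, (item.toInt, rating.toDouble))
    }
    .groupByKey()          // gather each user's (item, rating) pairs
    .mapValues(_.toMap)    // sparse rating vector for that user

// sc is the spark-shell context; the path is a placeholder
val userVectors = buildUserVectors(sc, "hdfs:///data/ratings.csv")
```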
User-based Collaborative Filtering
● Step 2:
Calculate the similarity between the rating vectors of pairwise users and compute the top-n nearest neighbors for each user
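A sketch of Step 2 under the same shell-style assumptions: cosine similarity between the rating vectors of pairwise users, then the top-n nearest neighbors per user. The all-pairs cartesian product is the naive formulation; a production job would first join users on co-rated items to prune pairs with nothing in common.

```scala
import org.apache.spark.rdd.RDD

// Cosine similarity between two sparse rating vectors.
def cosine(a: Map[Int, Double], b: Map[Int, Double]): Double = {
  val dot   = a.keySet.intersect(b.keySet).toSeq.map(i => a(i) * b(i)).sum
  val normA = math.sqrt(a.values.map(x => x * x).sum)
  val normB = math.sqrt(b.values.map(x => x * x).sum)
  if (normA == 0 || normB == 0) 0.0 else dot / (normA * normB)
}

// Top-n most similar users for every user.
def topNeighbors(userVectors: RDD[(Int, Map[Int, Double])],
                 n: Int): RDD[(Int, Seq[(Int, Double)])] =
  userVectors.cartesian(userVectors)
    .filter { case ((u, _), (v, _)) => u != v }
    .map    { case ((u, uVec), (v, vVec)) => (u, (v, cosine(uVec, vVec))) }
    .groupByKey()
    .mapValues(_.toSeq.sortBy(-_._2).take(n))

val neighbors = topNeighbors(userVectors, 50)   // n = 50 is an arbitrary choice
```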
User-based Collaborative Filtering
● Step 3:
Compute the weighted average of the neighbors' ratings and find the top-n items by recommendation score:

score(u, i) = mean(u) + Σ_v sim(u, v) · (r(v, i) − mean(v)) / Σ_v |sim(u, v)|

where score(u, i) is the recommendation score of item i for user u, sim(u, v) are the pairwise user similarities, mean(·) is a user's mean rating, and r(v, i) is the co-rated user rating from neighbor v
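A sketch of Step 3 under the formula above, reusing the userVectors and neighbors RDDs from the previous sketches. Collecting and broadcasting the per-user means is a simplification for illustration; a very large user set would need a join instead.

```scala
import org.apache.spark.rdd.RDD

def recommend(userVectors: RDD[(Int, Map[Int, Double])],
              neighbors: RDD[(Int, Seq[(Int, Double)])],
              n: Int): RDD[(Int, Seq[(Int, Double)])] = {

  // Per-user mean rating and rating vector, broadcast to the workers
  // (see "broadcast variables" under Lessons Learned).
  val means  = userVectors.mapValues(v => (v.values.sum / v.size, v)).collectAsMap()
  val meansB = neighbors.sparkContext.broadcast(means)

  neighbors.map { case (u, nbrs) =>
    val (uMean, uVec) = meansB.value(u)
    // Similarity-weighted, mean-centered contributions per unrated item.
    val contribs = nbrs.flatMap { case (v, sim) =>
      val (vMean, vVec) = meansB.value(v)
      vVec.collect { case (item, r) if !uVec.contains(item) =>
        (item, (sim * (r - vMean), math.abs(sim)))
      }
    }
    // score(u, i) = mean(u) + sum(sim * centered rating) / sum(|sim|)
    val scores = contribs.groupBy(_._1).map { case (item, cs) =>
      val num = cs.map(_._2._1).sum
      val den = cs.map(_._2._2).sum
      (item, if (den == 0) uMean else uMean + num / den)
    }
    (u, scores.toSeq.sortBy(-_._2).take(n))
  }
}

val recommendations = recommend(userVectors, neighbors, 10)   // top-10 per user
```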
Results
[Figures: results on a standalone cluster and on an Amazon EC2 cluster]
Evaluation
Lessons Learned
● Must manually specify number of tasks
○ Want 2-4 slices for each CPU in your cluster
● Use broadcast variables for shared data and cache data that will be reused
● Must account for the “power users”
○ Sample heavy-tailed user-interaction histories
● Need to account for the rating scale of each user!
○ Adjusted cosine similarity and Pearson correlation outperform normal cosine similarity
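A small sketch of that last point: Pearson correlation over the items two users have co-rated. Mean-centering each user compensates for individual rating scales, which plain cosine similarity ignores; it could drop into the Step 2 sketch in place of cosine.

```scala
// Pearson correlation between two users' rating vectors,
// computed over co-rated items only.
def pearson(a: Map[Int, Double], b: Map[Int, Double]): Double = {
  val common = a.keySet.intersect(b.keySet).toSeq
  if (common.size < 2) 0.0
  else {
    val meanA = common.map(a).sum / common.size
    val meanB = common.map(b).sum / common.size
    val num = common.map(i => (a(i) - meanA) * (b(i) - meanB)).sum
    val den = math.sqrt(common.map(i => math.pow(a(i) - meanA, 2)).sum) *
              math.sqrt(common.map(i => math.pow(b(i) - meanB, 2)).sum)
    if (den == 0) 0.0 else num / den
  }
}
```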
