This document discusses building data products at LinkedIn using Hadoop. It describes how LinkedIn builds recommendations products like "People You May Know" by processing member connection data with Hadoop tools. The workflow involves using Kafka to transfer data to HDFS, Pig and MapReduce to process the data, Azkaban to manage Hadoop jobs, and Voldemort to store results and serve recommendations to members. Triangle closing algorithms in Pig are used to find common connections between members and predict potential new connections. The results are pushed to production systems to power features like "People You May Know" recommendations.
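Triangle closing, as described above, amounts to counting the shared connections of every not-yet-connected pair of members. A minimal in-memory Python sketch (a toy illustration with made-up member names, not LinkedIn's actual Pig/Hadoop implementation):

```python
from collections import defaultdict
from itertools import combinations

# Toy connection graph: member -> set of direct connections (hypothetical data)
connections = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
}

def people_you_may_know(graph):
    """Triangle closing: count common connections between non-connected pairs."""
    common = defaultdict(int)
    for member, friends in graph.items():
        # Every pair of this member's connections shares `member` in common
        for a, b in combinations(sorted(friends), 2):
            if b not in graph.get(a, set()):
                common[(a, b)] += 1
    return dict(common)

# alice and dave are not connected but share two connections (bob and carol),
# so they are a strong "People You May Know" candidate pair
print(people_you_may_know(connections))
```

At LinkedIn's scale this pairwise counting is what the Pig/MapReduce jobs parallelize across the member graph.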
This document discusses social network analysis at LinkedIn. It provides an overview of LinkedIn's analytics products, including People You May Know, Profile Stats, and InMaps. It then describes in more detail how LinkedIn builds features like Connection Strength and Related Searches using tools like Hadoop MapReduce, Kafka, Azkaban, and Voldemort. The document outlines LinkedIn's data processing cycle and workflow management for serving analytics data to users.
Speaker: André Spiegel
Many applications require processes that load large amounts of data into MongoDB. It is easy to get these processes wrong, resulting in hours or days of loading time when it could be done in minutes. This talk identifies common mistakes and pitfalls and shows design patterns that can dramatically improve performance. The patterns introduced here can be used with any tool or programming language.
Powerful Analysis with the Aggregation Pipeline - MongoDB
Speaker: Asya Kamsky
Think you need to move your data "elsewhere" to do powerful analysis? Think again. The most efficient way to analyze your data is where it already lives. The MongoDB Aggregation Pipeline has been getting more and more powerful, and with new stages, expressions, and tricks we can do extensive analysis of our data inside the MongoDB server.
Speaker: Charlie Swanson
Learn how MongoDB answers your queries from a query system engineer. If you've ever had a performance problem with a query but didn't know how to find the cause, or if you've ever needed to confirm that your shiny new index is being put to work, the explain command is an excellent place to start. MongoDB's explain system is a powerful tool for solving this type of problem, but can be intimidating and unwieldy to use. In this talk, we will discuss how the explain command works and break down its output into digestible pieces.
A Search Index is Not a Database Index - Full Stack Toronto 2017 - Toria Gibbs
A search engine is not a database. Search systems are optimized for fast search using an internal data structure called an inverted index. Databases have a similar feature to allow quick access, also called an index, but it’s a totally different thing!
In this talk, Toria Gibbs will take you on a tour of the internals of a search index, comparing it to common implementations of indexing in relational databases. We’ll see how search engines can outperform databases and discuss the tradeoffs in implementing and maintaining such a system. No prior knowledge of database or search index implementations required; experience creating or querying database tables will be helpful.
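The inverted index at the heart of the talk can be illustrated with a toy Python version (document ids and texts are made up; real search engines add tokenization, relevance scoring, and index compression on top of this basic structure):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical documents
docs = {1: "cheap hotels in Toronto", 2: "Toronto search meetup", 3: "cheap flights"}
index = build_inverted_index(docs)

def search(index, query):
    """AND query: intersect the posting sets of every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search(index, "cheap toronto"))  # {1}
```

The key contrast with a database B-tree index: the inverted index maps *terms* to documents, so a multi-word query is a cheap set intersection rather than a table scan.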
Introduction to Search Systems - ScaleConf Colombia 2017 - Toria Gibbs
Often when a new user arrives on your website, the first place they go to find information is the search box! Whether they are searching for hotels on your travel site, products on your e-commerce site, or friends to connect with on your social media site, it is important to have fast, effective search in order to engage the user.
The document discusses association rule mining with R. It provides an overview of association rule mining concepts like support, confidence and lift. It then demonstrates how to use the apriori() function in R to generate association rules from the Titanic dataset. The document shows how to remove redundant rules, interpret rules and visualize rules using scatter plots and matrices.
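The support, confidence, and lift measures mentioned above are simple to compute by hand. A small Python illustration with made-up transactions (the document itself uses R's apriori() on the Titanic dataset; the attribute names below merely echo that example):

```python
# Hypothetical transactions, each a set of attribute=value items
transactions = [
    {"class=3rd", "sex=male", "survived=no"},
    {"class=1st", "sex=female", "survived=yes"},
    {"class=3rd", "sex=male", "survived=no"},
    {"class=1st", "sex=male", "survived=no"},
    {"class=3rd", "sex=female", "survived=yes"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions matching lhs, the fraction also matching rhs."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence relative to the baseline frequency of rhs (>1 = positive association)."""
    return confidence(lhs, rhs) / support(rhs)

lhs, rhs = {"class=3rd", "sex=male"}, {"survived=no"}
print(support(lhs | rhs), confidence(lhs, rhs), lift(lhs, rhs))
```

Tools like apriori() enumerate candidate itemsets efficiently, but every rule they report is scored with exactly these three measures.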
This document provides an introduction to using IPython and improving Python code through various techniques like list comprehensions and collections. It discusses features of IPython like interactive shells and Jupyter notebooks. It also covers enhancing for loops, dictionaries, and using namedtuple to organize data from collections. The goal is to help readers become more proficient in Python.
Visualization of Supervised Learning with {arules} + {arulesViz} - Takashi J OZAKI
This document discusses visualizing supervised learning models using association rules and the arules and arulesViz packages in R. It shows how association rules generated from sample user activity data can be represented as graphs, allowing intuitive visualization of relationships between variables even in high-dimensional data. The visualizations are compared to results from GLMs and random forests to show how nodes are located based on their "closeness" in different supervised learning models. While less quantitative, this technique provides a more intuitive understanding of supervised learning that is useful for presentations.
This document discusses building regression and classification models in R, including linear regression, generalized linear models, and decision trees. It provides examples of building each type of model using various R packages and datasets. Linear regression is used to predict CPI data. Generalized linear models and decision trees are built to predict body fat percentage. Decision trees are also built on the iris dataset to classify flower species.
The document discusses the deque collection in Python. Some key points:
- Deque allows fast appends and pops from either end, with O(1) time complexity, unlike regular lists, which are slow (O(n)) for pop(0) and insert(0, v).
- Deque provides methods like append, appendleft, pop, and popleft for adding and removing elements at either end.
- It can be initialized with a maximum length to act as a sliding window, discarding old elements as new ones are added.
- Methods like rotate shift the deque by a given number of positions, and extend adds multiple elements at once. Deque is useful when elements need to be added or removed efficiently at both ends, as in queues and sliding windows.
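A quick Python session showing the operations listed above:

```python
from collections import deque

d = deque([1, 2, 3])
d.append(4)        # add on the right: deque([1, 2, 3, 4])
d.appendleft(0)    # add on the left:  deque([0, 1, 2, 3, 4])
d.pop()            # O(1) from the right, returns 4
d.popleft()        # O(1) from the left, returns 0
d.rotate(1)        # rotate right by one: deque([3, 1, 2])
d.extend([9, 9])   # append several elements at once

# maxlen turns a deque into a sliding window: old items fall off the left
window = deque(maxlen=3)
for x in range(5):
    window.append(x)
print(window)  # deque([2, 3, 4], maxlen=3)
```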
Functional Pe(a)rls - the Purely Functional Datastructures edition - osfameron
All new material, this time about one of the fundamental functional datastructures, the Linked List, and the overview of an implementation in Moosey Perl.
This covers some of the same material as https://github.com/osfameron/pure-fp-book but perhaps with more explanation (and covering much less material - it was only a 20 minute talk)
PostgreSQL: Advanced features in practice - Jano Suchal
The document discusses several advanced features of PostgreSQL including:
1) Transactional DDL, which allows schema changes (DDL statements) to be executed and rolled back inside a transaction.
2) Cost-based query optimization and graphical EXPLAIN plans which help choose the most efficient query plan.
3) Features like partial indexes, function indexes, k-nearest search, views, and window functions which provide powerful ways to query and analyze data.
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling - Plotly
If you are struggling to make a plot, tear yourself away from stackoverflow for a moment and ... take a hard look at your data. Is it really in the most favorable form for the task at hand? Time and time again I have found that my visualization struggles are really a symptom of unfinished data wrangling. R has long had excellent facilities for data aggregation or "split-apply-combine": split an object into pieces, compute on each piece, and glue the result back together again. Recent developments, especially in the purrr package, have made "split-apply-combine" even easier and more general. But this requires a certain comfort level with lists, especially with lists that are columns inside a data frame. This is unfamiliar to most of us. I give an overview of this set of problems and match them up with solutions based on grouped, nested, and split data frames.
An overview of two types of graph databases: property databases and knowledge/RDF databases, together with their dominant respective query languages, Cypher and SPARQL. Also a quick look at some property DB frameworks, including TinkerPop and its query language, Gremlin.
The document discusses using R for text analysis tasks such as document summarization. It introduces tokenization of text, correspondence analysis, topic modeling with LDA, and graph-based summarization using LexRank. Sample code is provided to preprocess text into a tokenized dataframe, perform correspondence analysis and LDA topic modeling, and generate a summary by ranking sentences based on their connectivity in a similarity graph.
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow) - MongoSF
The document appears to be notes from a MongoDB training session that discusses various MongoDB features like MapReduce, geospatial indexes, and GridFS. It also covers topics like database commands, indexing, and querying documents with embedded documents and arrays. Examples are provided for how to implement many of these MongoDB features and functions.
A tour of Python: slides from presentation given in 2012.
[Some slides are not properly rendered in SlideShare: the original is still available at http://www.aleksa.org/2015/04/python-presentation_7.html.]
This document provides an overview of arrays, generics, wildcards, and lambda expressions in Java. It discusses array initialization and multi-dimensional arrays. It explains generics and how they avoid type casting. It covers wildcard types, subclass/superclass constraints, and limitations of generics. The document also demonstrates lambda expressions and functional interfaces like Comparator. It provides examples of common data structures in Java like List, Set, and Map as well as streams and iterators.
The document discusses clustering and numpy arrays in Python. It shows how to create arrays using numpy, perform operations like summing and finding min/max values, and access elements and slices. It also introduces Cython and demonstrates compiling a simple "Hello World" Cython program and using Cython to optimize a Python prime number generation function for improved performance.
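The numpy array operations described above look like this in a short Python sketch (the array values are chosen for illustration):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)  # 3x4 array holding 0..11

print(a.sum())           # total of all elements: 66
print(a.min(), a.max())  # extremes: 0 11
print(a.sum(axis=0))     # column sums: [12 15 18 21]
print(a[1, 2])           # single element (row 1, column 2): 6
print(a[:, 1])           # slice out the second column: [1 5 9]
```

Cython-style optimization builds on the same arrays: the Python loop body is compiled to C, which is where the prime-generation speedup in the document comes from.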
The document describes a Haskell program that translates characters in one string to characters in another string. It defines a translate function that maps characters from the first string (set1) to the corresponding characters in the second string (set2). A translateString function applies the translate function to a given string, and the main function gets the set1 and set2 strings from arguments, reads stdin, applies translateString, and writes the result to stdout, catching any errors.
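The same character-translation idea can be sketched in Python with str.maketrans (a rough stand-in for the Haskell translate function the document describes, not a port of it; set1 and set2 must have equal length):

```python
def translate(set1, set2, text):
    """Map each character found in set1 to the character at the
    same position in set2; all other characters pass through."""
    table = str.maketrans(set1, set2)
    return text.translate(table)

print(translate("abc", "xyz", "aabbcc dd"))  # xxyyzz dd
```

The full program described above would additionally read set1 and set2 from command-line arguments and stream stdin to stdout, like the Unix tr utility.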
Patterns for Slick database applications - Skills Matter
Slick is Typesafe's open source database access library for Scala. It features a collection-style API, compact syntax, type-safe, compositional queries and explicit execution control. Community feedback helped us to identify common problems developers are facing when writing Slick applications. This talk suggests particular solutions to these problems. We will be looking at reducing boiler-plate, re-using code between queries, efficiently modeling object references and more.
Expressions and all, all, all - Alexander Fomin - Ciklum Ukraine
The document discusses expressions in C# and LINQ. It provides examples of expression trees representing lambda expressions and method calls. It also describes using expressions to implement reflection at compile time and build an IQueryable provider by compiling expressions to IL.
Large scale social recommender systems at LinkedIn - Mitul Tiwari
This document discusses the large-scale social recommender systems used at LinkedIn. It describes how LinkedIn uses recommender systems to suggest news articles, people you may know connections, related searches, and skill endorsements. It also discusses the challenges of scaling these recommender systems given LinkedIn's large user base and social graph. Key recommender systems covered include recommending news items to users, identifying potential connections through people you may know, and suggesting skill endorsements. The document also discusses how social graph analysis and modeling virality helps these recommender systems.
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering... - Mitul Tiwari
LinkedIn has a large professional network with 360M members. They build data-driven products using members' rich profile data. To do this, they ingest online data into offline systems using Apache Kafka. The data is then processed using Hadoop, Spark, Samza and Cubert to compute features and train models. Results are moved back online using Voldemort and Kafka. For example, People You May Know recommendations are generated by triangle closing in Hadoop and Cubert to count common connections faster. Site speed is monitored in real-time using Samza to join logs from different services.
Modeling Impression discounting in large-scale recommender systems - Mitul Tiwari
Recommender systems have become very important for many online activities, such as watching movies, shopping for products, and connecting with friends on social networks. User behavioral analysis and user feedback (both explicit and implicit) modeling are crucial for the improvement of any online recommender system. Widely adopted recommender systems at LinkedIn such as “People You May Know” and “Suggested Skills Endorsement” are evolving by analyzing user behaviors on impressed recommendation items.
In this paper, we address modeling impression discounting of recommended items, that is, how to model users' no-action feedback on impressed recommended items. The main contributions of this paper include (1) large-scale analysis of impression data from LinkedIn and KDD Cup; (2) novel anti-noise regression techniques, and their application to learn four different impression discounting functions: linear decay, inverse decay, exponential decay, and quadratic decay; (3) applying these impression discounting functions to LinkedIn's "People You May Know" and "Suggested Skills Endorsements" recommender systems.
Large scale social recommender systems and their evaluation - Mitul Tiwari
This talk will give an overview of some of the large-scale recommender systems at LinkedIn, such as People You May Know (PYMK) and Suggested Skills Endorsements. It will also address how we formulate machine learning problems to build these recommender systems and evaluate our models. Modeling for these systems involves careful feature engineering and incorporating user feedback, both explicit and implicit. This talk will describe how we engineer features through an example of modeling organizational overlap between people for link prediction and community detection over the social graph, and how we incorporate user feedback by applying impression discounting to ignored recommended results. Careful evaluation of modeling changes, both offline and online (A/B testing), is an inherent part of measuring the effectiveness of our recommender systems. We have built a sophisticated end-to-end A/B testing and evaluation platform at LinkedIn called XLNT, and this talk will also cover how we use XLNT for power analysis, A/B testing, and measuring confidence in the results.
Metaphor: A system for related searches recommendations - Mitul Tiwari
Search plays an important role in online social networks as it provides an essential mechanism for discovering members and content on the network. Related search recommendation is one of several mechanisms used for improving members’ search experience in finding relevant results to their queries. This paper describes the design, implementation, and deployment of Metaphor, the related search recommendation system on LinkedIn, a professional social networking site with over 175 million members worldwide. Metaphor builds on a number of signals and filters that capture several dimensions of relatedness across member search activity.
The system, which has been in live operation for over a year, has gone through multiple iterations and evaluation cycles. This paper makes three contributions. First, we provide a discussion of a large-scale related search recommendation system. Second, we describe a mechanism for effectively combining several signals in building a unified dataset for related search recommendations. Third, we introduce a query length model for capturing bias in recommendation click behavior. We also discuss some of the practical concerns in deploying related search recommendations.
Visualization of Supervised Learning with {arules} + {arulesViz}Takashi J OZAKI
This document discusses visualizing supervised learning models using association rules and the arules and arulesViz packages in R. It shows how association rules generated from sample user activity data can be represented as graphs, allowing intuitive visualization of relationships between variables even in high-dimensional data. The visualizations are compared to results from GLMs and random forests to show how nodes are located based on their "closeness" in different supervised learning models. While less quantitative, this technique provides a more intuitive understanding of supervised learning that is useful for presentations.
This document discusses building regression and classification models in R, including linear regression, generalized linear models, and decision trees. It provides examples of building each type of model using various R packages and datasets. Linear regression is used to predict CPI data. Generalized linear models and decision trees are built to predict body fat percentage. Decision trees are also built on the iris dataset to classify flower species.
The document discusses the deque collection in Python. Some key points:
- Deque allows fast appends and pops from either side of the list, with O(1) time complexity, unlike regular lists which are slow (O(n)) for pop(0) and insert(0,v).
- Deque provides methods like append, appendleft, popleft, pop for adding/removing elements from either side of the list.
- It can be initialized with a maximum length to act as a sliding window, discarding old elements as new ones are added.
- Methods like rotate rotate the deque a given number of positions, extending adds multiple elements at once. Deque is useful when
Functional Pe(a)rls - the Purely Functional Datastructures editionosfameron
All new material, this time about one of the fundamental functional datastructures, the Linked List, and the overview of an implementation in Moosey Perl.
This covers some of the same material as https://github.com/osfameron/pure-fp-book but perhaps with more explanation (and covering much less material - it was only a 20 minute talk)
PostgreSQL: Advanced features in practiceJano Suchal
The document discusses several advanced features of PostgreSQL including:
1) Transactional DDL which allows DDL statements to be executed transactionally.
2) Cost-based query optimization and graphical EXPLAIN plans which help choose the most efficient query plan.
3) Features like partial indexes, function indexes, k-nearest search, views, and window functions which provide powerful ways to query and analyze data.
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPlotly
If you are struggling to make a plot, tear yourself away from stackoverflow for a moment and ... take a hard look at your data. Is it really in the most favorable form for the task at hand? Time and time again I have found that my visualization struggles are really a symptom of unfinished data wrangling. R has long had excellent facilities for data aggregation or "split-apply-combine": split an object into pieces, compute on each piece, and glue the result back together again. Recent developments, especially in the purrr package, have made "split-apply-combine" even easier and more general. But this requires a certain comfort level with lists, especially with lists that are columns inside a data frame. This is unfamiliar to most of us. I give an overview of this set of problems and match them up with solutions based on grouped, nested, and split data frames.
An overview of two types of graph databases: property databases and knowledge/RDF databases, together with their dominant respective query languages, Cypher and SPARQL. Also a quick look at some property DB frameworks, including TinkerPop and its query language, Gremlin.
The document discusses using R for text analysis tasks such as document summarization. It introduces tokenization of text, correspondence analysis, topic modeling with LDA, and graph-based summarization using LexRank. Sample code is provided to preprocess text into a tokenized dataframe, perform correspondence analysis and LDA topic modeling, and generate a summary by ranking sentences based on their connectivity in a similarity graph.
Map/reduce, geospatial indexing, and other cool features (Kristina Chodorow)MongoSF
The document appears to be notes from a MongoDB training session that discusses various MongoDB features like MapReduce, geospatial indexes, and GridFS. It also covers topics like database commands, indexing, and querying documents with embedded documents and arrays. Examples are provided for how to implement many of these MongoDB features and functions.
A tour of Python: slides from presentation given in 2012.
[Some slides are not properly rendered in SlideShare: the original is still available at http://www.aleksa.org/2015/04/python-presentation_7.html.]
This document provides an overview of arrays, generics, wildcards, and lambda expressions in Java. It discusses array initialization and multi-dimensional arrays. It explains generics and how they avoid type casting. It covers wildcard types, subclass/superclass constraints, and limitations of generics. The document also demonstrates lambda expressions and functional interfaces like Comparator. It provides examples of common data structures in Java like List, Set, and Map as well as streams and iterators.
The document discusses clustering and numpy arrays in Python. It shows how to create arrays using numpy, perform operations like summing and finding min/max values, and access elements and slices. It also introduces Cython and demonstrates compiling a simple "Hello World" Cython program and using Cython to optimize a Python prime number generation function for improved performance.
The document describes a Haskell program that translates characters in one string to characters in another string. It defines a translate function that maps characters from the first string (set1) to the corresponding characters in the second string (set2). A translateString function applies the translate function to a given string, and the main function gets the set1 and set2 strings from arguments, reads stdin, applies translateString, and writes the result to stdout, catching any errors.
Patterns for slick database applicationsSkills Matter
Slick is Typesafe's open source database access library for Scala. It features a collection-style API, compact syntax, type-safe, compositional queries and explicit execution control. Community feedback helped us to identify common problems developers are facing when writing Slick applications. This talk suggests particular solutions to these problems. We will be looking at reducing boiler-plate, re-using code between queries, efficiently modeling object references and more.
Ciklum net sat12112011-alexander fomin-expressions and all, all, allCiklum Ukraine
The document discusses expressions in C# and LINQ. It provides examples of expression trees representing lambda expressions and method calls. It also describes using expressions to implement reflection at compile time and build an IQueryable provider by compiling expressions to IL.
Large scale social recommender systems at LinkedInMitul Tiwari
This document discusses the large-scale social recommender systems used at LinkedIn. It describes how LinkedIn uses recommender systems to suggest news articles, people you may know connections, related searches, and skill endorsements. It also discusses the challenges of scaling these recommender systems given LinkedIn's large user base and social graph. Key recommender systems covered include recommending news items to users, identifying potential connections through people you may know, and suggesting skill endorsements. The document also discusses how social graph analysis and modeling virality helps these recommender systems.
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Mitul Tiwari
LinkedIn has a large professional network with 360M members. They build data-driven products using members' rich profile data. To do this, they ingest online data into offline systems using Apache Kafka. The data is then processed using Hadoop, Spark, Samza and Cubert to compute features and train models. Results are moved back online using Voldemort and Kafka. For example, People You May Know recommendations are generated by triangle closing in Hadoop and Cubert to count common connections faster. Site speed is monitored in real-time using Samza to join logs from different services.
Modeling Impression discounting in large-scale recommender systemsMitul Tiwari
Recommender systems have become very important for many online activities, such as watching movies, shopping for products, and connecting with friends on social networks. User behavioral analysis and user feedback (both explicit and implicit) modeling are crucial for the improvement of any online recommender system. Widely adopted recommender systems at LinkedIn such as “People You May Know” and “Suggested Skills Endorsement” are evolving by analyzing user behaviors on impressed recommendation items.
In this paper, we address modeling impression discounting of
recommended items, that is, how to model users’ no-action feedback on impressed recommended items. The main contributions of this paper include (1) large-scale analysis of impression data from LinkedIn and KDD Cup; (2) novel anti-noise regression techniques, and its application to learn four different impression discounting functions including linear decay, inverse decay, exponential decay, and quadratic decay; (3) applying these impression discounting functions to LinkedIn’s “People You May Know” and “Suggested Skills Endorsements” recommender systems.
Large scale social recommender systems and their evaluationMitul Tiwari
This talk will give an overview of some of the large-scale recommender systems at LinkedIn such as People You May Know (PYMK) and Suggested Skills Endorsements. This talk will also address how we formulate machine learning modeling problems to build these recommender systems and evaluate our models. Modeling for these recommender systems involves careful feature engineering and incorporating user feedback - both explicit and implicit. This talk will describe how we feature engineer through an example of modeling organizational overlap between people for link prediction and community detection over social graph. Also, how we incorporate user feedback through impression discounting ignored recommended results will be described. Careful evaluation of modeling changes both offline and online (A/B testing) is inherent part of measuring effectiveness of our recommender systems. We have built a sophisticated end-to-end A/B testing and evaluation platform called XLNT at LinkedIn and this talk will also cover how we use XLNT for power analysis, A/B testing, and measuring confidence of the results.
Metaphor: A system for related searches recommendationsMitul Tiwari
Search plays an important role in online social networks as it provides an essential mechanism for discovering members and content on the network. Related search recommendation is one of several mechanisms used for improving members’ search experience in finding relevant results to their queries. This paper describes the design, implementation, and deployment of Metaphor, the related search recommendation system on LinkedIn, a professional social networking site with over 175 million members worldwide. Metaphor builds on a number of signals and filters that capture several dimensions of relatedness across member search activity.
The system, which has been in live operation for over a year, has gone through multiple iterations and evaluation cycles. This paper makes three contributions. First, we provide a discussion of a large-scale related search recommendation system. Second, we describe a mechanism for effectively combining several signals in building a unified dataset for related search recommendations. Third, we introduce a query length model for capturing bias in recommendation
click behavior. We also discuss some of the practical concerns in deploying related search recommendations.
Large-scale Social Recommendation Systems: Challenges and Opportunity - Mitul Tiwari
Keynote talk at 4th International Workshop on Social Recommender Systems (SRS 2013)
In conjunction with 22nd International World Wide Web Conference (WWW 2013). More details: http://cslinux0.comp.hkbu.edu.hk/~fwang/srs2013/
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
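As a concept sketch only (this is not the MongoDB Atlas API), vector search reduces to ranking documents by similarity between embedding vectors; Atlas replaces this brute-force scan with approximate nearest-neighbor indexes to make it scale.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

def vector_search(query_vec, docs, k=3):
    """docs: list of (doc_id, vector). Returns the top-k ids by similarity."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

With real embeddings the vectors come from a model (e.g. a sentence encoder), and "semantic" matches are simply the nearest neighbors of the query embedding.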
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you... - Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries: libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
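The core idea behind trimming seeds can be sketched as a greedy loop that drops any byte whose removal leaves the program's coverage fingerprint unchanged. Here `coverage_of` is a hypothetical stand-in for running the instrumented target; DIAR itself works against AFL's coverage maps, not this toy callback.

```python
def trim_seed(seed: bytes, coverage_of) -> bytes:
    """Greedily remove bytes that do not affect the coverage fingerprint.

    coverage_of: callable mapping a seed to a comparable coverage summary
    (a stand-in for executing the instrumented target on that input).
    """
    baseline = coverage_of(seed)
    out = bytearray(seed)
    i = 0
    while i < len(out):
        candidate = out[:i] + out[i + 1:]
        if coverage_of(bytes(candidate)) == baseline:
            out = bytearray(candidate)  # byte was uninteresting; drop it
        else:
            i += 1                      # byte matters; keep it and move on
    return bytes(out)
```

On a bloated seed this leaves only the bytes the target actually reacts to, so every later mutation lands on a byte that can change program behavior.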
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Enhancing adoption of Open Source Libraries: A case study on Albumentations.AI - Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! - SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
How to Get CNIC Information System with Paksim Ga - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
9. Data Products: Key Ideas
Recommendations
People You May Know, Viewers of this profile ...
Analytics and Insight
Profile Stats: Who Viewed My Profile, Skills
Visualization
InMaps
10. Data Products: Challenges
LinkedIn: 2nd largest social network
120 million members on LinkedIn
Billions of connections
Billions of pageviews
Terabytes of data to process
11. Outline
What do I mean by Data Products?
Systems and Tools we use
Let’s build “People You May Know”
Managing workflow
Serving data in production
Data Quality
Performance
12. Systems and Tools
Kafka (LinkedIn)
Hadoop (Apache)
Azkaban (LinkedIn)
Voldemort (LinkedIn)
13. Systems and Tools
Kafka
publish-subscribe messaging system
transfer data from production to HDFS
Hadoop
Azkaban
Voldemort
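Kafka's role in this pipeline is a durable publish-subscribe log between production services and HDFS: producers append events to a topic, and each consumer reads forward from its own offset. A minimal in-memory sketch of that pattern follows; it is not Kafka's API, and Kafka adds partitioning, replication, and persistence on top of this shape.

```python
class MiniLog:
    """Toy publish-subscribe log with per-consumer offsets."""

    def __init__(self):
        self.topics = {}    # topic -> append-only list of messages
        self.offsets = {}   # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic's log."""
        self.topics.setdefault(topic, []).append(message)

    def consume(self, topic, consumer):
        """Return messages published since this consumer's last read."""
        log = self.topics.get(topic, [])
        start = self.offsets.get((topic, consumer), 0)
        self.offsets[(topic, consumer)] = len(log)
        return log[start:]
```

Because offsets are per-consumer, an ETL job shipping events to HDFS and a monitoring job can read the same topic independently without interfering.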
17. Outline
What do I mean by Data Products?
Systems and Tools we use
Let’s build “People You May Know”
Managing workflow
Serving data in production
Data Quality
Performance
18. People You May Know
How do people know each other?
(Diagram: Alice connected to both Bob and Carol)
19. People You May Know
How do people know each other?
20. People You May Know
How do people know each other?
Triangle closing
21. People You May Know
How do people know each other?
Triangle closing
Prob(Bob knows Carol) ~ the # of common connections
22. Triangle Closing in Pig
-- connections in (source_id, dest_id) format, in both directions
connections = LOAD 'connections' USING PigStorage();
-- collect each member's connection list
group_conn = GROUP connections BY source_id;
-- generatePair is a UDF emitting ordered pairs of a member's connections
pairs = FOREACH group_conn GENERATE
    generatePair(connections.dest_id) as (id1, id2);
-- count how many members each pair has in common
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    FLATTEN(group) as (source_id, dest_id),
    COUNT(pairs) as common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
23. Pig Overview
Load: load data, specify format
Store: store data, specify format
Foreach, Generate: Projections, similar to select
Group by: group by column(s)
Join, Filter, Limit, Order, ...
User Defined Functions (UDFs)
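For readers more at home in a general-purpose language, the triangle-closing dataflow can be restated in plain Python. This is a toy restatement, not LinkedIn's implementation; it assumes, as the Pig comment states, that the edge list contains both directions, and that generatePair emits all ordered pairs of a member's connections.

```python
from collections import defaultdict
from itertools import permutations

def triangle_closing(edges):
    """Count common connections for every ordered pair of members.

    edges: iterable of (source_id, dest_id), present in both directions.
    Returns {(id1, id2): number of shared connections}.
    """
    neighbors = defaultdict(list)        # GROUP connections BY source_id
    for src, dst in edges:
        neighbors[src].append(dst)
    counts = defaultdict(int)            # GROUP pairs BY (id1, id2); COUNT
    for conns in neighbors.values():
        for id1, id2 in permutations(conns, 2):   # the generatePair step
            counts[(id1, id2)] += 1
    return dict(counts)

# Alice knows Bob and Carol, so Bob and Carol share one common connection:
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
print(triangle_closing(edges))  # {('B', 'C'): 1, ('C', 'B'): 1}
```

This matches the worked example on the next slides: the only candidate edge the triangle closes is Bob-Carol, with one common connection (Alice).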
29. Triangle Closing Example
(Diagram: Alice connected to both Bob and Carol)
Step 1: connections = LOAD 'connections' USING PigStorage();
yields (A,B),(B,A),(A,C),(C,A)
30. Triangle Closing Example
Step 2: group_conn = GROUP connections BY source_id;
yields (A,{B,C}),(B,{A}),(C,{A})
31. Triangle Closing Example
Step 3: pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2);
yields (A,{B,C}),(A,{C,B})
32. Triangle Closing Example
Step 4: common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE FLATTEN(group) as (source_id, dest_id), COUNT(pairs) as common_connections;
yields (B,C,1),(C,B,1)
36. Outline
What do I mean by Data Products?
Systems and Tools we use
Let’s build “People You May Know”
Managing workflow
Serving data in production
Data Quality
Performance
49. Outline
What do I mean by Data Products?
Systems and Tools we use
Let’s build “People You May Know”
Managing workflow
Serving data in production
Data Quality
Performance
50. Production Storage
Requirements
Large amount of data/Scalable
Quick lookup/low latency
Versioning and Rollback
Fault tolerance
Offline index building
51. Voldemort Storage
Large amount of data/Scalable
Quick lookup/low latency
Versioning and Rollback
Fault tolerance through replication
Read only
Offline index building
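The read-only, versioned serving pattern on this slide can be sketched as follows. The class and method names are invented for illustration; Voldemort's real read-only stores build index and data files offline on Hadoop and swap them into serving atomically, with older versions kept around for rollback.

```python
class ReadOnlyStore:
    """Toy versioned key-value store: build offline, swap, roll back."""

    def __init__(self):
        self.versions = []   # complete snapshots, oldest first
        self.current = -1    # index of the live version

    def push_version(self, data):
        """Install an index built offline as the new live version."""
        self.versions.append(dict(data))
        self.current = len(self.versions) - 1

    def rollback(self):
        """Point serving back at the previous version, if any."""
        if self.current > 0:
            self.current -= 1

    def get(self, key, default=None):
        """Low-latency lookup against the live version only."""
        if self.current < 0:
            return default
        return self.versions[self.current].get(key, default)
```

Because each batch job produces a complete new version, a bad data push is undone by moving the pointer back, not by patching live data.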
55. Outline
What do I mean by Data Products?
Systems and Tools we use
Let’s build “People You May Know”
Managing workflow
Serving data in production
Data Quality
Performance
57. Outline
What do I mean by Data Products?
Systems and Tools we use
Let’s build “People You May Know”
Managing workflow
Serving data in production
Data Quality
Performance
61. Performance
Symmetry
If Bob knows Carol, then Carol knows Bob
Limit
Ignore members with more than k connections
Sampling
Sample k connections
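A hedged sketch of these three tricks applied to a neighbor map; the value of k and the sampling scheme are illustrative, not LinkedIn's settings.

```python
from itertools import combinations
from random import Random

def neighbor_pairs(neighbors, k=5000, sample=False, seed=0):
    """Yield candidate pairs of members who share a connection.

    neighbors: {member_id: [connection ids]}.
    Symmetry: each unordered pair is emitted once, since (b, c) and
    (c, b) carry the same information.
    Limit / Sampling: members with more than k connections are either
    skipped entirely or reduced to a random sample of k connections.
    """
    rng = Random(seed)
    for member, conns in neighbors.items():
        if len(conns) > k:
            if not sample:
                continue                  # limit: ignore heavy members
            conns = rng.sample(conns, k)  # sampling: keep k connections
        for b, c in combinations(sorted(conns), 2):
            yield b, c
```

Capping or sampling heavy members matters because a member with n connections generates O(n^2) pairs, so a handful of very connected members can dominate the whole job.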
62. Things Covered
What do I mean by Data Products?
Systems and Tools we use
Let’s build “People You May Know”
Managing workflow
Serving data in production
Data Quality
Performance
63. SNA Team
Thanks to SNA Team at LinkedIn
http://sna-projects.com
We are hiring!