This is my Strata NY talk about how to build recommendation engines using common items. In particular, I show how multi-modal recommendations can be built using the same framework.
Using Mahout and a Search Engine for Recommendation - Ted Dunning
I presented this talk at the Open World Forum in Paris in 2013. The ideas here are that you can do basic recommendations and extended forms of recommendation such as intelligent search or cross recommendation or multi-modal recommendation using Mahout's cooccurrence analysis together with a search engine.
Ted Dunning discusses recommendation systems and Apache Mahout. He explains how recommendation works using user-item interaction data to identify patterns and make predictions. Recommendation engines can suggest additional items like movies, music, or restaurants based on a user's preferences and behaviors. Dunning outlines the process of building recommendation models with Mahout including transforming data into user-item matrices and using techniques like log likelihood ratios to identify meaningful relationships.
The document discusses the evolution of data and analytics. It notes that early predictions of future "big data" were inaccurate and that scaling laws are changing radically. The document then summarizes MapR's data platform which enhances Apache Hadoop to provide better performance, reliability, integration and administration compared to other Hadoop distributions. MapR delivers a unified platform for file, analytics and NoSQL workloads with innovations like lockless storage and high throughput.
This is the position talk that I gave at CIKM. Included are 4 algorithms that I feel don't get much academic attention, but which are very important industrially. It isn't necessarily true that these algorithms *should* get academic attention, but I do feel that it is true that they are quite important pragmatically speaking.
Apache Mahout is changing radically. Here is a report on what is coming, notably including an R-like domain-specific language that can use multiple computational engines such as Spark.
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time - Ted Dunning
This talk describes how indicator-based recommendations can be evolved in real time. Normally, indicator-based recommendations use a large off-line computation to understand the general structure of items to be recommended and then make recommendations in real-time to users based on a comparison of their recent history versus the large-scale product of the off-line computation.
In this talk, I show how the same components of the off-line computation that guarantee linear scalability in a batch setting also give strict real-time bounds on the cost of a practical real-time implementation of the indicator computation.
These are the slides from my talk at FAR Con in Minneapolis recently. The topics are the implications of buried treasure hoards on data security, horror stories and new, simpler and provably secure methods for public data disclosure.
Many statistics are impossible to compute precisely on streaming data. There are some very clever algorithms, however, which allow us to compute very good approximations of these values efficiently in terms of CPU and memory.
The document discusses machine learning and recommendations. It provides an overview of Mahout and how it can be used to build recommender systems. Specifically, it explains how recommendation algorithms work by analyzing cooccurrence patterns in user behavior logs. It then provides a hypothetical example of a working recommender system that collects user history and item metadata, performs cooccurrence analysis with Mahout, and posts results to a search engine to provide recommendations.
This talk focuses on how larger data sets are not only enabling advanced techniques, but also increasing the number of problems within reach of relatively simple techniques, that is "cheap learning".
The document discusses time series data storage and analysis. It begins with an overview of how time series data can be collected from sensors at high volumes, such as millions of data points per second. It then discusses challenges with storing and analyzing this volume of time series data using traditional databases. The document proposes storing time series data in wide tables in MapR-DB and describes how this can enable ingesting data at very high rates, such as over 100 million data points per second. This approach provides viable solutions for industrial applications generating large volumes of time series data.
This talk describes the general architecture common to anomaly detection systems that are based on probabilistic models. By examining several realistic use cases, I illustrate the common themes and practical implementation methods.
Recent work in recommendations allows some really amazing simplicity of implementation while extending the inputs handled to multiple kinds of interactions against items different from the ones being recommended.
Anomaly Detection - New York Machine Learning - Ted Dunning
Anomaly detection is the art of finding what you don't know how to ask for. In this talk, I walk through the why and how of building probabilistic models for a variety of problems including continuous signals and web traffic. This talk blends theory and practice in a highly approachable way.
Cognitive computing with big data, high tech and low tech approaches - Ted Dunning
I explain some very approachable methods for analyzing big data via a detour through clipper ships and the 19th century open source scene.
Note that I mixed up the route of the Flying Cloud record in this talk. The Flying Cloud's record was actually from New York to San Francisco and was even more impressive than what I said. The usual time had been about 180 days. With Maury's charts, the time was reduced to about 135 days. The Flying Cloud's time was 89 days.
Thanks to Chen Kung for noticing my error.
This was one of the talks that I gave at the Strata San Jose conference. I migrated my topic a bit, but here is the original abstract:
Application developers and architects today are interested in making their applications as real-time as possible. To make an application respond to events as they happen, developers need a reliable way to move data as it is generated across different systems, one event at a time. In other words, these applications need messaging.
Messaging solutions have existed for a long time. However, when compared to legacy systems, newer solutions like Apache Kafka offer higher performance, more scalability, and better integration with the Hadoop ecosystem. Kafka and similar systems are based on drastically different assumptions than legacy systems and have vastly different architectures. But do these benefits outweigh any tradeoffs in functionality? Ted Dunning dives into the architectural details and tradeoffs of both legacy and new messaging solutions to find the ideal messaging system for Hadoop.
Topics include:
* Queues versus logs
* Security issues like authentication, authorization, and encryption
* Scalability and performance
* Handling applications that span multiple data centers
* Multitenancy considerations
* APIs, integration points, and more
These are the slides that we used to ignite the conversation with the audience at Hadoop Summit EU. Come over to the Mahout dev list to be part of the ongoing conversation.
This talk shows practical methods for finding changes in a variety of kinds of data, as well as giving real-world examples from finance, telecom, systems monitoring and natural language processing.
This document discusses combining real-time and batch processing by using a semi-aggregated strategy with snapshots. It presents a simple example of counting events in real-time using Storm and periodically aggregating the counts in Hadoop. This strategy allows for real-time processing with Storm while still being able to run batch jobs on historical data in Hadoop. It also discusses how this approach can be extended to other online algorithms like Bayesian bandits by representing distributions with sampling and updating counts in real-time.
The document discusses how different technologies like Hadoop, Storm, Solr, and D3 can be integrated to build a real-time search and recommendation system. It provides examples of how unprocessed data can be stored in Hadoop and indexed by Storm and Solr in real-time to power a search engine. User queries, clicks, and engagement data would then be analyzed to update the search index and provide personalized recommendations. Visualizations from usage data could also be generated in real-time using D3 and node.js.
This document discusses techniques for making recommendations in real-time using co-occurrence analysis. It describes how interaction cut and frequency cut downsampling allow batch co-occurrence analysis to scale to large datasets. These same techniques also enable an online approach to updating recommendations in real-time with each new user interaction. The key insights are that limiting user histories and item frequencies results in a bounded number of updates needed for each new data point, allowing real-time recommendations using MapR's distributed data platform.
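To make the scaling argument concrete, here is a minimal Java sketch of the two cuts. The class name and the specific limits are hypothetical, and production implementations such as Mahout's sample randomly rather than simply keeping the first interactions; the point is only that once both caps are hit, a new observation can touch at most a bounded amount of state.

    import java.util.HashMap;
    import java.util.Map;

    public class DownSampler {
        static final int MAX_USER_HISTORY = 300;    // interaction cut: interactions kept per user
        static final int MAX_ITEM_FREQUENCY = 500;  // frequency cut: occurrences counted per item

        private final Map<String, Integer> userCount = new HashMap<>();
        private final Map<String, Integer> itemCount = new HashMap<>();

        /** Returns true if this (user, item) interaction should enter the co-occurrence analysis. */
        public boolean offer(String user, String item) {
            // Very popular items carry little information, so stop counting them past the cap.
            if (itemCount.merge(item, 1, Integer::sum) > MAX_ITEM_FREQUENCY) {
                return false;
            }
            // Hyperactive users likewise get truncated, which bounds the cost of any one update.
            return userCount.merge(user, 1, Integer::sum) <= MAX_USER_HISTORY;
        }
    }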
MapR offers an enterprise distribution of Hadoop that supports a broad range of use cases. It has been chosen by major companies like Google and Amazon for its capabilities. The document discusses three stories of how companies have benefited from using MapR: 1) A telecom company was able to offload ETL processing to gain a 20x cost performance advantage. 2) A company improved the performance of a recommendation engine on large datasets. 3) A machine learning expert was able to reproduce models accurately and explain previous recommendations.
The document discusses various techniques for anomaly detection in streaming data. It begins by outlining the basic steps of building an anomaly detection model and detecting anomalies in new data. It then discusses challenges in setting an appropriate threshold to determine what constitutes an anomaly. The document explores using adaptive thresholds and algorithms like t-digest to help determine outliers. It also discusses challenges like non-stationary data and more complex models, as well as techniques like clustering and autoencoders to model time series data.
This document discusses t-digest, which provides a compact way to represent a distribution of values. T-digest uses adaptive bins that are smaller near the edges, allowing it to accurately track quantiles even with a limited number of bins. It works by taking data samples, sorting them, and grouping them into bins while respecting a maximum size. The bins can then be merged across samples or time periods. T-digest is useful for applications that need to track distributions over many variables or time periods with limited space.
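As an illustration, here is a minimal sketch using the t-digest library (this assumes the com.tdunning:t-digest artifact and its createMergingDigest factory; the stream of Gaussian samples is just stand-in data):

    import com.tdunning.math.stats.TDigest;
    import java.util.Random;

    public class TDigestDemo {
        public static void main(String[] args) {
            // a compression of 100 keeps roughly a few hundred centroids, however long the stream
            TDigest digest = TDigest.createMergingDigest(100);
            Random rand = new Random(42);
            for (int i = 0; i < 1_000_000; i++) {
                digest.add(rand.nextGaussian());   // one value at a time, constant memory
            }
            // accuracy is best near the tails because the adaptive bins are smallest there
            System.out.println("median ~ " + digest.quantile(0.5));
            System.out.println("99.9th percentile ~ " + digest.quantile(0.999));
            System.out.println("P(x < 2) ~ " + digest.cdf(2.0));
        }
    }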
The document describes the "lambda + epsilon" architecture for combining real-time and batch processing using Hadoop and Storm. It addresses the challenge that Hadoop is not suitable for real-time processing, while Storm lacks batch processing capabilities. The architecture divides computation into two parts for real-time approximation and long-term accurate results to provide a blended view over time.
I gave this talk at Buzzwords just now to fill in for an ill speaker.
The topics include things that are being added to or taken out of Mahout. These include cruft (out), fast clustering (in), nearest neighbor search (in), Pig bindings for Mahout (who knows).
Co-occurrence Based Recommendations with Mahout, Scala and Spark - sscdotopen
This document discusses techniques for co-occurrence-based recommendations using Apache Mahout, Scala, and Spark. It describes how Mahout computes the co-occurrence matrix AᵀA using a row-outer product formulation that executes in a single pass over the row-partitioned matrix A. It also explains how the computation is optimized physically by using specialized operators like Transpose-Times-Self to avoid repartitioning the matrix. Finally, it provides examples of how the distributed computation of AᵀA is implemented across worker nodes.
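The row-outer-product formulation is easy to see in miniature. The following is a single-machine Java sketch, not Mahout's actual distributed operator: each row of A (one user's item set) contributes its own outer product, so AᵀA accumulates in one pass over the rows.

    import java.util.*;

    public class SelfSimilarity {
        /** One pass over the rows of a sparse binary user-item matrix A, accumulating A^T A. */
        public static Map<String, Map<String, Integer>> ata(Iterable<Set<String>> rows) {
            Map<String, Map<String, Integer>> c = new HashMap<>();
            for (Set<String> items : rows) {
                for (String i : items) {
                    for (String j : items) {
                        if (!i.equals(j)) {   // the diagonal is just item frequency; skip it
                            c.computeIfAbsent(i, k -> new HashMap<>()).merge(j, 1, Integer::sum);
                        }
                    }
                }
            }
            return c;
        }

        public static void main(String[] args) {
            List<Set<String>> rows = List.of(
                    Set.of("apple", "puppy"),
                    Set.of("apple", "puppy", "pony"),
                    Set.of("pony", "kitten"));
            System.out.println(ata(rows));   // e.g. {apple={puppy=2, pony=1}, ...}
        }
    }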
Here are the key steps for Exercise 3:
1. Create a FileDataModel object, passing in the CSV file
2. Instantiate different UserSimilarity objects like PearsonCorrelationSimilarity, EuclideanDistanceSimilarity
3. Calculate similarities between users by calling userSimilarity() on the similarity objects, passing the user IDs
4. Print out the similarities to compare the different measures
The CSV file should contain enough user preference data (user IDs, item IDs, ratings) for the similarity calculations to be meaningful. This exercise demonstrates how to easily plug different similarity functions into Mahout's common interfaces.
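A minimal sketch of those four steps, assuming the Mahout 0.x Taste API and a hypothetical ratings.csv of userID,itemID,rating lines:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class Exercise3 {
        public static void main(String[] args) throws Exception {
            DataModel model = new FileDataModel(new File("ratings.csv"));      // step 1
            UserSimilarity pearson = new PearsonCorrelationSimilarity(model);  // step 2
            UserSimilarity euclidean = new EuclideanDistanceSimilarity(model);
            long user1 = 1L, user2 = 2L;                                       // step 3
            System.out.println("Pearson:   " + pearson.userSimilarity(user1, user2));
            System.out.println("Euclidean: " + euclidean.userSimilarity(user1, user2)); // step 4
        }
    }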
Multi-modal recommendation engines use multiple kinds of behavior as input and can be implemented using standard search engine technology. I show how and why, starting with basic recommendations all the way through full multi-modal systems.
Utilizing Mahout, implement a collaborative filtering framework using historical data (in this instance, movie ratings by 943 users) to provide item-based recommendations. Three item-based recommendations will be provided for each user.
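One plausible reading of that assignment, using Mahout's Taste API (the file name and the choice of an LLR-based item similarity are assumptions, not part of the assignment text):

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class ItemBasedRecs {
        public static void main(String[] args) throws Exception {
            DataModel model = new FileDataModel(new File("ratings.csv"));   // the 943 users' ratings
            ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
            GenericItemBasedRecommender rec = new GenericItemBasedRecommender(model, similarity);
            for (LongPrimitiveIterator users = model.getUserIDs(); users.hasNext();) {
                long userId = users.nextLong();
                for (RecommendedItem item : rec.recommend(userId, 3)) {     // three recs per user
                    System.out.println(userId + "\t" + item.getItemID() + "\t" + item.getValue());
                }
            }
        }
    }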
How to create a cutting-edge recommender that is fast, scalable, can use almost any applicable data, and is extremely flexible for use in many different contexts. Uses Spark, Mahout, and a search engine.
Latent factor models for Collaborative Filtering - sscdotopen
The document discusses latent factor models for collaborative filtering. It describes how latent factor models (1) map both users and items to a latent factor space to characterize them, (2) approximate ratings as the dot product of user and item vectors, and (3) can be used to predict unknown ratings. It also covers techniques like stochastic gradient descent and alternating least squares for training latent factor models on explicit and implicit feedback data.
Matrix Factorization Techniques For Recommender Systems - Lei Guo
The document discusses matrix factorization techniques for recommender systems. It begins by describing common recommender system strategies like content-based and collaborative filtering approaches. It then introduces matrix factorization methods, which characterize both users and items by vectors of latent factors inferred from rating patterns. The basic matrix factorization model approximates user ratings as the inner product of user and item vectors in the joint latent factor space. Learning algorithms like stochastic gradient descent and alternating least squares are used to compute the user and item vectors by minimizing a regularized error function on known ratings.
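The heart of the SGD variant fits in a few lines. This is a generic sketch rather than the document's exact algorithm; the learning rate and regularization constant are illustrative:

    public class MfSgd {
        /** One stochastic gradient step for r_ui ~ p_u . q_i with L2 regularization. */
        static void sgdStep(double[] pu, double[] qi, double rating, double lr, double reg) {
            double pred = 0;
            for (int f = 0; f < pu.length; f++) {
                pred += pu[f] * qi[f];              // current estimate of the rating
            }
            double err = rating - pred;             // error on this known rating
            for (int f = 0; f < pu.length; f++) {
                double p = pu[f], q = qi[f];
                pu[f] += lr * (err * q - reg * p);  // move user factors along the gradient
                qi[f] += lr * (err * p - reg * q);  // and item factors symmetrically
            }
        }
    }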
DFW Big Data talk on Mahout Recommenders - Ted Dunning
This talk focused on how to build recommenders using new technology and capabilities from Mahout. The key here is that recommenders can be built much more easily than you might expect.
This document discusses predictive analytics using Hadoop. It provides examples of recommendation and classification using big data. It describes obtaining large training datasets through crowdsourcing and implicit feedback. It also discusses operational considerations for predictive models, including snapshotting data, leveraging NFS for ingestion, and ensuring high availability. The document concludes with a question and answer section.
Complement Deep Learning with Cheap Learning: Recent results of deep learning on hard problems have set the data world all atwitter and made deep learning the fashion of the time.
But it is very important to remember that as data expands, the learning problems that are encountered are often nearly green field problems and it is often possible to solve these problems using remarkably simple techniques. Indeed, on many problems these simple techniques will give results as good as more complex ones, not because they are profound, but because many problems become simpler at scale.
That said, it isn’t always obvious how to do this. I will describe some of these techniques and show how they can be applied in practice.
The document discusses how different technologies like Hadoop, Storm, Solr, and D3 can be integrated together using common storage platforms. It provides examples of how real-time and batch processing can be combined for applications like search and recommendations. The document advocates that hybrid systems integrating these technologies can provide benefits over traditional tiered architectures and be implemented today.
Ted Dunning, Chief Application Architect, MapR at MLconf SF - MLconf
The document discusses techniques for generating recommendations based on item co-occurrence analysis. It describes how to build a user-item history matrix from log files and transform it into an item-item co-occurrence matrix. It discusses using anomalous co-occurrences as indicators to make recommendations and scaling the analysis using interaction cuts and frequency limits. It also describes how to update the co-occurrence matrix incrementally in real-time to enable online recommendations.
Ted Dunning presents on algorithms that really matter for deploying machine learning systems. The most important advances are often not the algorithms but how they are implemented, including making them deployable, robust, transparent, and with the proper skillsets. Clever prototypes don't matter if they can't be standardized. Sketches that produce many weighted centroids can enable online clustering at scale. Recursive search and recommendations, where one implements the other, can also be important.
The document discusses how big data has enabled new opportunities by changing scaling laws and problem landscapes. Specifically, linearly scaling costs with big data now make it feasible to process large amounts of data, opening up many problems that were previously impossible or too difficult. This has created many "green field" opportunities where simple approaches can solve important problems. Two examples discussed are using log analysis to detect security threats and using transaction histories to find a common point of compromise for a data breach.
SMAC - Presentation from RetailWeek Technology Summit, Sept 23 - AirTight Networks
The document discusses the concept of #SMAC, which stands for social, mobile, analytics, and cloud technologies and how these technologies are transforming businesses. It provides examples of how various retailers have leveraged #SMAC technologies like mobile apps, secure Wi-Fi networks, social media, and cloud-based analytics to improve customer experiences, increase engagement and sales, and reduce IT costs. The document advocates that businesses must adopt #SMAC strategies to remain competitive and highlights how the convergence of these technologies presents opportunities for new business models and customer experiences.
Google Analytics Konferenz 2018_Rock your Data - Aktiviere deine Daten_ Thoma... - e-dialog GmbH
Sound familiar? You have a tracking system on your website and in your app? Maybe you even capture data in stores? So you have a vast amount of relevant data from which you now need to generate smart actions that add value for your company and your customers?
In this talk we show you how to capture and activate data properly with the help of the customer journey and the use of DMPs and CDPs, so that you create not just graphs but a real uplift in ROI and customer value. Naturally, we bring practical examples and use cases along.
Applying Machine learning to IOT: End to End Distributed Pipeline... - Carol McDonald
This discusses the architecture of an end-to-end application that combines streaming data with machine learning to analyze and visualize, in real time, where and when Uber cars are clustered, revealing the most popular Uber locations.
SparkScore (The Social Net Promoter Score): A methodology for measuring socia... - SocialMedia.org
In her Brands-Only Summit presentation, Satmetrix's VP of Innovation and Strategy discusses SparkScore -- the Social Net Promoter Score.
She explains how this social media analytic marries the insight from structured, survey-based solutions to the "social universe" through a single, standard metric.
In September, I presented the Lima Consulting Group Digital Transformation Maturity Model via closed-circuit television to Sanofi employees in the Americas. Here's the material!
This document discusses techniques for detecting advanced persistent threats (APTs). It provides examples of APT attacks and outlines strategies for analyzing event sequences and symbol co-occurrences in large datasets to identify anomalous patterns that can reveal APT activity. Statistical tests like log-likelihood ratio tests are recommended for finding interesting coincidences in tables of symbol co-occurrence data that may indicate security threats.
The document discusses how to generate leads using Brandwatch by refining search queries through an iterative process of starting broad, adding purchase intent phrases and brands, and continually updating the search terms. It recommends setting up alerts and dashboards to manage the incoming leads and integrating with tools like Hootsuite to assign leads efficiently. The key takeaways are to test searches, do outside research on trends, use operators like NEAR to improve relevance, and refine searches on an ongoing basis.
SMAC _ Can It Maximise Staff and Customer Engagement? RWTS - AirTight Networks
@DevinAkin keynote at Retail Week Technology Summit - London, UK - September 26 2013 | RWTS @retailweek
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda... - Codemotion
Telecom operators need to find operational anomalies in their networks very quickly. This need, however, is shared with many other industries as well so there are lessons for all of us here. Spark plus a streaming architecture can solve these problems very nicely. I will present both a practical architecture as well as design patterns and some detailed algorithms for detecting anomalies in event streams. These algorithms are simple but quite general and can be applied across a wide variety of situations.
Interoperability in a B2B World (NordicAPIS April 2014) - Nordic APIs
The document discusses how B2B integration is evolving from traditional methods like EDI and FTP to use of APIs and web services. It notes that B2B objectives of securely transacting with partners haven't changed, but the technologies used are modernizing from SOA, SOAP, and REST to a focus on APIs. It emphasizes that B2B strategy should include an API strategy and consider both developers and humans. Whiteboarding APIs and dealing with integration challenges are also discussed.
Similar to Building multi-modal recommendation engines using search engines (20)
We introduce the idea that metadata, including project information, data labels, data characteristics and indications of valuable use, can be propagated through a data processing lineage graph. Further, finding examples of significant cooccurrence of propagated and original metadata gives us the basis of an interesting kind of search engine, one that gives interesting recommendations of data given a problem statement, even in a near cold-start situation.
This document discusses progress in using Kubernetes for big data applications. It begins by introducing Kubernetes and explaining its growing popularity due to support from major cloud providers and an open source community. It then discusses some challenges with using containers, particularly around state management. The document proposes using MapR's data platform to provide a global namespace and support for files, streams and tables to address state issues when using Kubernetes for big data applications.
The folk wisdom has always been that when running stateful applications inside containers, the only viable choice is to externalize the state so that the containers themselves are stateless or nearly so. Keeping large amounts of state inside containers is possible, but it’s considered a problem because stateful containers generally can’t preserve that state across restarts.
In practice, externalizing state complicates the management of large-scale Kubernetes-based infrastructure because these high-performance storage systems require separate management. In terms of overall system management, it would be ideal if we could run a software-defined storage system directly in containers managed by Kubernetes, but that has been hampered by lack of direct device access and difficult questions about what happens to the state on container restarts.
Ted Dunning describes recent developments that make it possible for Kubernetes to manage both compute and storage tiers in the same cluster. Container restarts can be handled gracefully without loss of data or a requirement to rebuild storage structures, and access to storage from compute containers is extremely fast. In some environments, it's even possible to implement elastic storage frameworks that can fold data onto just a few containers during quiescent periods or explode it in just a few seconds across a large number of machines when higher speed access is required.
The benefits of systems like this extend beyond management simplicity, because applications can be more Agile precisely because the storage layer is more stable and can be uniformly accessed from any container host. Even better, it makes it a snap to configure and deploy a full-scale compute and storage infrastructure.
Ellen Friedman and I spoke at the ACM meetup about how stream-first architecture can have a big impact and how the logistics of machine learning is a great example of that impact.
This is my half of the presentation.
Tensor Abuse - how to reuse machine learning frameworks - Ted Dunning
This document discusses tensors and their use in machine learning. It explains that tensors were originally developed for physics but are now commonly used in computing to represent important patterns of computation. Tensors make it easier to code numerical algorithms by capturing operations like element-wise computations, outer products, reductions, and matrix/vector products. Additionally, automatic differentiation is now possible using tensor frameworks, which allows gradients to be computed automatically rather than derived by hand. This has significantly advanced machine learning by enabling new optimization algorithms and the training of complex neural networks. Tensor systems also allow the same code to run on CPUs, GPUs, and clusters, improving productivity.
The logistics of machine learning typically take waaay more effort than the machine learning itself. Moreover, machine learning systems aren't like normal software projects so continuous integration takes on new meaning.
How the Internet of Things is Turning the Internet Upside Down - Ted Dunning
This is a wide-ranging talk that goes into how the internet is architected, how that architecture is changing as a result of the internet of things, how the internet of things worked in the 19th century, big data, the open-source community, and how to build time-series databases to make this all possible.
Really.
Apache Kylin - OLAP Cubes for SQL on Hadoop - Ted Dunning
Apache Kylin (incubating) is a new project to bring OLAP cubes to Hadoop. I walk through the project and describe how it works and how users see the project.
Generating privacy-protected synthetic data using Secludy and Milvus - Zilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Letter and Document Automation for Bonterra Impact Management (fka Social Sol... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Best 20 SEO Techniques To Improve Website Visibility In SERP - Pixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Monitoring and Managing Anomaly Detection on OpenShift.pdf - Tosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Taking AI to the Next Level in Manufacturing.pdf - ssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language as well as RubyGems and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Programming Foundation Models with DSPy - Meetup Slides - Zilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx - SitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
A Comprehensive Guide to DeFi Development Services in 2024 - Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Fueling AI with Great Data with Airbyte Webinar - Zilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Skybuffer SAM4U tool for SAP license adoption - Tatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
GraphRAG for Life Science to increase LLM accuracy - Tomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
Note to speaker: Move quickly through the first two slides just to set the tone of familiar use cases but somewhat complicated under-the-covers math and algorithms… You don't need to explain or discuss these examples at this point… just mention one or two.
Talk track: Machine learning shows up in many familiar everyday examples, from product recommendations to listing news topics to filtering out that nasty spam from email….
Talk track: Under the covers, machine learning looks very complicated. So how do you get from here to the familiar examples? Tonight’s presentation will show you some simple tricks to help you apply machine learning techniques to build a powerful recommendation engine.
Note to trainers: the next series of slides starts with a cartoon example just to set the pattern of how to find co-occurrence and use it to find indicators of what to recommend. Of course, real examples require a LOT of user-item interaction history to actually work, so this is just an analogy to get the idea across…
* A history of what everybody has done. Obviously this is just a cartoon because large numbers of users and interactions with items would be required to build a recommender.
* Next step will be to predict what a new user might like…
*Bob is the “new user” and getting apple is his history
*Here is where the recommendation engine needs to go to work…
Note to trainer: you might see if the audience calls out the answer before revealing the next slide…
Now you see the idea of co-occurrence as a basis for recommendation…
*Now we have a new user, Amelia. Like everybody else, she gets a pony… what should the recommender offer her based on her history?
* Pony not interesting because it is so widespread that it does not differentiate a pattern
Note to trainer: This is the situation similar to that in which we started, with three users in our history. The difference is that now everybody got a pony. Bob has apple and pony but not a puppy…yet
*Binary matrix is stored sparsely
*Convert by MapReduce into a binary matrix.
Note to trainer: Whether to consider apple to have occurred with itself is an open question.
*Convert by MapReduce into a binary matrix.
Note to trainer: the diagonal gives the total occurrence count for each item (self to self) and is a distraction/not helpful, so the diagonal here is left blank.
Old joke: all the world can be divided into 2 categories: Scotch tape and non-Scotch tape… This is a way to think about co-occurrence.
Note to trainer: Give students time to offer comments. There's a lot to discuss here.
*Upper left: In the context of A, B occurs the largest number of times, 13 times out of 1013 appearances, with over 100,000 samples. But that's only ~1.3% co-occurrence with A out of all the times B appears.
*Upper right: B occurs in the context of A 33% of the time, but the counts are so small as to be of concern.
*Lower right: the most significant anomaly, in that B still occurs a small number of times out of over 100,000 samples, but it ALWAYS co-occurs with A when it does appear.
*The test Mahout uses for this is the Log Likelihood Ratio (LLR).
*The red circle marks the choice that displays the highest confidence.
Note to trainer: Slide animates with a click to show LLR results. A second click animates the choice that has the highest confidence.
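Note to trainer: for reference, the LLR score for a 2x2 co-occurrence table can be computed in a few lines. This sketch follows the entropy formulation used by Mahout's LogLikelihood class; the counts in main are made up to match the "rare but always together" case above.

    public class Llr {
        private static double xLogX(long x) { return x == 0 ? 0.0 : x * Math.log(x); }

        private static double entropy(long... counts) {
            long sum = 0; double xlx = 0;
            for (long c : counts) { sum += c; xlx += xLogX(c); }
            return xLogX(sum) - xlx;   // un-normalized entropy (N * H)
        }

        /** k11 = A and B together, k12 = A without B, k21 = B without A, k22 = neither. */
        public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
            double rowEntropy = entropy(k11 + k12, k21 + k22);
            double colEntropy = entropy(k11 + k21, k12 + k22);
            double matEntropy = entropy(k11, k12, k21, k22);
            return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
        }

        public static void main(String[] args) {
            // B is rare but ALWAYS co-occurs with A: a large LLR despite tiny counts
            System.out.println(logLikelihoodRatio(5, 1995, 0, 98000));
        }
    }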
Note to trainer: we go back to the earlier matrix as a reminder…
The only important co-occurrence is that puppy follows apple.
*Take that row of the matrix and combine it with all the meta data we might have…
*The important thing to get from the co-occurrence matrix is this indicator. Cool thing: this is analogous to what a lot of recommendation engines do.
*This row forms the indicator field in a Solr document containing meta-data (you do NOT have to build a separate index for the indicators).
Find the useful co-occurrence and get rid of the rest. Sparsify and keep the anomalous co-occurrences.
Note to trainer: take a little time to explore this here and on the next couple of slides. Details enlarged on next slide
*This indicator field is where the output of the Mahout recommendation engine is stored (the row from the indicator matrix that identified significant or interesting co-occurrence).
*Keep in mind that this recommendation indicator data is added to the same original document in the Solr index that contains the meta data for the item in question.
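Note to trainer: if students ask what posting the indicators looks like in code, here is a minimal SolrJ sketch; the collection name, field names, and values are all hypothetical.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexIndicators {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "apple");          // the same document that carries the item meta data
            doc.addField("name", "apple");
            doc.addField("indicators", "puppy");  // anomalous co-occurrences found by Mahout
            solr.add(doc);                        // no separate index is needed for the indicators
            solr.commit();
            solr.close();
        }
    }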
This is a diagnostics window in the LucidWorks Solr index (not the web interface a user would see). It's a way for the developer to do a rough evaluation (laugh test) of the choices offered by the recommendation engine. In other words, do these indicator artists, represented by their indicator IDs, make reasonable recommendations?
Note to trainer: artist 303 happens to be The Beatles. Is that a good match for Chuck Berry?
Here we recap what we have in the different components of the recommender. We start with the meta data for an item stored in the Solr index.
*Here we’ve added examples of indicator data for the indicator field(s) of the document
*Here we show you what information might be in the sample query
Note to trainer: you could ask the class to consider which data is related… for example, the first 3 bullets of the query relate to meta data for the item, not to data produced by the recommendation algorithm. The last 3 bullets refer to data in the sample query related to data in the indicator field(s) that were produced by the Mahout recommendation engine.
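Note to trainer: to make the query side concrete, here is a matching SolrJ sketch in which a user's recent history becomes the query against the indicator field; the field and item names are again hypothetical.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class RecommendQuery {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();
            // The user's history is the query; ordinary relevance ranking does the recommending.
            SolrQuery query = new SolrQuery("indicators:(apple pony)");
            query.setRows(3);
            QueryResponse response = solr.query(query);
            response.getResults().forEach(d -> System.out.println(d.getFieldValue("name")));
            solr.close();
        }
    }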