This document discusses a publish/subscribe model for top-k matching over continuous data streams. It begins by motivating the need to address drawbacks in traditional boolean matching approaches. The research problem is defined as how to define an efficient scoring algorithm that integrates multiple metrics, and how to adapt existing indexing structures to support top-k matching queries over large subscription volumes and high event rates. The document outlines the proposed design, which includes a centralized architecture with personalized subscriptions, relevance scoring, and dual indexing mechanisms.
Learn more about the tools, techniques and technologies for working productively with data at any scale. This presentation introduces the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
Jon Einkauf, Senior Product Manager, Elastic MapReduce, AWS
Alan Priestley, Marketing Manager, Intel and Bob Harris, CTO, Channel 4
Locality Sensitive Hashing (LSH) is a technique for solving near neighbor queries in high dimensional spaces. It works by using random projections to map similar data points to the same "buckets" with high probability, allowing efficient retrieval of nearest neighbors. The key properties required of the hash functions used are that they are locality sensitive, meaning nearby points are hashed to the same value more often than distant points. LSH allows solving near neighbor queries approximately in sub-linear time versus expensive exact algorithms like kd-trees that require at least linear time.
Modern Database Development Oow2008 Lucas JellemaLucas Jellema
This document summarizes an Oracle database expert's presentation on optimal use of Oracle Database 10g and 11g for modern application development. Some key points covered include how modern applications are distributed, global, and service-oriented; how new Oracle database features support cloud computing, analytics, and internationalization; and guidelines for developing applications that leverage the database while maintaining independence.
Optimization of Continuous Queries in Federated Database and Stream Processin...Zbigniew Jerzak
The constantly increasing number of connected devices and sensors results in increasing volume and velocity of sensor-based streaming data. Traditional approaches for processing high velocity sensor data rely on stream processing engines. However, the increasing complexity of continuous queries executed on top of high velocity data has resulted in growing demand for federated systems composed of data stream processing engines and database engines. One of major challenges for such systems is to devise the optimal query execution plan to maximize the throughput of continuous queries.
In this paper we present a general framework for federated database and stream processing systems, and introduce the design and implementation of a cost-based optimizer for optimizing relational continuous queries in such systems. Our optimizer uses characteristics of continuous queries and source data streams to devise an optimal placement for each operator of a continuous query. This fine level of optimization, combined with the estimation of the feasibility of query plans, allows our optimizer to devise query plans which result in 8 times higher throughput as compared to the baseline approach which uses only stream processing engines. Moreover, our experimental results showed that even for simple queries, a hybrid execution plan can result in 4 times and 1.6 times higher throughput than a pure stream processing engine plan and a pure database engine plan, respectively.
Как да разберем интересите на посетителитеNetpeakBG
На 4 април 2014 на традиционната българска SEO конференция SEO аналитик на Netpeak и CEO на Prodvigator Олег Саламаха изнесе лекция на тема "Как да разберем интересите на посетителите", в която засегна:
типове търсения на потребителите (информационни, транзакционни)
как да разберем потребностите (интент) на потребителите
откъде да черпим информация за потребностите на потребителите
Benchmark MinHash+LSH algorithm on SparkXiaoqian Liu
This document summarizes benchmarking the MinHash and Locality Sensitive Hashing (LSH) algorithm for calculating pairwise similarity on Reddit post data in Spark. The MinHash algorithm was used to reduce the dimensionality of the data before applying LSH to further reduce dimensionality and find similar items. Benchmarking showed that MinHash+LSH was significantly faster than a brute force approach, calculating similarities in 7.68 seconds for 100k entries compared to 9.99 billion seconds for brute force. Precision was lower for MinHash+LSH at 0.009 compared to 1 for brute force, but recall was higher at 0.036 compared to vanishingly small for brute force. The techniques were also applied to a real-time streaming
Coherence Overview - OFM Canberra July 2014Joelith
Slides from the July Oracle Middleware Forum held in Canberra, Australia. Provides an overview of Coherence. Check out our blog for more details: ofmcanberra.wordpress.com
Learn more about the tools, techniques and technologies for working productively with data at any scale. This presentation introduces the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
Jon Einkauf, Senior Product Manager, Elastic MapReduce, AWS
Alan Priestley, Marketing Manager, Intel and Bob Harris, CTO, Channel 4
Locality Sensitive Hashing (LSH) is a technique for solving near neighbor queries in high dimensional spaces. It works by using random projections to map similar data points to the same "buckets" with high probability, allowing efficient retrieval of nearest neighbors. The key properties required of the hash functions used are that they are locality sensitive, meaning nearby points are hashed to the same value more often than distant points. LSH allows solving near neighbor queries approximately in sub-linear time versus expensive exact algorithms like kd-trees that require at least linear time.
Modern Database Development Oow2008 Lucas JellemaLucas Jellema
This document summarizes an Oracle database expert's presentation on optimal use of Oracle Database 10g and 11g for modern application development. Some key points covered include how modern applications are distributed, global, and service-oriented; how new Oracle database features support cloud computing, analytics, and internationalization; and guidelines for developing applications that leverage the database while maintaining independence.
Optimization of Continuous Queries in Federated Database and Stream Processin...Zbigniew Jerzak
The constantly increasing number of connected devices and sensors results in increasing volume and velocity of sensor-based streaming data. Traditional approaches for processing high velocity sensor data rely on stream processing engines. However, the increasing complexity of continuous queries executed on top of high velocity data has resulted in growing demand for federated systems composed of data stream processing engines and database engines. One of major challenges for such systems is to devise the optimal query execution plan to maximize the throughput of continuous queries.
In this paper we present a general framework for federated database and stream processing systems, and introduce the design and implementation of a cost-based optimizer for optimizing relational continuous queries in such systems. Our optimizer uses characteristics of continuous queries and source data streams to devise an optimal placement for each operator of a continuous query. This fine level of optimization, combined with the estimation of the feasibility of query plans, allows our optimizer to devise query plans which result in 8 times higher throughput as compared to the baseline approach which uses only stream processing engines. Moreover, our experimental results showed that even for simple queries, a hybrid execution plan can result in 4 times and 1.6 times higher throughput than a pure stream processing engine plan and a pure database engine plan, respectively.
Как да разберем интересите на посетителитеNetpeakBG
На 4 април 2014 на традиционната българска SEO конференция SEO аналитик на Netpeak и CEO на Prodvigator Олег Саламаха изнесе лекция на тема "Как да разберем интересите на посетителите", в която засегна:
типове търсения на потребителите (информационни, транзакционни)
как да разберем потребностите (интент) на потребителите
откъде да черпим информация за потребностите на потребителите
Benchmark MinHash+LSH algorithm on SparkXiaoqian Liu
This document summarizes benchmarking the MinHash and Locality Sensitive Hashing (LSH) algorithm for calculating pairwise similarity on Reddit post data in Spark. The MinHash algorithm was used to reduce the dimensionality of the data before applying LSH to further reduce dimensionality and find similar items. Benchmarking showed that MinHash+LSH was significantly faster than a brute force approach, calculating similarities in 7.68 seconds for 100k entries compared to 9.99 billion seconds for brute force. Precision was lower for MinHash+LSH at 0.009 compared to 1 for brute force, but recall was higher at 0.036 compared to vanishingly small for brute force. The techniques were also applied to a real-time streaming
Coherence Overview - OFM Canberra July 2014Joelith
Slides from the July Oracle Middleware Forum held in Canberra, Australia. Provides an overview of Coherence. Check out our blog for more details: ofmcanberra.wordpress.com
The document summarizes some unexpected uses of the Apache Lucene library beyond traditional text search. In 3 sentences: Lucene can be used as a fast key-value store, to index and store content in various file formats, and for machine learning tasks like classifying unlabeled documents into predefined categories using vector space models and analyzing document similarity. It also discusses using Lucene for record linkage, question answering systems, randomized testing to improve code quality, and performance improvements in newer Lucene versions.
Oracle Coherence is a data grid that provides reliable, scalable universal data access and management. It manages information in a grid environment where multiple servers work together to store, process, and manage data as a service. Coherence uses different topologies like replicated, partitioned, and near caching to distribute data across servers. It supports features like events, queries, and various caching modes like read-through, write-through, and write-behind caching. Coherence improves performance by reducing latency through locality of data and parallel processing. It increases availability through redundancy and removes single points of failure. Scalability is achieved through scale-out functionality and the ability to add more nodes to the Coherence cluster.
Mining of massive datasets using locality sensitive hashing (LSH)J Singh
This document discusses using locality sensitive hashing (LSH) to solve large-scale search problems by clustering similar data points together. It presents an example of using LSH to find Facebook friends with similar interests. The key steps are: (1) representing each user as a vector of interests and computing minhashes, (2) clustering users into buckets based on minhash similarity, and (3) comparing a candidate to others in their bucket to find nearest neighbors. The performance of LSH involves tuning parameters like the number of minhashes and bands to balance false positives and negatives. Implementing LSH on MapReduce can make it scalable to large datasets.
Web application performance correlates with page views. Find out in this session how to maximize the performance of the OCI8 database extension to build fast, scalable web sites and attract users. Includes discussion of Oracle Database 11.2 and the upcoming PHP OCI8 1.4 extension.
This document discusses using locality sensitive hashing (LSH) to detect trips with overlapping routes in large GPS datasets. It describes challenges with noisy GPS data and large search spaces. The approach involves representing trips as sets of area segments, computing Jaccard similarity, and using MinHash to map similar trips to the same buckets with high probability. Multiple hash functions are applied to increase probability. Approaches for efficient distributed processing on Spark are discussed, including reducing network usage. Future work involves migrating to Spark ML APIs and handling streaming inserts.
All marketing aspects including financial and HR policies are explained elaborately . Subsidiaries, value system , competitors. A comparison study among TCS INFOSYS and Wipro is given Briefly.
The Future of BriteCore - Product DevelopmentPhil Reynolds
Over the next five years, BriteCore plans to completely rewrite its software suite. By making the suite more modular, stable, and scalable, BriteCore will be able to support the needs of all insurers globally.
Original: Lean Data Model Storming for the Agile EnterpriseDaniel Upton
This original publication, aimed at data project leaders, describes a set of methods for agile modeling and delivery of an enterprise data warehouse, which together make it quicker to deliver, faster to load, and more easily adaptable to unexpected changes in source data, business rules or reporting/analytic requirements.
With this set of methods, the parts of data warehouse development that used to be the most resistant to sprint-sized / agile work breakdown -- data modeling and ETL -- are now completely agile, so that this tasking, too, can now be sized purely based on customer requirements, rather than the dictates of a traditional data warehouse architecture.
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionDmitry Anoshin
This session will cover building the modern Data Warehouse by migration from the traditional DW platform into the cloud, using Amazon Redshift and Cloud ETL Matillion in order to provide Self-Service BI for the business audience. This topic will cover the technical migration path of DW with PL/SQL ETL to the Amazon Redshift via Matillion ETL, with a detailed comparison of modern ETL tools. Moreover, this talk will be focusing on working backward through the process, i.e. starting from the business audience and their needs that drive changes in the old DW. Finally, this talk will cover the idea of self-service BI, and the author will share a step-by-step plan for building an efficient self-service environment using modern BI platform Tableau.
The document discusses Socialmetrix's evolution of their real-time social media analytics architecture over 4 iterations to meet growing customer and data demands. It describes how they moved from a monolithic to distributed setup using technologies like AWS, Spark, Kafka and Cassandra to improve scalability, costs and resilience while adding new data sources and features. Key lessons included automating deployments, monitoring systems, and using AWS services like S3, EMR and DynamoDB to enable rapid prototyping and reprocessing as needed to support real-time and batch analytics.
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile ApproachKent Graziano
This document discusses using Oracle Business Intelligence Enterprise Edition (OBIEE) and the Data Vault data modeling technique to virtualize a business intelligence environment in an agile way. Data Vault provides a flexible and adaptable modeling approach that allows for rapid changes. OBIEE allows for the virtualization of dimensional models built on a Data Vault foundation, enabling quick iteration and delivery of reports and dashboards to users. Together, Data Vault and OBIEE provide an agile approach to business intelligence.
TopNotch: Systematically Quality Controlling Big Data by David DurstSpark Summit
David Durst of BlackRock presents TopNotch, a system for systematically quality controlling big data. TopNotch uses assertions to define and measure data quality, reuses commands across data sets to maximize efficiency, and institutionalizes knowledge of data sets through plans and commands. It provides a unit testing framework for data with assertions to verify facts, diffs to compare data sets, and views to transform data. This solves the problems of defining data quality, efficiently quality controlling many data sets, and institutionalizing knowledge of data sets.
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsLooker
Infectious Media runs on data. But, as an ad-tech company that records hundreds of thousands of web events per second, they have have to deal with data at a scale not seen by most companies. You can not make decisions with data when people need to write manual SQL only for queries take 10-20 minutes to return. Infectious Media made the switch to Google BigQuery and Looker and now every member of every team can get the data they need in seconds.
Infectious Media shares:
- Why they chose their current stack
- Why faster data means happier customers
- Advantages and practical implications of storing and processing that much data
Check out the recording at https://info.looker.com/h/i/308848878-power-to-the-people-a-stack-to-empower-every-user-to-make-data-driven-decisions
It is a fascinating, explosive time for enterprise analytics.
It is from the position of analytics leadership that the mission will be executed and company leadership will emerge. The data professional is absolutely sitting on the performance of the company in this information economy and has an obligation to demonstrate the possibilities and originate the architecture, data, and projects that will deliver analytics. After all, no matter what business you’re in, you’re in the business of analytics.
The coming years will be full of big changes in enterprise analytics and Data Architecture. William will kick off the fourth year of the Advanced Analytics series with a discussion of the trends winning organizations should build into their plans, expectations, vision, and awareness now.
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017AWS Chicago
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
A Flexible Recommendation System for Cable TVFrancisco Couto
1. The document proposes a flexible recommendation system for cable TV to address issues like information overflow and dissatisfaction from users.
2. It describes extracting implicit feedback from users and engineering contextual features to create a large-scale dataset for learning recommendations.
3. An evaluation of the recommendation system shows that a learning to rank approach with contextual information outperforms other methods in accuracy while maintaining diversity and novelty, though recommending new programs requires more investigation.
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
The document discusses Conviva's Unified Framework (CUF) for analyzing video streaming data in real-time, near real-time, and offline using Spark and Databricks. It summarizes Conviva's platform for measuring video quality of experience across devices and networks. The framework unifies the three analysis stacks onto Spark to share code and insights. Using Databricks improves the offline analysis speed and enables data scientists to independently explore large datasets and build machine learning models.
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas SuravarapuNeo4j
Neo4j allows for faster development and performance compared to relational databases for a content collaboration platform. The graph database reduces complexity, improves query performance, and enables faster development cycles. Visualizing the graph data provides valuable analytics and insights into user behavior to further improve the product.
How Celtra Optimizes its Advertising Platformwith DatabricksGrega Kespret
Leading brands such as Pepsi and Macy’s use Celtra’s technology platform for brand advertising. To inform better product design and resolve issues faster, Celtra relies on Databricks to gather insights from large-scale, diverse, and complex raw event data. Learn how Celtra uses Databricks to simplify their Spark deployment, achieve faster project turnaround time, and empower people to make data-driven decisions.
In this webinar, you will learn how Databricks helps Celtra to:
- Utilize Apache Spark to power their production analytics pipeline.
- Build a “Just-in-Time” data warehouse to analyze diverse data sources such as Elastic Load Balancer access logs, raw tracking events, operational data, and reportable metrics.
- Go beyond simple counting and group events into sequences (i.e., sessionization) and perform more complex analysis such as funnel analytics.
The document summarizes some unexpected uses of the Apache Lucene library beyond traditional text search. In 3 sentences: Lucene can be used as a fast key-value store, to index and store content in various file formats, and for machine learning tasks like classifying unlabeled documents into predefined categories using vector space models and analyzing document similarity. It also discusses using Lucene for record linkage, question answering systems, randomized testing to improve code quality, and performance improvements in newer Lucene versions.
Oracle Coherence is a data grid that provides reliable, scalable universal data access and management. It manages information in a grid environment where multiple servers work together to store, process, and manage data as a service. Coherence uses different topologies like replicated, partitioned, and near caching to distribute data across servers. It supports features like events, queries, and various caching modes like read-through, write-through, and write-behind caching. Coherence improves performance by reducing latency through locality of data and parallel processing. It increases availability through redundancy and removes single points of failure. Scalability is achieved through scale-out functionality and the ability to add more nodes to the Coherence cluster.
Mining of massive datasets using locality sensitive hashing (LSH)J Singh
This document discusses using locality sensitive hashing (LSH) to solve large-scale search problems by clustering similar data points together. It presents an example of using LSH to find Facebook friends with similar interests. The key steps are: (1) representing each user as a vector of interests and computing minhashes, (2) clustering users into buckets based on minhash similarity, and (3) comparing a candidate to others in their bucket to find nearest neighbors. The performance of LSH involves tuning parameters like the number of minhashes and bands to balance false positives and negatives. Implementing LSH on MapReduce can make it scalable to large datasets.
Web application performance correlates with page views. Find out in this session how to maximize the performance of the OCI8 database extension to build fast, scalable web sites and attract users. Includes discussion of Oracle Database 11.2 and the upcoming PHP OCI8 1.4 extension.
This document discusses using locality sensitive hashing (LSH) to detect trips with overlapping routes in large GPS datasets. It describes challenges with noisy GPS data and large search spaces. The approach involves representing trips as sets of area segments, computing Jaccard similarity, and using MinHash to map similar trips to the same buckets with high probability. Multiple hash functions are applied to increase probability. Approaches for efficient distributed processing on Spark are discussed, including reducing network usage. Future work involves migrating to Spark ML APIs and handling streaming inserts.
All marketing aspects including financial and HR policies are explained elaborately . Subsidiaries, value system , competitors. A comparison study among TCS INFOSYS and Wipro is given Briefly.
The Future of BriteCore - Product DevelopmentPhil Reynolds
Over the next five years, BriteCore plans to completely rewrite its software suite. By making the suite more modular, stable, and scalable, BriteCore will be able to support the needs of all insurers globally.
Original: Lean Data Model Storming for the Agile EnterpriseDaniel Upton
This original publication, aimed at data project leaders, describes a set of methods for agile modeling and delivery of an enterprise data warehouse, which together make it quicker to deliver, faster to load, and more easily adaptable to unexpected changes in source data, business rules or reporting/analytic requirements.
With this set of methods, the parts of data warehouse development that used to be the most resistant to sprint-sized / agile work breakdown -- data modeling and ETL -- are now completely agile, so that this tasking, too, can now be sized purely based on customer requirements, rather than the dictates of a traditional data warehouse architecture.
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionDmitry Anoshin
This session will cover building the modern Data Warehouse by migration from the traditional DW platform into the cloud, using Amazon Redshift and Cloud ETL Matillion in order to provide Self-Service BI for the business audience. This topic will cover the technical migration path of DW with PL/SQL ETL to the Amazon Redshift via Matillion ETL, with a detailed comparison of modern ETL tools. Moreover, this talk will be focusing on working backward through the process, i.e. starting from the business audience and their needs that drive changes in the old DW. Finally, this talk will cover the idea of self-service BI, and the author will share a step-by-step plan for building an efficient self-service environment using modern BI platform Tableau.
The document discusses Socialmetrix's evolution of their real-time social media analytics architecture over 4 iterations to meet growing customer and data demands. It describes how they moved from a monolithic to distributed setup using technologies like AWS, Spark, Kafka and Cassandra to improve scalability, costs and resilience while adding new data sources and features. Key lessons included automating deployments, monitoring systems, and using AWS services like S3, EMR and DynamoDB to enable rapid prototyping and reprocessing as needed to support real-time and batch analytics.
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile ApproachKent Graziano
This document discusses using Oracle Business Intelligence Enterprise Edition (OBIEE) and the Data Vault data modeling technique to virtualize a business intelligence environment in an agile way. Data Vault provides a flexible and adaptable modeling approach that allows for rapid changes. OBIEE allows for the virtualization of dimensional models built on a Data Vault foundation, enabling quick iteration and delivery of reports and dashboards to users. Together, Data Vault and OBIEE provide an agile approach to business intelligence.
TopNotch: Systematically Quality Controlling Big Data by David DurstSpark Summit
David Durst of BlackRock presents TopNotch, a system for systematically quality controlling big data. TopNotch uses assertions to define and measure data quality, reuses commands across data sets to maximize efficiency, and institutionalizes knowledge of data sets through plans and commands. It provides a unit testing framework for data with assertions to verify facts, diffs to compare data sets, and views to transform data. This solves the problems of defining data quality, efficiently quality controlling many data sets, and institutionalizing knowledge of data sets.
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsLooker
Infectious Media runs on data. But, as an ad-tech company that records hundreds of thousands of web events per second, they have have to deal with data at a scale not seen by most companies. You can not make decisions with data when people need to write manual SQL only for queries take 10-20 minutes to return. Infectious Media made the switch to Google BigQuery and Looker and now every member of every team can get the data they need in seconds.
Infectious Media shares:
- Why they chose their current stack
- Why faster data means happier customers
- Advantages and practical implications of storing and processing that much data
Check out the recording at https://info.looker.com/h/i/308848878-power-to-the-people-a-stack-to-empower-every-user-to-make-data-driven-decisions
It is a fascinating, explosive time for enterprise analytics.
It is from the position of analytics leadership that the mission will be executed and company leadership will emerge. The data professional is absolutely sitting on the performance of the company in this information economy and has an obligation to demonstrate the possibilities and originate the architecture, data, and projects that will deliver analytics. After all, no matter what business you’re in, you’re in the business of analytics.
The coming years will be full of big changes in enterprise analytics and Data Architecture. William will kick off the fourth year of the Advanced Analytics series with a discussion of the trends winning organizations should build into their plans, expectations, vision, and awareness now.
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017AWS Chicago
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
A Flexible Recommendation System for Cable TVFrancisco Couto
1. The document proposes a flexible recommendation system for cable TV to address issues like information overflow and dissatisfaction from users.
2. It describes extracting implicit feedback from users and engineering contextual features to create a large-scale dataset for learning recommendations.
3. An evaluation of the recommendation system shows that a learning to rank approach with contextual information outperforms other methods in accuracy while maintaining diversity and novelty, though recommending new programs requires more investigation.
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
The document discusses Conviva's Unified Framework (CUF) for analyzing video streaming data in real-time, near real-time, and offline using Spark and Databricks. It summarizes Conviva's platform for measuring video quality of experience across devices and networks. The framework unifies the three analysis stacks onto Spark to share code and insights. Using Databricks improves the offline analysis speed and enables data scientists to independently explore large datasets and build machine learning models.
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas SuravarapuNeo4j
Neo4j allows for faster development and performance compared to relational databases for a content collaboration platform. The graph database reduces complexity, improves query performance, and enables faster development cycles. Visualizing the graph data provides valuable analytics and insights into user behavior to further improve the product.
How Celtra Optimizes its Advertising Platformwith DatabricksGrega Kespret
Leading brands such as Pepsi and Macy’s use Celtra’s technology platform for brand advertising. To inform better product design and resolve issues faster, Celtra relies on Databricks to gather insights from large-scale, diverse, and complex raw event data. Learn how Celtra uses Databricks to simplify their Spark deployment, achieve faster project turnaround time, and empower people to make data-driven decisions.
In this webinar, you will learn how Databricks helps Celtra to:
- Utilize Apache Spark to power their production analytics pipeline.
- Build a “Just-in-Time” data warehouse to analyze diverse data sources such as Elastic Load Balancer access logs, raw tracking events, operational data, and reportable metrics.
- Go beyond simple counting and group events into sequences (i.e., sessionization) and perform more complex analysis such as funnel analytics.
The story map plans a drone delivery service targeting professional customers like Patrice the deli owner. The MVP focuses on delivering small packages efficiently and safely within cities. Subsequent releases expand the service to more customers and locations while ensuring regulatory approval and community acceptance through minimal noise and environmental impact. The core value is fast, personalized delivery that saves customers time and money.
Cost Control Across Cloud, On-Premise and VM Computers by Mark Lavi, Calm.ioDocker, Inc.
Anecdotal numbers suggest that more than 40% compute resources are under utilized -- from unused cloud instances to virtual machines running on bare-metal. Hundreds of QA & Dev nodes to thousands of production instances could be shutdown, and brought back to the same state on demand. That's what cloud is about -- agility and efficiency, but our on-premise datacenter habits have migrated to the cloud as well.
Calm's DevOps automation platform helps fix our old habits. Calm provides a single pane of glass across cloud and on-premise, integrating with Chef, Puppet and Docker ecosystems. The single pane of glass enables orchestration, cost-control and on-demand provisioning.
Managing Large Amounts of Data with SalesforceSense Corp
Critical "design skew" problems and solutions - Engaging Big Objects, MuleSoft, Snowflake and Tableau at the right time
Salesforce’s ability to handle large workloads and participate in high-consumption, mobile-application-powering technologies continues to evolve. Pub/sub-models and the investment in adjacent properties like Snowflake, Kafka, and MuleSoft, has broadened the development scope of Salesforce. Solutions now range from internal and in-platform applications to fueling world-scale mobile applications and integrations. Unfortunately, guidance on the extended capabilities is not well understood or documented. Knowing when to move your solution to a higher-order is an important Architect skill.
In this webinar, Paul McCollum, UXMC and Technical Architect at Sense Corp, will present an overview of data and architecture considerations. You’ll learn to identify reasons and guidelines for updating your solutions to larger-scale, modern reference infrastructures, and when to introduce products like Big Objects, Kafka, MuleSoft, and Snowflake.
This document discusses moving data warehousing to the cloud with Pivotal Greenplum. It recommends obeying the laws of data gravity by leaving data where it is generated, adopting a software data warehouse that can run anywhere, and separating compute and storage. It positions Greenplum as a massively parallel, open source data warehouse that can run on-premises, in the cloud, or in hybrid environments with real separation of compute and storage. The document provides examples of customers successfully using Greenplum in the cloud at AWS for analytics, reporting, and migrating workloads from legacy data warehouses.
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Lucidworks
This document summarizes S&P Global's use of Solr for search capabilities across their large datasets. It discusses how S&P Global indexes over 50 million documents into Solr monthly and handles over 5 million queries per week. It outlines challenges faced with an on-premise Solr deployment and how migrating to Solr Cloud helped address issues like performance, availability, and scalability. Next steps discussed include improving relevancy through data science, continuing to leverage new Solr features, and exploring ways to integrate machine learning into search capabilities.
Similar to [Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams (20)
The document discusses privacy in social networks and the design of a social media simulator called MCAS. MCAS aims to predict information cascades across platforms using endogenous and exogenous signals. Scenario 1 uses only endogenous Reddit data to predict discussion thread growth, evaluating against baselines. Scenario 2 predicts Twitter activity using both endogenous social media discussions and exogenous news articles. The goal is to generate realistic simulations for applications like disaster response and trend analysis.
Drivers of Polarized Discussions on Twitter during Venezuela Political CrisisSameera Horawalavithana
Social media activity is driven by real-world events (natural disasters, political unrest, etc.) and by processes within the platform itself (viral content, posts by influentials, etc). Understanding how these different factors affect social media conversations in polarized communities has practical implications, from identifying polarizing users to designing content promotion algorithms that alleviate polarization. Based on two datasets that record real-world events (ACLED and GDELT), we investigate how internal and external factors drive related Twitter activity in the highly polarizing context of the Venezuela’s political crisis from early 2019. Our findings show that antagonistic communities react differently to different exogenous sources depending on the language they tweet. The engagement of influential users within particular topics seem to match the different levels of polarization observed in the networks.
https://dl.acm.org/doi/10.1145/3447535.3462496
Twitter Is the Megaphone of Cross-platform Messaging on the White HelmetsSameera Horawalavithana
Abstract. This work provides a quantitative analysis of the cross- platform disinformation campaign on Twitter against the Syrian Civil Defence group known as the White Helmets. Based on four months of Twitter messages, this article analyzes the promotion of urls from differ- ent websites, such as alternative media, YouTube, and other social media platforms. Our study shows that alternative media urls and YouTube videos are heavily promoted together; fact-checkers and official government sites are rarely mentioned; and there are clear signs of a coordinated campaign manifested through repeated messaging from the same user accounts. paper: https://link.springer.com/chapter/10.1007/978-3-030-61255-9_23
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Sameera Horawalavithana
The document discusses research into quantifying the relationship between a graph's properties and its vulnerability to deanonymization attacks. It presents three research questions: 1) How topological properties affect attacks, 2) How node attribute placement affects vulnerability, and 3) How diffusion processes impact vulnerability. The methodology section outlines generating synthetic and real-world graphs, modeling attacks, and measuring success. Key findings include some topological properties like transitivity and assortativity impacting privacy independent of degree distribution. Node attribute diversity increases vulnerability more than attribute homophily. Faster spreading diffusions see higher vulnerability growth. The implications are discussed for data owners and privacy researchers.
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...Sameera Horawalavithana
Presented at Machine Learning in Network Science, co-located with NetSci'19, VT.
Abstract:
We introduce a generative/discriminative mechanism to predict the temporal dynamics of information cascade with the support of probabilistic models and Long-Short Term Memory (LSTM) neural networks. Our approach is to train a machine-learning algorithm to act as a filter for identifying realistic cascades for a particular social platform from a large pool of generated cascades. Our goal is to select the most realistic cascade with an accurate de-construction of user activity time-line. As an example in Twitter, we predict which user performs a retweet, and when she does such, in addition to the underlying cascade structure.
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...Sameera Horawalavithana
This document describes a study on the risk of node re-identification in labeled social graphs. It presents a motivating scenario where a data scientist tries to re-identify nodes in an anonymized network by mapping it to another public dataset. The study aims to quantify how much node attributes improve re-identification compared to just network structure, and how attribute placement affects vulnerability. It generates synthetic networks, simulates attacks using machine learning on node features, and measures increased vulnerability from attributes. Key findings are that vulnerability rises with population diversity but not with attribute homophily, and topological risks exceed those from attributes alone.
This document describes a project to detect duplicate documents from the Hoaxy dataset using linguistic features and propagation dynamics on Twitter. It discusses collecting documents and diffusion networks from Hoaxy, preprocessing text, using LDA, LSI, and HDP for document clustering, extracting features on propagation dynamics, and training a random forest classifier on the clustered documents and features. The random forest achieves an F1-score of 0.72 for LDA, 0.75 for LSI, and 0.71 for HDP clusters in determining if document pairs are duplicates. The approach aims to predict topics of "dead" web pages using their diffusion networks on Twitter.
Invited guest lecture at UCSC for MSc. Distributed System, Talk includes a recap of stream processing buzzwords with an introduction to dynamic graph streams.
Special Thanks goes to Martin Kleppman (LinkedIn) and Vasia Kalavri (KTH) for the knowledge hub
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation Sameera Horawalavithana
The presentation done at ACM/IFIP/USENIX Middleware workshop 2015
Adaptive and Reflective Middleware (ARM) is the main forum for researchers on adaptive and reflective middleware platforms and systems. It was the first ever workshop to be held with the ACM/IFIP/USENIX International Middleware Conference, dating back to the year 2000, in Palisades, NY (Middleware 2000) and has been running every year since.
Authors:
Y.S.Horawalavithana
D.N.Ranasinghe
http://dl.acm.org/citation.cfm?id=2834975
Citation:
Y. S. Horawalavithana and D. N. Ranasinghe. 2015. An Efficient Incremental Indexing Mechanism for Extracting Top-k Representative Queries Over Continuous Data-streams. In Proceedings of the 14th International Workshop on Adaptive and Reflective Middleware (ARM 2015). ACM, New York, NY, USA, , Article 8 . DOI=http://dx.doi.org/10.1145/2834965.2834975
Elasticsearch is an open-source search and analytics engine that allows for searching both structured and unstructured data in (near) real-time. The document discusses how Elasticsearch uses Lucene's inverted index architecture under the hood and can be used as a plug-and-play replacement for other search engines. It then provides examples of how the company uses Elasticsearch for centralized logging, log monitoring, network monitoring, and generating comparison reports by modeling data as graphs in Elasticsearch.
The document describes how to generate combinations of items according to a Zipf distribution. It explains that the Zipf distribution assigns probabilities to ranks, with the highest ranked item having the greatest probability and each subsequent rank having less probability. It then shows how to calculate the Zipf probabilities for a set of 5 items and generate all possible combinations of those items weighted by their Zipf probabilities.
This document discusses publish/subscribe systems and top-k publish/subscribe systems. It provides background on publish/subscribe communication paradigms and taxonomies. It then discusses requirements for top-k publish/subscribe systems to limit the number of matching publications delivered to k best within a time window. Several research papers on distributed top-k publish/subscribe systems are summarized, including their approaches to ranking publications, computing top-k over sliding windows, and delivering top-k results.
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingSameera Horawalavithana
This document summarizes a presentation on Spotify's large-scale, low-latency peer-to-peer music streaming system. Spotify uses a hybrid client-server and P2P approach to stream over 8 million tracks to 24 million users. The key aspects covered include Spotify's custom protocol, unstructured P2P overlay, and evaluation of the system's performance based on real data. Evaluation results showed median playback latencies of 265ms, stutter rates below 1%, and that the system was able to efficiently locate peers and was not severely impacted by client churn.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
20 Comprehensive Checklist of Designing and Developing a WebsitePixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
6. Drawbacks in Boolean Matching
Traditional
Publish/SubscribePublish
Subscribe
Notify
Bob likes to update about
smartphones. He prefers to get
notify on products from Verizon &
AT&T.
But Ideally Bob prefers to get
notify on products from Verizon
only if there are not enough
notifications from AT&T.
7. Drawbacks in Boolean Matching (Contd.)
• Subscriptions & matching publications are
considered as equally important.
• Publications are delivered to Bob
whenever there is a satisfied
subscription.
• Bob may be either overloaded with
publications or receive too few
publications over time,
• Impossible to compare different matching
publications with respect to Bob’s
subscriptions as ranking functions are not
defined, and
• Partial matching between subscriptions and
publications is not supported.
8. Top-k Publish/Subscribe
• Expressive stateful query processing systems
• to overcome the drawbacks identified in traditional pub/sub systems
• User defined parameter k restricts the delivered publications
• Pub/Sub Matching?
• Top-k pub/sub scoring or ranking
• Pub/Sub Indexing?
• Indexing to support personalized subscriptions
• Indexing to support continuous Top-k publications retrieval
9. Outline
• Motivation
• Research Problem
• Re-cap proposal defense!
• Design & Architecture
• Related Work
• Contribution
• Scoring Algorithm
• Query Personalization
• Events Novelty
• Relevancy + Freshness
• MAXDIVREL Diversity
• Dual-Indexing mechanism
• To Do List
10. Research Goal
How to alleviate the Information Overload
problem based on publish/subscribe
communication paradigm which is augmented
by different scoring mechanisms over
continuous information-streams?
11. Research Problem
1. How to define an efficient scoring algorithm by integrating query
independent & dependent score metrics taken into account?
- Relevance, Freshness & Diversity
2. How to adapt existing indexing data structures used in state-of-the-
art publish/subscribe systems under
a) large subscription volume,
b) high event rate(velocity) and,
c) the variety of subscribable attributes,
to support top-k matching queries?
15. Why not client centered Top-k matching with
Traditional pub/sub layer on Top?
• In subscriber point of view,
• We support partial matching between subscriptions & publications
• Personalized subscriptions
• We address the overlapping interest of many subscribers
• Experiment with system resiliency: Retrieve Top-k results on domain knowledge
• We can have large volume of subscription space with variety of attributes through an
efficient in-memory indexing mechanism
• In publisher point of view,
• Depend on the order of incoming matched publications
16. Outline
• Motivation
• Research Problem
• Re-cap proposal defense!
• Design & Architecture
• Related Work
• Contribution
• Scoring Algorithm
• Query Personalization
• Events Novelty
• Relevancy + Freshness
• MAXDIVREL Diversity
• Dual-Indexing mechanism
• To Do List
20. Outline
• Motivation
• Research Problem
• Re-cap proposal defense!
• Design & Architecture
• Related Work
• Contribution
• Scoring Algorithm
• Query Personalization
• Events Novelty
• Relevancy + Freshness
• MAXDIVREL Diversity
• Dual-Indexing mechanism
• To Do List
21. Comparison: Subscription (Contd.)
Typical Pub/Sub
• Just matching a publication
whenever there’s a satisfied
subscription
Top-k Pub/Sub
• A publication is scored against a
satisfied subscription space
Item = Smartphone
Item = Smartphone
Carrier = AT&T
Carrier = AT&T
Item = Smartphone
Carrier = AT&T
Item = Smartphone
Item = Smartphone
Carrier = AT&T
Carrier = AT&T
Item = Smartphone
Carrier = AT&T
22. Comparison: Subscription
Typical Pub/Sub
• All subscriptions are considered
equally
• No personalized subscriptions
Top-k Pub/Sub
• Subscribers can express some
events are more important than
others by ranking subscriptions
• can have a degree of user interest
over subscription space
• limit redundancy by avoiding
results with overlapping content
• “AT&T Smartphone" include in
“Smartphone“
• Make rare events visible
23. How to assign preference over subscription?
Quantitative approach
• Assign interest to each
subscription
Qualitative approach
• Specify the interest between two
subscriptions
Item = Smartphone
Item = Smartphone
Carrier = AT&T
Carrier = AT&T
0.7
0.5
0.9
Item = Smartphone
Item = Smartphone
Carrier = AT&T
Carrier = AT&T
>
<
24. Personalized subscriptions
Explicit Global Ordering Explicit Local Ordering Explicit Local + Implicit Global Ordering
Subscription Preferences Attribute Preferences Attribute-Subscription Preferences
Carrier = AT&T
OS = Android
0.9
Carrier = Verizon
OS = iOS
0.7
>
Carrier = AT&T
Carrier = Verizon
>
OS = iOS
OS = Android
<
Carrier = AT&T (0.6)
OS = Android (0.3)
Carrier = Verizon (0.2)
OS = iOS (0.5)
Carrier = AT&T (0.3)
OS = iOS (0.7)
Brand = Apple (0.4)
25. We Propose: Relating Attributes
a) Subscription covering b) Subscription Merging c) Relating Attributes
attribute1
attribute2
attribute1
attribute2
attribute1
attribute2
S1
S2
S3
S1 S2
S3
32. Subscription Indexing
• Can have a performance bottleneck when,
• Matching between publication & user personalized subscription space.
• Extensively studied in pub/sub community
• Don’t re-invent the wheel
• We extend an existing indexing mechanism to,
• Apply our personalized subscription model
33. Decision Making
opIndex
• Dynamically adopt to the variety of
attributes
• Two-space partitioning
• Attribute & operator
• Can support a wide range of operators
• Ex: Regular Expression
• Perform better when subscription
space become larger
• index construction time,
• memory cost and,
• query processing time.
k-Index, BE* Index
• Can’t deal with the variety of attributes
• Three-space partitioning
• Subscription size, Attribute & Value
• Supports only a small set of operators
• Are outperformed by opIndex
34. Outline
• Motivation
• Research Problem
• Re-cap proposal defense!
• Design & Architecture
• Related Work
• Contribution
• Scoring Algorithm
• Query Personalization
• Events Novelty
• Relevancy + Freshness
• MAXDIVREL Diversity
• Dual-Indexing mechanism
• To Do List
35. Events Novelty
• Motivation:
• A popular news pub/sub system like Google news maintain publications
within last 30 days, but most of the time produce top-k results within last day
or two.
• Most important in Top-k computation,
• Demonstration using time policy to compute Top-k results
36. When to compute Top-k results?
• Our matching model deal with continuous data-stream
• Impossible to filter an unbounded stream
• We should have a time policy to compute Top-k results per subscription
I. Continuous
II. Periodic
III. Sliding Windows
37. Sliding Window Top-k computation
• Compute top-k results based on publications within moving windows
(time or events) e.g. w=2
P1 P2 P3 P4 P5 P6 P7 P8 P9 …
T 2T 3T 4T 5T
P1 P2 P4
38. Remark: Sliding Window
• Adaptive than continuous & periodic
• when w = 1; act as continuous
• when w = T; act as periodic
• But here w is Flexible
• We can dynamically change w based on event arrival rate
• Can address streams other than Poisson distribution
• Without losing generality, our model based on sliding event windows
• But when event window becomes larger?
39. Freshness: Time Decaying
Problem
• Older publications may prevent the newer publications to enter into
top-k results
Solution
• Lease or Expire using a time decay function
• We combine Freshness with relevancy score
40. Time Decaying Function
• We consider “Forward decay” to compute the publication age
• So we don’t have to compute the decay score each window
41. Outline
• Motivation
• Research Problem
• Re-cap proposal defense!
• Design & Architecture
• Related Work
• Contribution
• Scoring Algorithm
• Query Personalization
• Events Novelty
• Relevancy + Freshness
• MAXDIVREL Diversity
• Dual-Indexing mechanism
• To Do List
43. Outline
• Motivation
• Research Problem
• Re-cap proposal defense!
• Design & Architecture
• Related Work
• Contribution
• Scoring Algorithm
• Query Personalization
• Events Novelty
• Relevancy + Freshness
• MAXDIVREL Diversity
• Dual-Indexing mechanism
• To Do List
44. Event Diversity
• In Top-k publish/subscribe,
• getting a diverse results within Top-k publications play a major role
• As an example, Bob would like to get notify about smart-phones from
the carrier=AT&T and brand=HTC.
• Without the notion of diversity, delivered top-k publications may have much
similarity between them.
• Even though, the received publications are personalized, Bob may recognize
such a system as not effective.
46. Dissimilarity
• Choosing to deliver items that are dissimilar to each other
• P-dispersion problem
• Selecting k items out of n, such that, the average pairwise distance between
the selected items is maximized
• NP-Hard
• k-diversity problem
• Is based on p-dispersion problem
• Rely on heuristics to solve large instance of the problem
47. K-diversity problem
• Let P be the set of matching publications; |P| = n, and given a
distance metric d to express the dissimilarity between publication
points, finding the diverse set 𝑆∗of P such that
𝑆∗ = arg max 𝑓 𝑆, 𝑑 ;
49. Not to reinvent the wheel
• Most diversity definitions are aligned with,
• P-dispersion problem
• Here, we do consider to combine diversity & relevancy as,
• mono-objective formulation
• Not more based on p-dispersion
50. Beyond Diversity & Relevance
• We select a set of diverse set which,
• increase the "global" importance of a selected publication, and
• reduce the "global" importance of a non-selected publication.
• We define the problem in static version,
• MAXDIVREL k-diversity problem
• We define the problem in continuous version,
• MAXDIVREL continuous k-diversity problem
53. MAXDIVREL k-diversity problem
• Can map into Top-k representative query problem in graph databases
which is NP-Hard
• Specialized version of set cover problem
• Can prove!
55. MAXDIVREL Continuous k-diversity problem
• Continuity Requirements
• Durability
• an item is selected as diversified in 𝑖 𝑡ℎ window may still have the chance to be in
𝑖 + 1 𝑡ℎ window if it's not expired & other valid items in 𝑖 + 1 𝑡ℎwindow are failed to
compete with it.
• Order
• Publication stream follow the chronological order
• We avoid the selection of item j as diverse later, when we already selected an item i
which is not-older than j.
57. Outline
• Motivation
• Research Problem
• Re-cap proposal defense!
• Design & Architecture
• Related Work
• Contribution
• Scoring Algorithm
• Query Personalization
• Events Novelty
• Relevancy + Freshness
• MAXDIVREL Diversity
• Dual-Indexing mechanism
• To Do List
58. MAXDIVREL continuous k-diversity problem
• Apply MAXDIVREL k-diversity Greedy algorithm in each window
• Time complexity
• When re-calculating neighborhood
• We propose an incremental MAXDIVREL algorithm
• Calculate neighborhood at window 𝑖 + 1 𝑡ℎ using already calculated neighborhood at
window 𝑖 𝑡ℎ
• Indexing publications at each window
• Combine with subscription indexing
• Dual-indexing mechanism!
59. Outline
• Motivation
• Research Problem
• Re-cap proposal defense!
• Design & Architecture
• Related Work
• Contribution
• Scoring Algorithm
• Query Personalization
• Events Novelty
• Relevancy + Freshness
• MAXDIVREL Diversity
• Dual-Indexing mechanism
• To Do List
60. To Do List: Implementation
• Indexing based on inverted-index
• Why inverted index?
• Centralized, will try Cloud Based
• Using message broker system E.g. RabbitMQ, ZeroMQ, ActiveMQ
• Why RabbitMQ?
61. To Do List: Evaluation
• Multiple Directions
• Zipf property
• Using synthetic & real data-set (e.g. zipf distribution tool, Ebay, AOL Query logs)
• Algorithm efficiency
• Experiment with,
• The volume of subscriptions
• The variety of publications
• The arrival rate of publications (e.g. dynamic sliding window model)
• Using POIKILO evaluation tool
• Dual-Indexing Performance & Scalability
• Experiment with,
• Index construction time at each window
• Memory cost
• Query processing time (e.g. Neighborhood calculation)
Hence, it addresses the efficient processing of top-k queries over multiple data streams which filters out irrelevant data stream objects, and delivers only top-k objects relevant to user interests.
Traditional: Users can only express their interest over a set of predicates or expressions in the subscription.