In 2019, the 30th edition of the International Symposium on Software Reliability Engineering (ISSRE 2019) took place in Berlin, Germany, October 28-31. The first edition took place in Washington, DC, USA, in 1990.
To celebrate this important anniversary, we promoted an initiative, "Highlights from 30 years of ISSRE", to identify ISSRE's most influential papers. We looked for ISSRE papers that have had great influence and impact in the community. The goal of the initiative is to remember those papers and their authors, which together tell a good part of the story of our conference.
College Students Using Einstein Analytics to Analyze Admissions Data – Salesforce.org
Presentation from Salesforce.org Higher Ed Summit 2018 by: Nathan Baker, Alex Hunter, Joe Schuette, and Mitch Whedon.
Three students from Taylor University (IN) showcase how they have used Salesforce and Einstein Analytics to help enrollment management make educated, strategic decisions about the admissions funnel. The presenters take part in the Data Analytics Team at Taylor, a co-curricular opportunity that builds on the school curriculum while providing valuable hands-on experience. On the Data Analytics Team, these students work with Salesforce data and have learned how to create reports and analyze their contents to make strategic decisions on behalf of the university. The presentation walks through the stages of the admissions funnel and shows how Einstein Analytics has helped identify the trends and patterns driving students from one stage to the next. Furthermore, they share how the analysis has helped optimize the communication strategy with prospective students.
Watch a recording of this presentation: https://youtu.be/E27K8wtki6o
This presentation was provided by Suzanna Conrad of the California State University - Sacramento during the NISO webinar, Using Analytics to Extract Value from the Library's Data, held on September 12.
ACM ICTIR 2019 Slides - Santa Clara, USA – Iadh Ounis
Talk entitled "Unifying Explicit and Implicit Feedback for Rating Prediction and Ranking Recommendation Tasks", presented at ACM ICTIR 2019, Santa Clara, CA, USA.
Reference:
Jadidinejad, A., Macdonald, C. and Ounis, I. (2019) Unifying Explicit and Implicit Feedback for Rating Prediction and Ranking Recommendation Tasks. In: 5th ACM SIGIR International Conference on the Theory of Information Retrieval, Santa Clara, CA, USA, 02-05 Oct 2019.
URL: https://dl.acm.org/citation.cfm?id=3344225
CIRPA 2016: It's Show Time: Are Your Data Ready to be the "Next Big Thing"? – Stephen Childs
Growing interest in using “administrative data” for research, and government adoption of open data policies, are putting institutional data practices in the spotlight. Are your data ready for prime time? Do you have robust policies on sharing, access and archiving? Are your data well documented, with clear policies on governance? Will the data be re-usable by others, to add to the body of knowledge in the area? This session will provide an overview of the principles and practices of data management, with a case study that examines one institution’s experience in making its data available to the EPRI tax linkage project.
Tips and Tricks to be an Effective Data Scientist – Lisa Cohen
Data science is an evolving field that requires a diverse skill set. From analytical techniques to career advice, this talk is full of practical tips that you can apply immediately in your job.
Data science concept by Raj Krishna Paul – Subir Paul
Get a clear concept of what data science is, why it is an emerging area of research and jobs, and how to get started.
Developed by Rajkrishna Paul, BS Engg (USA), Technical Lead, Verizon Data Services
A solutions-based approach, illustrated by case studies, which show how inferences can be improved from surveys administered to biased, low response rate and non-probability samples.
It addresses how to improve the accuracy of the survey estimates we generate from poorer quality and non-probability samples.
Using data from twenty-one speed dating events to create a new dating app, we can connect two individuals based on their interest and preferences thus expediting the dating process. The app will direct the user to rate other users’ profiles based on not only the user’s image but also how much he/she likes the other user based on their profile information. The profiles will include demographic information, shared Interests, and other attributes such as fun factor, attractiveness, etc. After evaluating each user’s preferences and rating, the app will suggest partners who have similar interests and matching preferences.
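A toy sketch of how such preference-based matching might work (the attribute vectors and numbers below are illustrative, not from the dataset): each user is represented as a vector of attribute ratings, and candidate partners are ranked by cosine similarity.

```python
import numpy as np

def top_matches(ratings, user, k=3):
    """Suggest partners whose preference vectors are most similar
    to the given user's, by cosine similarity (self excluded)."""
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    unit = ratings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[user]          # cosine similarity to `user`
    sims[user] = -np.inf              # never match a user with themselves
    return np.argsort(sims)[::-1][:k]

# Rows: users; columns: rated attributes (fun factor, attractiveness, shared interests)
prefs = np.array([[9, 7, 8], [8, 7, 9], [2, 9, 1], [9, 6, 8]], dtype=float)
print(top_matches(prefs, user=0, k=2))  # two most similar users to user 0
```

A real app would combine this item-style similarity with the mutual-preference ratings described above, so that a match requires interest in both directions.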
Distribution Problems in Recommender Systems – Daniel McEnnis
Traditional machine learning and collaborative filtering pay little attention to the sources of the data they use. The differences between the distribution backing the learning data, the distribution backing the algorithm output, and the distribution backing the ground truth are often completely different and almost unrelated to the target distribution: true ratings across all items for every user.
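A made-up toy illustration of the kind of mismatch described (not from the talk): when users are likelier to log ratings for items they liked, the observed rating distribution systematically diverges from the true one.

```python
import random

random.seed(0)
# True ratings for all (user, item) pairs: uniform 1..5.
true_ratings = [random.randint(1, 5) for _ in range(100_000)]

# Observed data: a rating r is logged with probability proportional to r,
# modelling users who mostly rate items they enjoyed (selection bias).
observed = [r for r in true_ratings if random.random() < r / 5]

true_mean = sum(true_ratings) / len(true_ratings)
obs_mean = sum(observed) / len(observed)
print(f"true mean {true_mean:.2f}, observed mean {obs_mean:.2f}")
```

A model trained and evaluated only on `observed` never sees the target distribution (true ratings across all items for every user), which is exactly the gap the abstract warns about.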
DSD-INT 2021 The choice - A workshop for modelers – Deltares
Presentation by Lieke Melsen (Wageningen University), Janneke Remmers (Wageningen University) and Carine Wesselius (Deltares), at The choice - A workshop for modelers, during Delft Software Days - Edition 2021, Wednesday, 17 November 2021.
Commonalities in LibQUAL+® (Dis)satisfaction: An international trend? – Selena Killick
International research presented in 2013 identified a commonality in library customer satisfaction as measured by the LibQUAL+® survey methodology. The findings established a statistically significant link between customer satisfaction with the Information Control dimension and satisfaction overall, and between customer dissatisfaction with the Affect of Service dimension and dissatisfaction overall. The findings concluded that both information resources and customer service affect the overall opinion of the library service for all customer groups.
Is this unique to European libraries, or is it an international trend? The research has now been replicated with the LibQUAL+® survey results from all ARL participants in 2013.
DoWhy Python library for causal inference: An End-to-End tool – Amit Sharma
As computing systems are more frequently and more actively intervening in societally critical domains such as healthcare, education, and governance, it is critical to correctly predict and understand the causal effects of these interventions. Without an A/B test, conventional machine learning methods, built on pattern recognition and correlational analyses, are insufficient for causal reasoning.
Much like machine learning libraries have done for prediction, "DoWhy" is a Python library that aims to spark causal thinking and analysis. DoWhy provides a unified interface for causal inference methods and automatically tests many assumptions, thus making inference accessible to non-experts.
For a quick introduction to causal inference, check out amit-sharma/causal-inference-tutorial. We also gave a more comprehensive tutorial at the ACM Knowledge Discovery and Data Mining (KDD 2018) conference: causalinference.gitlab.io/kdd-tutorial.
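As a minimal numpy sketch of the backdoor adjustment that libraries like DoWhy automate (the data-generating process and all coefficients here are invented for illustration): a confounder Z drives both treatment and outcome, so the naive correlational estimate of the treatment effect is biased, while stratifying on Z recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Confounder Z affects both treatment T and outcome Y.
z = rng.integers(0, 2, n)
t = (rng.random(n) < 0.2 + 0.6 * z).astype(int)   # Z makes treatment likelier
y = 1.0 * t + 2.0 * z + rng.normal(0, 0.1, n)     # true effect of T is 1.0

# Naive (correlational) estimate: difference of group means.
naive = y[t == 1].mean() - y[t == 0].mean()

# Backdoor adjustment: average the within-stratum effect, weighted by P(Z).
adjusted = sum(
    (y[(t == 1) & (z == v)].mean() - y[(t == 0) & (z == v)].mean()) * (z == v).mean()
    for v in (0, 1)
)
print(f"naive {naive:.2f}, adjusted {adjusted:.2f}")  # adjusted is close to 1.0
```

DoWhy's contribution is to make this workflow explicit (model the graph, identify the estimand, estimate, then refute) and to test the underlying assumptions automatically rather than leaving them implicit as above.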
Learning Analytics – From Reactive to Predictive – LearningCafe
Overview
While the term Learning Analytics has been around for some time, it has mostly been restricted to data collected from Learning Management Systems, such as completions data. Learning analytics has to evolve beyond simply reporting to making predictions. We discuss current trends in Learning Analytics and how xAPI and Artificial Intelligence will impact the field.
Panelists
Sarajit Poddar – Workforce Planning & Analytics SME at Ericsson
Vanessa Blewitt – Global Transformation Lead – Learning Intelligence and Effectiveness at Nestle
Jeevan Joshi – Founder – LearningCafe & CapabilityCafe
We discuss
Why learning data needs to move from a reactive mode of collecting completion information to using predictive data to make learning more effective.
How xAPI and other emerging standards provide a platform for better analytics but have implementation challenges.
The opportunities to link learning analytics with business outcomes.
How Artificial Intelligence/ Machine Learning will demand better Learning Analytics.
Data Quality Doesn’t Just Happen: And Here’s What Some of the Industry’s Most... – InsightInnovation
Data quality isn’t always the sexiest topic, but it’s a critical one that buyers and suppliers often neglect. The ramifications of ignoring it can cost millions of dollars. Some of the industry’s largest buyers and suppliers have found a simple solution, though, and it’s one that is available to everyone else too. Come hear about how data quality concerns haven’t gone away, and what others are doing to make sure they and their insights are protected.
Slides for Xin Fu and Hernan Asorey's tutorial at 2015 KDD conference. Talk covers key aspects of Data Science partnership with Product, including how to create a solid foundation for the partnership, how to leverage technology, as well as recruiting and team structure.
Opening/Framing Comments: John Behrens, Vice President, Center for Digital Data, Analytics, & Adaptive Learning, Pearson
Discussion of how the field of educational measurement is changing; how long-held assumptions may no longer be taken for granted; and how new terminology and language are coming into the field.
Panel 1: Beyond the Construct: New Forms of Measurement
This panel presents new views of what assessment can be and new species of big data that push our understanding for what can be used in evidentiary arguments.
Marcia Linn, Lydia Liu from UC Berkeley and ETS discuss continuous assessment of science and new kinds of constructs that relate to collaboration and student reasoning.
John Byrnes from SRI International discusses text and other semi-structured data sources and different methods of analysis.
Kristin Dicerbo from Pearson discusses hidden assessments and the different student interactions and events that can be used in inferential processes.
Panel 2: The Test is Just the Beginning: Assessments Meet Systems Context
This panel looks at how assessments are not the end game, but often the first step in larger big-data practices at districts/state/national levels.
Gerald Tindal from the University of Oregon discusses State data systems and special education, including curriculum-based measurement across geographic settings.
Jack Buckley, Commissioner of the National Center for Education Statistics, discusses national datasets where tests and other data connect.
Lindsay Page and Will Marinell from the Strategic Data Project at Harvard discuss state and district datasets used for evaluating teachers, colleges of education, and student progress.
Panel 3: Connecting the Dots: Research Agendas to Integrate Different Worlds
This panel will look at how research organizations are viewing the connections between the perspectives presented in Panels 1 and 2: what is known, and what is still to be discovered in order to achieve the promise of big connected data in education.
Andrea Conklin Bueschel Program Director at the Spencer Foundation
Ed Dieterle Senior Program Officer at the Bill and Melinda Gates Foundation
Edith Gummer Program Manager at National Science Foundation
Best Practices in Recommender System Challenges – Alan Said
Recommender System Challenges such as the Netflix Prize, KDD Cup, etc. have contributed vastly to the development and adoptability of recommender systems. Each year a number of challenges or contests are organized covering different aspects of recommendation. In this tutorial and panel, we present some of the factors involved in successfully organizing a challenge, whether for reasons purely related to research, industrial challenges, or to widen the scope of recommender systems applications.
With the increasing access to big data, organizations are finding new ways to utilize this information within their talent acquisition strategy. During this Spotlight Webinar, we’ll focus on HR analytics and how organizations are leveraging this data to strengthen their recruiting strategies when identifying talent.
During this spotlight webinar, learners will:
Identify how analytics play a role in forecasting the time required to identify and hire candidates
Determine how to leverage analytics to strengthen recruiting strategy
Learn how vendor partnerships can provide HR analytics that support workforce planning.
Microsoft: A Waking Giant In Healthcare Analytics and Big Data – Health Catalyst
In 2005, Northwestern Memorial Healthcare embarked upon a strategic Enterprise Data Warehousing (EDW) initiative with the Microsoft technology platform as the foundation. Dale Sanders was CIO at Northwestern and led the development of Northwestern’s Microsoft-based EDW. At that time, Microsoft as an EDW platform was not in vogue and there were many who doubted the success of the Northwestern project. While other organizations were spending millions of dollars and years developing EDWs and analytics on other platforms, Northwestern achieved great and rapid value at a fraction of the cost of the more typical technology platforms. Now, there are more healthcare data warehouses built around Microsoft products than any other vendor. The risky bet on Microsoft in 2005 paid off.
Ten years ago, critics didn’t believe that Microsoft could scale in the second generation of relational data warehouses, but they did. More recently, many of these same pundits have criticized Microsoft for missing the technology wave du jour in cloud offerings, mobile technology, and big data. But, once again, Microsoft has been quietly reengineering its culture and products, and as a result, they now offer the best value and most visionary platform for cloud services, big data, and analytics in healthcare.
In this context, Dale will talk about:
His up-and-down journey with Microsoft as an Air Force and healthcare CIO, and why he is now more bullish on Microsoft than ever before
A quick review of the Healthcare Analytics Adoption Model and Closed Loop Analytics in healthcare, and how Microsoft products relate to both
The rise of highly specialized, cloud-based analytic services and their value to healthcare organizations’ analytics strategies
Microsoft’s transformation from a closed-system, desktop PC company to an open-system consumer and business infrastructure company
The current transition period of enterprise data warehouses between the decline of relational databases and the rise of non-relational databases, and the new Microsoft products, notably Azure and the Analytic Platform System (APS), that bridge the transition of skills and technology while still integrating with core products like Office, Active Directory, and System Center
Microsoft’s strategy with its PowerX product line, and geospatial analysis and machine learning visualization tools
Using Net Promoter Score (NPS) to Increase Course Engagement – Lambda Solutions
A core part of measuring how learners engage with your course is measuring their reaction to it. A popular technique for measuring customer experience is the Net Promoter Score (NPS). Most organizations struggle to structure an NPS survey effectively, which overwhelms respondents or makes it extraordinarily hard to use the data to make improvements.
In this webinar, we explore best practices in creating NPS surveys, analyzing the data, and applying lean learning analytics techniques to use the feedback to continuously improve your courses.
Tune in!
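For reference, the NPS metric itself is simple: the percentage of promoters (scores 9-10) minus the percentage of detractors (scores 0-6), with passives (7-8) counted only in the denominator. A minimal sketch (the sample responses are hypothetical):

```python
def net_promoter_score(scores):
    """NPS = % promoters (9-10) minus % detractors (0-6) on a 0-10 scale."""
    if not scores:
        raise ValueError("no responses")
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100 * (promoters - detractors) / len(scores)

responses = [10, 9, 9, 8, 7, 6, 3, 10]  # hypothetical course feedback
print(net_promoter_score(responses))    # 4 promoters, 2 detractors of 8 -> 25.0
```

The hard part the webinar addresses is not this arithmetic but survey design and acting on the feedback; segmenting the score by course or cohort is usually where improvement opportunities appear.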
Serving tens of billions of personalized recommendations a day under a latency of 30 milliseconds is a challenge. In this talk I'll share our algorithmic architecture, including its Spark-based offline layer and its Elasticsearch-based serving layer, which enable running complex models under difficult scale constraints and shorten the cycle between research and production.
Sonya Liberman leads the Personalization team @ Outbrain's Recommendations group, developing large-scale machine learning algorithms for Outbrain's content recommendations platform, which serves tens of billions of real-time recommendations a day. She specializes in Information Retrieval, Machine Learning, and Computational Linguistics. Before joining Outbrain, she led Research and Algorithms @ ConvertMedia (acquired by Taboola). She holds an MSc in Computer Science and a BSc in Computer Science and Computational Biology.
This invited talk was given at PyData Meetup, April 2019
https://www.meetup.com/PyData-Tel-Aviv/
Adjusting primitives for graph : SHORT REPORT / NOTES – Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list-based graph representation that is compact and fast to traverse.
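A minimal sketch of building a CSR structure from an edge list (illustrative, not the report's code): CSR packs all adjacency lists into one flat `targets` array, with `offsets[v]:offsets[v+1]` giving vertex v's slice, which keeps neighbour iteration contiguous in memory.

```python
def to_csr(edges, n):
    """Build CSR (offsets, targets) for a directed graph on n vertices."""
    degree = [0] * n
    for u, _ in edges:                 # first pass: count out-degrees
        degree[u] += 1
    offsets = [0] * (n + 1)
    for v in range(n):                 # prefix sums give row offsets
        offsets[v + 1] = offsets[v] + degree[v]
    targets = [0] * len(edges)
    fill = offsets[:n]                 # next free slot per vertex
    for u, v in edges:                 # second pass: scatter edges
        targets[fill[u]] = v
        fill[u] += 1
    return offsets, targets

offsets, targets = to_csr([(0, 1), (0, 2), (1, 2), (2, 0)], n=3)
print(offsets, targets)  # [0, 2, 3, 4] [1, 2, 2, 0]
```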
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
The Building Blocks of QuestDB, a Time Series Database – Javier Ramirez
Talk delivered at Valencia Codes Meetup, June 2024.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... – pchutichetpong
M Capital Group (“MCG”) expects demand to keep growing while supply evolves, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), as the need for data storage keeps expanding with global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps avoid duplicate computations and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be calculated directly; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
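As a baseline for the optimizations above, plain power-iteration PageRank might look like the following sketch (illustrative; it assumes no dangling vertices, and the STICD optimizations such as skipping converged vertices, chain short-circuiting, and SCC-ordered computation would layer on top of this loop):

```python
def pagerank(graph, d=0.85, tol=1e-10):
    """Power-iteration PageRank; `graph` maps each vertex to its out-neighbours.
    Assumes every vertex has at least one out-link (no dangling vertices)."""
    n = len(graph)
    ranks = {v: 1 / n for v in graph}
    while True:
        contrib = {v: 0.0 for v in graph}
        for u, outs in graph.items():      # each vertex splits its rank
            for v in outs:                 # evenly among its out-links
                contrib[v] += ranks[u] / len(outs)
        new = {v: (1 - d) / n + d * c for v, c in contrib.items()}
        if sum(abs(new[v] - ranks[v]) for v in graph) < tol:
            return new
        ranks = new

g = {0: [1], 1: [2], 2: [0]}   # a 3-cycle: by symmetry all ranks are equal
print(pagerank(g))
```

Per-vertex convergence tracking would replace the single global error sum with a flag per vertex, pruning the inner loop's work exactly as the paragraph describes.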
6. DataScience@SMU
Yelp – Some Data Limitations
• Most non-mobile app users are not registered
• Recommendation success/failure is difficult to measure
• User time of visit is unknown
• Ideal data structure - ?
12. Objectives
I. Make ordinal star ratings more meaningful
II. Improve the modern-generation recommendation system with the value from the text
30. The Effects of Unsafe Recommendation Products
• Attackers gain the economic benefit.
• Good products or companies lose market share.
Are users' livelihoods directly tied to our system?
33. Conclusions
• The use of text data makes star ratings more meaningful.
• The text of reviews improves recommendation system quality, even with small sample sizes.
• The fake-review penalty in Yelp's system is too harsh.
Editor's Notes
Recommendations are ubiquitous in today's technology products. They save us time so we don’t have to continuously search. On LinkedIn we are recommended to connect with individuals closely related to us and on YouTube we see an endless array of quality data science videos.
On Amazon we are recommended products we may have never even heard of but would like to buy.
On Yelp we are shown popular restaurants we may not have heard of.
Let's take a look at how these systems are built today.
What started out as a simple item, user, and rating algorithm has now evolved. These systems now include not just the recommendation layer but also preprocessing on inputs, so that the recommendations can leverage all new types of data. New data points are now included to add value.
As data science methods related to natural language processing evolve, a new layer is being introduced to this stack. The power of language is being added in what experts are calling an ontology model, where a user is no longer just a rating tied to an item, but instead a complex web of interests and opinions.
From this added complexity we can conclude that while data may be valuable for a business purpose, not all of these valuable data points naturally translate over to being used in recommendations. We want to know our data is representative and can be trusted for these algorithms!
Looking at our Venn diagram, Yelp is missing a majority of these dimensions. Most non-mobile users are not registered, so no additional data is captured. Measuring success and failure is generally difficult, and finally, we rarely know when a user visited.
__________________
1. Users do not need to be logged in to view ratings and reviews. This makes most visitors of Yelp unregistered users, which makes personalized user-based recommendations less useful.
2. Without knowing which user was shown a recommendation, it is not possible to verify whether the user actually went to the recommended restaurant, or to get the user's feedback on it. If this were possible, the accuracy of the recommendation system could be improved over time.
3. The reviewer's time of visit to the business is not recorded in the system, so the delay between the visit to the restaurant and the writing of the review is unavailable. This information would be critical: the longer this delay, the more likely the rating provided by the user is a little lower. "Almost everyone remembers negative things more strongly and in more detail." (Clifford Nass, professor of communication at Stanford University)
Yelp's primary data comes in the form of a business review like this. Users post these with a star rating and review text, and can add images or check in if they want. There is also some metadata about the reviews themselves. For recommendations, the 1-to-5 star rating seems like the clear starting point.
The Yelp star rating is an ordinal data point. It is naturally ambiguous: we can't be sure exactly how different each rank is from the others. This distance is key, because the algorithms that allow us to provide useful recommendations leverage this distance in a very literal, mathematical way.
This ambiguous nature of star ratings is clear on closer inspection. We noted that star distributions were heavily skewed, which becomes a large issue when training the recommender algorithm. A better shape would be what we see on the right: a distribution centered on an average experience.
So what if we could transform these stars into something better? By infusing each with the text of its review, perhaps we can achieve a more normal distribution and improve the recommendations our model produces.
_________________________________
Recommendation algorithms, at a high level, provide you with information based on others who look similar. If we cannot clearly pin down what is bad, average, and great for a user in a clean numeric format, then we are not leveraging these tools as well as we could. Out of the box our data looks like the left here, but what we would actually prefer is something closer to the right.
So what if we could create this better star rating? We would want something that carries this information in a clear numeric format; our recommendations would benefit heavily, turning what was previously noise confusing our algorithm into relevant signal.
Our recommendations will greatly benefit from this data!
The information in the text of a review is very personal. Different individuals use different word choices, word order, grammar, and more. Using these pieces of information we will attempt to redistribute stars so that they hold more information for our recommender models.
Our strategy, at a high level, is to focus our efforts on the star creation aspect of the pipeline, taking the yelp data and creating new stars. We then will build recommendation models and review the effect of these stars to get a sense of the functionality added by our methods.
Our star factory has two sides: one neural network learns the normal review style for each business, while another learns it for each user. The combined star output considers text regarding both entities, and our final result is a new star rating.
Building a recommendation algorithm, and making sure the data format does not distort the results, is key. We used an algorithm from the same family as Yelp's: the collaborative filtering family. These methods make much better use of our adjusted stars than of the raw ones.
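One simple member of the collaborative filtering family, user-based filtering with cosine similarity, can be sketched as follows. The data layout and names are illustrative, not Yelp's actual implementation:

```python
import math

# Minimal user-based collaborative filtering sketch.
# Ratings are stored as {user: {item: stars}}.

def cosine(a, b):
    """Cosine similarity between two users' rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den

def predict(ratings, user, item):
    """Similarity-weighted average of other users' stars for `item`."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        s = cosine(ratings[user], theirs)
        num += s * theirs[item]
        den += abs(s)
    return num / den if den else None
```

Because the prediction is a similarity-weighted average of literal star values, any ambiguity in the star scale flows straight into the output, which is why the adjusted stars matter.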
We use measurement to verify whether our methods provide an improvement to the final recommendation model. To achieve this we first look at the adjusted star distribution at a very coarse level: simply a histogram of all the new star ratings.
We also measure the skewness of all user star distributions to ensure it is having an effect across all our users. We will explain this in more detail in a moment, but you can think of our second metric as a way to average the histograms of each user's specific star distribution.
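As a rough sketch of this second metric, sample skewness can be computed per user and then inspected across users. The star data below is made up for illustration:

```python
# Per-user skewness of star distributions, using the standardized
# third central moment as the skewness measure.

def skewness(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # variance
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

user_stars = {
    "u1": [5, 5, 5, 4, 1],  # heavily positive with one bad outlier
    "u2": [3, 3, 4, 2, 3],  # roughly symmetric around average
}
per_user_skew = {u: skewness(s) for u, s in user_stars.items()}
```

A histogram of these per-user skewness values is the plot the notes refer to: tails far from zero mean many users still have lopsided star distributions.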
Our basic star method is, as we would expect, weak along the dimensions we use to measure success. Its distribution is heavily skewed, and the tails of the skew plot extend from –3 to 2. Simply put, our standard format isn't ideal for recommendation.
If we use a basic sentiment score as our new stars, one of the simpler NLP applications, we notice a change in the overall distribution. Looking at the skewness of users as a whole, however, we note that the user skewness plot is almost an exact match to before. This means our method reshapes the stars overall but does not help individual user distributions.
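A minimal illustration of how a sentiment score could stand in for stars. The lexicon and the linear mapping onto the 1-5 range are invented for this sketch and are not the method used in the project:

```python
# Toy lexicon-based sentiment scorer rescaled onto the 1-5 star range.

LEXICON = {"great": 1.0, "good": 0.5, "ok": 0.0, "bad": -0.5, "awful": -1.0}

def sentiment_stars(text):
    """Average word polarity in [-1, 1], mapped linearly onto [1, 5]."""
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    polarity = sum(scores) / len(scores) if scores else 0.0
    return 3.0 + 2.0 * polarity
```

Note that such a score is computed from the review text alone, with no notion of who wrote it, which is exactly why it reshapes the global distribution without fixing individual users' skew.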
Our final combined star method achieves both of our goals: producing a normal distribution in the top plot and reducing the tails of our users' skewness by a fair margin.
________________
Looking at the plot on the bottom right, we see the tails have shortened as well. This highlights that our users' star data now looks closer to a bell curve on average, and that we have begun to successfully change star ratings at the user level, adding a great depth of information.
The best way to think about our deep learning approach is through the basic goal of the architecture: learn what a user's and a business's historic reviews look like, then analyze the current text and use that respective value to adjust the star rating.
__________________
In this manner, deviations from normal behavior, like a very strict user giving a positive, flowery review for the first time, are weighted as far more important than the same review text from a user who is generally very positive.
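The real approach uses two neural networks; as an illustrative simplification only, the same deviation-from-baseline idea can be sketched with per-user and per-business average sentiment:

```python
# Toy version of the "deviation from normal behavior" adjustment.
# Sentiment values are in [-1, 1]; histories are lists of past
# sentiment scores for the user and the business (all made up).

def adjusted_star(star, review_sentiment, user_history, business_history):
    """Shift a 1-5 star rating by how far this review's sentiment
    deviates from the user's and the business's usual sentiment."""
    user_baseline = sum(user_history) / len(user_history)
    biz_baseline = sum(business_history) / len(business_history)
    # A review that is unusually positive or negative for BOTH the
    # user and the business moves the star the most.
    deviation = ((review_sentiment - user_baseline)
                 + (review_sentiment - biz_baseline))
    return min(5.0, max(1.0, star + deviation))
```

A strict user's first glowing review produces a large positive shift, while the same text from a habitually positive user leaves the star essentially unchanged.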
Our new stars perform well on an out-of-sample dataset under the FCP score, a common metric for these methods where higher is better, although the ultimate validation for this method would be delivering recommendations to users via a product and reviewing the response.
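FCP (fraction of concordant pairs) itself is straightforward to compute; a sketch with made-up ratings:

```python
from itertools import combinations

# Fraction of Concordant Pairs: among each user's item pairs with
# distinct true ratings, the share where the predicted ordering
# agrees with the true ordering.

def fcp(true_by_user, pred_by_user):
    concordant = discordant = 0
    for user, truth in true_by_user.items():
        preds = pred_by_user[user]
        for i, j in combinations(truth, 2):
            if truth[i] == truth[j]:
                continue  # ties in the true ratings carry no order info
            if (truth[i] - truth[j]) * (preds[i] - preds[j]) > 0:
                concordant += 1
            else:
                discordant += 1
    return concordant / (concordant + discordant)
```

Because FCP only checks pairwise ordering within each user, it rewards exactly what a recommender needs: ranking a user's better options above their worse ones.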
Let's compare with where Yelp is today. Why aren't they doing this? Yelp implements its recommendation model in Spark, sandwiching it between two layers of filtering and machine learning. Within these layers they use techniques like NLP and deep learning to remove fake reviews.
Yelp's implementation has a few limitations. The Spark API, while scalable, is difficult to extend. Yelp's preprocessing is often considered clunky: reviews that are valid and useful are randomly discarded, which can cascade into some odd suggestions for users.
While Yelp is focused on providing a valuable service, its incentives are somewhat contradictory to providing the best recommendations. It makes most of its revenue from advertising businesses near the top of search results, so adjusting what is shown to users would damage revenues. Should recommendation results be mixed between paid and unpaid?
Ethics discussion.
http://www.yelp-ir.com/news-releases/news-release-details/yelp-reports-fourth-quarter-and-full-year-2017-financial-results
New methods extend the data past just what's in the data set
3rd generation recommender systems are not just about the collaborative filtering algorithm anymore. Building additional components to amplify accuracy or increase robustness against noise is the key to building these products in the coming age. As we have shown in this presentation so far, this algorithm layer of the recommendation stack is open for data scientists to do whatever they feel could improve the end result.
Thinking about the results of a heavily relied-upon recommendation system becomes important as the product grows. Google has had growing pains in this space as it grew important enough to cause damage, helping criminals find protected persons or destroying businesses that relied on their search rank after an unfair automated ban.
These cascading effects can be anticipated by asking oneself what type of value the platform delivers, and what additional benefit there is to being ranked near the top. For Yelp, the additional free publicity on one of the most trafficked restaurant review websites is highly valuable. Not only do attackers benefit, but good products and companies also lose market share. We must ask ourselves: are users' livelihoods directly tied to our system?
When these systems become economically important, privacy becomes an issue of user safety. If an Elite Yelp user has heavy sway over the success of a business, what expectations are there of technologists to maintain anonymity for users without sacrificing quality?
There are some weaknesses when relying on these methods. Adversarial agents trained in a similar way can dupe our system and get ranked highly. Adding a large volume of fake reviews to a system like this would cause a large restructuring of the results delivered, with cascading effects.
1. We can improve ordinal star ratings by translating them with text data to create a new star rating.
2. These new star ratings can then simply be passed to a Yelp-style recommendation algorithm and will test highly on out-of-sample data.