Steffen Rendle, Research Scientist, Google at MLconf SFMLconf
Title: Factorization Machines
Abstract:
Developing accurate recommender systems for a specific problem setting seems to be a complicated and time-consuming task: models have to be defined, learning algorithms derived and implementations written. In this talk, I present the factorization machine (FM) model which is a generic factorization approach that allows to be adapted to problems by feature engineering. Efficient FM learning algorithms are discussed among them SGD, ALS/CD and MCMC inference including automatic hyperparameter selection. I will show on several tasks, including the Netflix prize and KDDCup 2012, that FMs are flexible and generate highly competitive accuracy. With FMs these results can be achieved by simple data preprocessing and without any tuning of regularization parameters or learning rates.
Steffen Rendle, Research Scientist, Google at MLconf SFMLconf
Title: Factorization Machines
Abstract:
Developing accurate recommender systems for a specific problem setting seems to be a complicated and time-consuming task: models have to be defined, learning algorithms derived and implementations written. In this talk, I present the factorization machine (FM) model which is a generic factorization approach that allows to be adapted to problems by feature engineering. Efficient FM learning algorithms are discussed among them SGD, ALS/CD and MCMC inference including automatic hyperparameter selection. I will show on several tasks, including the Netflix prize and KDDCup 2012, that FMs are flexible and generate highly competitive accuracy. With FMs these results can be achieved by simple data preprocessing and without any tuning of regularization parameters or learning rates.
Slides to accompany Patrick McSweeney's winning pitch in the Open Repositories 2012 DevCSI Developer Challenge.
More information about this entry can be found at http://devcsi.ukoln.ac.uk/or2012-developer-challenge-data-engine
This is a presentation on natural language processing (NLP), machine learning (ML), and Big Data. I introduce neural networks, unsupervised machine learning, and a variety of natural language processing techniques, and I cover architectural best practices, ways to implement data science for practical applications, and how to deal with organizational road-blocks.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceMark West
Data Science has been described as the sexiest job of the 21st Century. But what is Data Science? And what has Machine Learning got to do with all this? In this session I will share insights and knowledge that I have gained from building up a Data Science department from scratch. The talk will be split into three sections:
1. I’ll begin by defining what Data Science is, how it is related to Machine Learning and share some tips for introducing Data Science to your organization.
2. Next up we’ll run through some commonly used Machine Learning algorithms used by Data Scientists, along with examples for use cases where these algorithms can be applied.
3. The final third of the talk will be a demonstration of how you can quickly get started with Data Science and Machine Learning using Python and the Open Source scikit-learn Library.
Finding Emerging Topics Using Chaos and Community Detection in Social Media G...Paragon_Science_Inc
In this talk, we describe our recent work in the analysis of Twitter-based network graphs, including the Ebola crisis in 2014 and the stock market in 2015.
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)Stian Soiland-Reyes
Presentation at BOSC 2013 / ISMB 2013. (PowerPoint 2013 source)
PDF: https://www.slideshare.net/soilandreyes/2013-0719bosc-2013myexperimentresearchobjectsslides
See also poster at http://www.slideshare.net/soilandreyes/2013-0718bosc-2013myexperimentresearchobjectsposter-24242509 or
submitted abstract: https://docs.google.com/document/d/1jaAuPV-EnbsyI14L56HKHBQP7eDVfeXGLlK-LwohnWw/edit?usp=sharing
We have evolved Research Objects as a mechanism to preserve digital resources related to research, by providing mechanisms, formats and architecture for describing aggregated resources (hypothesis, workflow, datasets, scripts, services), their relations (is input for, explains, used by), provenance (graph was derived from dataset A, B and C) and attribution (who contributed what, and when?).
The website myExperiment is already popular for collaborating on, publishing and sharing scientific workflows, however we have found that for understanding and preserving a workflow over time, its definition is not enough, specially faced with workflow decay, services and tools that change over time. We have therefore adapted the research object model as a foundation for the myExperiment packs, allowing uploading of workflow runs, inputs, outputs and other files relevant to the workflow, relating them with annotations and integrated the Wf4Ever architecture for performing decay analysis and tracking a research object’s evolution as it and its constituent resources change over time.
Part of the ongoing effort with Skater for enabling better Model Interpretation for Deep Neural Network models presented at the AI Conference.
https://conferences.oreilly.com/artificial-intelligence/ai-ny/public/schedule/detail/65118
The previous research has focused on quick and efficient generation of wrappers; the
development of tools for wrapper maintenance has received less attention. This is an important research
problem because Web sources often change in ways that prevent the wrappers from extracting data
correctly. Present an efficient algorithm that extract unstructured data to structural data from web. The
wrapper verification system detects when a wrapper is not extracting correct data, usually because the
Web source has changed its format. The Verification framework automatically recovers data using
Dimension Reduction Techniques from changes in the Web source by identifying data on Web pages.
After apply wrapped data to One Class Classification in Numerical features for avoid classification
problem. Finally, the result data apply in Top-K query for provide best rank based on probabilities
scores. Wrapper verification system relies on one-class classification techniques to beat previous
weaknesses to identify the problem by analysing both the signature and the classifier output. If there are
sufficient mislabelled slots, a technique to find a pattern could be explored.
Keynote speech - Carole Goble - Jisc Digital Festival 2015Jisc
Carole Goble is a professor in the school of computer science at the University of Manchester.
In this keynote, Carole offered her insights into research data management and data centres.
RARE and FAIR Science: Reproducibility and Research ObjectsCarole Goble
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material except maybe the IT Crowd screengrab reusable
Slides to accompany Patrick McSweeney's winning pitch in the Open Repositories 2012 DevCSI Developer Challenge.
More information about this entry can be found at http://devcsi.ukoln.ac.uk/or2012-developer-challenge-data-engine
This is a presentation on natural language processing (NLP), machine learning (ML), and Big Data. I introduce neural networks, unsupervised machine learning, and a variety of natural language processing techniques, and I cover architectural best practices, ways to implement data science for practical applications, and how to deal with organizational road-blocks.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceMark West
Data Science has been described as the sexiest job of the 21st Century. But what is Data Science? And what has Machine Learning got to do with all this? In this session I will share insights and knowledge that I have gained from building up a Data Science department from scratch. The talk will be split into three sections:
1. I’ll begin by defining what Data Science is, how it is related to Machine Learning and share some tips for introducing Data Science to your organization.
2. Next up we’ll run through some commonly used Machine Learning algorithms used by Data Scientists, along with examples for use cases where these algorithms can be applied.
3. The final third of the talk will be a demonstration of how you can quickly get started with Data Science and Machine Learning using Python and the Open Source scikit-learn Library.
Finding Emerging Topics Using Chaos and Community Detection in Social Media G...Paragon_Science_Inc
In this talk, we describe our recent work in the analysis of Twitter-based network graphs, including the Ebola crisis in 2014 and the stock market in 2015.
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)Stian Soiland-Reyes
Presentation at BOSC 2013 / ISMB 2013. (PowerPoint 2013 source)
PDF: https://www.slideshare.net/soilandreyes/2013-0719bosc-2013myexperimentresearchobjectsslides
See also poster at http://www.slideshare.net/soilandreyes/2013-0718bosc-2013myexperimentresearchobjectsposter-24242509 or
submitted abstract: https://docs.google.com/document/d/1jaAuPV-EnbsyI14L56HKHBQP7eDVfeXGLlK-LwohnWw/edit?usp=sharing
We have evolved Research Objects as a mechanism to preserve digital resources related to research, by providing mechanisms, formats and architecture for describing aggregated resources (hypothesis, workflow, datasets, scripts, services), their relations (is input for, explains, used by), provenance (graph was derived from dataset A, B and C) and attribution (who contributed what, and when?).
The website myExperiment is already popular for collaborating on, publishing and sharing scientific workflows, however we have found that for understanding and preserving a workflow over time, its definition is not enough, specially faced with workflow decay, services and tools that change over time. We have therefore adapted the research object model as a foundation for the myExperiment packs, allowing uploading of workflow runs, inputs, outputs and other files relevant to the workflow, relating them with annotations and integrated the Wf4Ever architecture for performing decay analysis and tracking a research object’s evolution as it and its constituent resources change over time.
Part of the ongoing effort with Skater for enabling better Model Interpretation for Deep Neural Network models presented at the AI Conference.
https://conferences.oreilly.com/artificial-intelligence/ai-ny/public/schedule/detail/65118
The previous research has focused on quick and efficient generation of wrappers; the
development of tools for wrapper maintenance has received less attention. This is an important research
problem because Web sources often change in ways that prevent the wrappers from extracting data
correctly. Present an efficient algorithm that extract unstructured data to structural data from web. The
wrapper verification system detects when a wrapper is not extracting correct data, usually because the
Web source has changed its format. The Verification framework automatically recovers data using
Dimension Reduction Techniques from changes in the Web source by identifying data on Web pages.
After apply wrapped data to One Class Classification in Numerical features for avoid classification
problem. Finally, the result data apply in Top-K query for provide best rank based on probabilities
scores. Wrapper verification system relies on one-class classification techniques to beat previous
weaknesses to identify the problem by analysing both the signature and the classifier output. If there are
sufficient mislabelled slots, a technique to find a pattern could be explored.
Keynote speech - Carole Goble - Jisc Digital Festival 2015Jisc
Carole Goble is a professor in the school of computer science at the University of Manchester.
In this keynote, Carole offered her insights into research data management and data centres.
RARE and FAIR Science: Reproducibility and Research ObjectsCarole Goble
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material except maybe the IT Crowd screengrab reusable
Similar to Meetup#2. Intro to Factorization Machines (20)
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
3. Q: What, say, 3 recent papers in machine learning do you think will be influential to directing the cutting edge
of research these days?
Peter Norvig: I’ve never been able to pick lasting papers in the past, so don’t trust me now, but here are a few:
● Rendle’s “Factorization Machines”
● Wang et al. “Bayesian optimization in high dimensions via random embeddings”
● Dean et al. “Fast, Accurate Detection of 100,000 Object Classes on a Single Machine”
http://blog.teamleada.com/2014/08/ask-peter-norvig/
9. Example
U = {Alice (A), Bob (B), Charlie (C), . . .}
I = {Titanic (TI), Notting Hill (NH), Star Wars
(SW), Star Trek (ST), . . .}
10. Example
{(A, TI, 2010-1, 5),(A, NH, 2010-2, 3),(A, SW,
2010-4, 1), (B, SW, 2009-5, 4),(B, ST, 2009-8,
5), (C, TI, 2009-9, 1),(C, SW, 2009-12, 5)}
Interaction between Alice and Star Trek to predict rating?
Zero interaction?
B-SW and C-SW are similar
A and C are different
ST and SW are similar
A-SW and A-ST are to be similar
17. FFM:ideas
Features can be grouped into fields: users,
movies, context, SSPs, publishers, whatever
Better use this information
Factor vector per field