idealo.de offers a price comparison service on millions of products from a wide range of categories. Each day we receive millions of offers that we cannot map to our product catalogue. We started clustering these offers to create new product clusters to ultimately enhance our product catalogue. For this we mainly use two open-source libraries:
Sentence-Transformers to encode the offers into a vector space
Facebook Faiss to perform k-nearest-neighbour search in that vector space
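As a rough sketch of how these two pieces fit together: the `encode` function below is a toy stand-in for `SentenceTransformer.encode` (it builds deterministic character-trigram vectors instead of real sentence embeddings), and `knn` stands in for a Faiss `IndexFlatIP` search, so only the shapes and the flow match the real libraries.

```python
import hashlib
import numpy as np

def encode(titles, dim=64):
    """Toy stand-in for SentenceTransformer.encode: deterministic
    pseudo-embeddings from character trigrams, L2-normalized."""
    vecs = np.zeros((len(titles), dim))
    for i, t in enumerate(titles):
        for j in range(len(t) - 2):
            h = int(hashlib.md5(t[j:j + 3].encode()).hexdigest(), 16)
            vecs[i, h % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)

def knn(index_vecs, query_vecs, k):
    """Toy stand-in for faiss.IndexFlatIP.search: exact inner-product top-k."""
    sims = query_vecs @ index_vecs.T            # cosine similarity (unit vectors)
    idx = np.argsort(-sims, axis=1)[:, :k]      # k nearest per query
    return np.take_along_axis(sims, idx, axis=1), idx

offers = ["adidas runfalcon 2.0 black", "adidas runfalcon 2 black",
          "nike air max 90 white"]
emb = encode(offers)
scores, neighbors = knn(emb, emb[:1], k=2)
# the near-duplicate adidas offer is the closest non-identical neighbour
```

In production, the encode step is a fine-tuned Transformer and the search step a Faiss index; the brute-force search here is only feasible at toy scale.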
We will present our results for various optimisation strategies to fine-tune Transformers for our clustering use case. The strategies include Siamese and triplet network architectures, as well as an approach with an additive angular margin loss. Results will also be compared against probabilistic record linkage and a TF-IDF approach.
Further, we will share our lessons learned, e.g. how both libraries make a machine learning engineer's life fairly easy, and how we created informative training data for our best-performing solution.
Joint Keynote at Int. Conference on Knowledge Engineering and Semantic Web and Prague Computer Science Seminar, Prague, September 22, 2016
The challenges of Big Data are frequently explained by dealing with Volume, Velocity, Variety and Veracity. The large variety of data in organizations results from accessing different information systems with heterogeneous schemata or ontologies. In this talk I will present the research efforts that target the management of such broad data.
They include: (i) an integrated development environment for programming with broad data, (ii) a query language that allows for typing of query results, (iii) a typed lambda-calculus based on description logics, and (iv) efficient access to data repositories via schema indices.
(Costless) Software Abstractions for Parallel Architectures
Joel Falcou
Performing large, intensive or non-trivial computing on array-like data structures is one of the most common tasks in scientific computing, video game development and other fields. This is backed up by the large number of tools, languages and libraries for such tasks. If we restrict ourselves to C++-based solutions, more than a dozen such libraries exist, from BLAS/LAPACK C++ bindings to the template-metaprogramming-based Blitz++ or Eigen. While all of these libraries provide good performance or good abstraction, none of them seems to fit the needs of so many different user types.
Moreover, as parallel system complexity grows, maintaining all those components quickly becomes unwieldy. This talk explores various software design techniques - like Generative Programming, Metaprogramming and Generic Programming - and their application to the implementation of a parallel computing library in such a way that:
- abstraction and expressiveness are maximized
- cost over efficiency is minimized
We'll skim over various applications and see how they can benefit from such tools. We will conclude by discussing what lessons were learnt from this kind of implementation and how those lessons can translate into new directions for the language itself.
Google APAC Machine Learning Day was a two-day machine learning workshop held by Google at its Singapore office in early March this year. For this meetup, we have invited Evan Lin, who attended the event, and his colleague Benjamin Chen to share their takeaways, including:
Tensorflow Summit RECAP
What they saw and heard at Machine Learning Expert Day
How Linker Networks uses Tensorflow
https://gdg-taipei.kktix.cc/events/google-apac-machine-learning-day
Use Case Patterns for LLM Applications (1).pdf
M Waleed Kadous
What are the "use case patterns" for deploying LLMs into production? Understanding these will allow you to spot "LLM-shaped" problems in your own industry.
KantanMT Founder and Chief Architect Tony O'Dowd and Technical Project Manager Louise Faherty show you how to improve your team's translation productivity and better manage post-editing effort and translation project schedules with powerful Machine Translation engines.
You will learn:
• How to deal with Translation challenges
• About the necessity of Machine Translation to be competitive
• How KantanMT.com can be integrated with existing Translation Management Systems
ML in the Browser: Interactive Experiences with Tensorflow.js
C4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/39SddUL.
Victor Dibia provides a friendly introduction to machine learning and covers concrete steps on how front-end developers can create their own ML models and deploy them as part of web applications. He discusses his experience building Handtrack.js, a library for prototyping real-time hand-tracking interactions in the browser. Filmed at qconsf.com.
Victor Dibia is a Research Engineer with Cloudera’s Fast Forward Labs. Prior to this, he was a Research Staff Member at the IBM TJ Watson Research Center, New York. His research interests are at the intersection of human computer interaction, computational social science, and applied AI.
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products and services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding and share our work over the past few years in addressing them. We will also showcase how a universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance), and show ongoing efforts toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Applying NLP to product comparison at Visual Meta
Ross Turner
Talk given on NLP at the Elasticsearch meetup in Berlin in February 2017. Discusses word embeddings for product classification, generation of product descriptions and chat bots.
Dmitry Kan, Principal AI Scientist at Silo AI and host of the Vector Podcast, will give an overview of the landscape of vector search databases and their role in NLP, along with the latest news and his view on the future of vector search. Further, he will share how he and his team participated in the Billion-Scale Approximate Nearest Neighbor Challenge and improved recall by 12% over a FAISS baseline.
Presented at https://www.meetup.com/open-nlp-meetup/events/282678520/
YouTube: https://www.youtube.com/watch?v=RM0uuMiqO8s&t=179s
Follow Vector Podcast to stay up to date on this topic: https://www.youtube.com/@VectorPodcast
Learn how the age of language models in NLP can be used and how it applies to you in the real world.
You will learn about word embeddings, sequence modelling, advanced language models, and NLP attention mechanisms. All resources are available for you to grow your knowledge and skills in this Natural Language Processing webinar.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Maurice Nsabimana
Volunteers around the world increasingly act as human sensors to collect millions of data points. A team from the World Bank trained deep learning models, using Apache Spark and BigDL, to confirm that photos gathered through a crowdsourced data collection pilot matched the goods for which observations were submitted.
In this talk, Maurice Nsabimana, a statistician at the World Bank, and Jiao Wang, a software engineer on the Big Data Technology team at Intel, demonstrate a collaborative project to design and train large-scale deep learning models using crowdsourced images from around the world. BigDL is a distributed deep learning library designed from the ground up to run natively on Apache Spark. It enables data engineers and scientists to write deep learning applications in Scala or Python as standard Spark programs, without having to explicitly manage distributed computations. Attendees of this session will learn how to get started with BigDL, which runs in any Apache Spark environment, whether on-premises or in the cloud.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to keep evolving, facilitated by institutional investment rotating out of offices and into work-from-home (“WFH”) assets, while the need for data storage keeps expanding alongside global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as maturing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, exemplified by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has made key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x by value in 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Transformer_Clustering_PyData_2022.pdf
1. Transformer based clustering: Identifying product clusters for E-commerce
Dat Tran - Head of Data Science
Christopher Lennan
Sebastian Wanner
13/04/2022 PyConDE & PyData Berlin
3. idealo key facts
- More than 20 years of experience
- 900+ "idealos" from 40 nations
- Active in 6 different countries (DE, AT, ES, IT, FR, UK)
- 18 million visitors/month
- 50,000 shops
- Over 330 million offers and 2 million products
- Germany's 4th largest eCommerce website
11. Results: probabilistic record linkage (splink)
Dataset: 10k products (shoe category), ⌀ 17 offers per product
- Rule-based linkage: precision 👍, recall 👎
- Scaling the ruleset is a challenge
- No exhaustive hyper-parameter tuning performed
https://github.com/moj-analytical-services/splink
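splink implements the Fellegi-Sunter model of probabilistic record linkage. As a rough illustration of the idea (not splink's actual API), a match score can be computed by summing per-field log2 weights for agreement or disagreement; the m- and u-probabilities below are made-up values, where real ones would be estimated from data (e.g. by EM).

```python
import math

# Hypothetical per-field probabilities: m = P(field agrees | records match),
# u = P(field agrees | records do not match).
FIELDS = {
    "brand":  {"m": 0.95, "u": 0.20},
    "title":  {"m": 0.80, "u": 0.01},
    "colour": {"m": 0.90, "u": 0.25},
}

def match_weight(rec_a, rec_b):
    """Sum of per-field log2 Bayes factors (Fellegi-Sunter match weight)."""
    total = 0.0
    for field, p in FIELDS.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(p["m"] / p["u"])              # agreement weight
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return total

a = {"brand": "adidas", "title": "runfalcon 2.0", "colour": "black"}
b = {"brand": "adidas", "title": "runfalcon 2.0", "colour": "core black"}
c = {"brand": "nike",   "title": "air max 90",    "colour": "white"}
# a/b agree on two fields -> positive weight; a/c agree on none -> negative
```

A threshold on this weight then decides which offer pairs are linked, which is where the precision/recall trade-off above comes from.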
12. Embeddings based clustering
Pipeline: Offers → Transformer encoders (ML model) → offers as vectors → KNN clustering → product clusters (e.g. cluster A)
- Text attributes (e.g. EAN: 123, title: abc, colour: lmn) are used as features
- The model outputs embeddings
- KNN clustering groups similar vectors into cluster embeddings
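The final "cluster similar vectors" step can be sketched as connecting every pair of offer vectors whose cosine similarity exceeds a threshold and taking connected components. This is a simplified stand-in for the KNN-based clustering in the talk, with toy vectors and a union-find over the similarity graph.

```python
import numpy as np

def cluster_by_threshold(vecs, threshold=0.9):
    """Connect offer pairs with cosine similarity above `threshold`
    and return connected components as cluster labels (union-find)."""
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    parent = list(range(len(vecs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if sims[i, j] >= threshold:
                parent[find(i)] = find(j)   # union the two components
    return [find(i) for i in range(len(vecs))]

# two offers of the same product plus one unrelated offer
vecs = np.array([[1.0, 2.0, 3.0], [1.1, 2.0, 3.0], [-3.0, 1.0, 0.5]])
labels = cluster_by_threshold(vecs)
# first two offers land in the same cluster, the third in its own
```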
14. Transfer Learning with Transformers
Learn one task, transfer knowledge to a new task: pretraining, then fine-tuning.
Training objective: masked language modelling on unlabeled text data
- Sentence: Where are we [MASK]
- Label: going
Unlabeled text data → pretrained model
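The masked-language-modelling objective above turns raw text into self-supervised training examples. A toy generator that masks one random token per sentence (real MLM pretraining masks ~15% of subword tokens, not whole words):

```python
import random

def make_mlm_example(sentence, rng=random):
    """Turn a raw sentence into an MLM training example by replacing
    one randomly chosen token with [MASK]; the hidden token is the label."""
    tokens = sentence.split()
    pos = rng.randrange(len(tokens))
    label = tokens[pos]
    tokens[pos] = "[MASK]"
    return " ".join(tokens), label

masked, label = make_mlm_example("Where are we going", random.Random(0))
# e.g. ("Where are we [MASK]", "going"), depending on the sampled position
```

Because the labels come for free from the text itself, this objective scales to the unlabeled corpora used for pretraining.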
15.-17. Transfer Learning with Transformers
Leverage large scale pre-trained language models:
Pre-training → microsoft/mpnet-base
- Transformer encoder with 110M parameters
- 160 GB uncompressed texts (five English-language corpora)
- Training time: 35 days on 32 GPUs
Fine-tuning → sentence-transformers/all-mpnet-base-v2
- Trained on 1.2 billion English sentence pairs
- Transferred to 100+ languages through Multi-Lingual Knowledge Distillation
Fine-tuning → idealo-offer-clustering
- Trained on >5 million idealo offer pairs
- Training time: 28 hours on an NVIDIA V100 GPU
19. Siamese Networks
Train on positive and negative training pairs (label: 1 = similar, 0 = not similar).
Clustering performance before fine-tuning: 0.58; after fine-tuning: 0.76 (+18 pp)
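The Siamese objective behind these numbers is typically a contrastive loss: pull embeddings of similar offers (label 1) together and push dissimilar ones (label 0) at least a margin apart. A numpy sketch of the loss itself (the margin value and embeddings are illustrative):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, labels, margin=1.0):
    """Contrastive loss for a Siamese network:
    label 1 (similar)    -> penalize squared distance,
    label 0 (dissimilar) -> penalize being closer than `margin`."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)               # pairwise distances
    pos = labels * d ** 2                                   # similar: push d -> 0
    neg = (1 - labels) * np.maximum(0.0, margin - d) ** 2   # dissimilar: d >= margin
    return np.mean(pos + neg)

a = np.array([[0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.1, 0.0], [2.0, 0.0]])
labels = np.array([1.0, 0.0])   # first pair similar, second dissimilar
loss = contrastive_loss(a, b, labels)
# only the similar pair contributes here: the dissimilar pair is already
# farther apart than the margin
```

Sentence-Transformers ships ready-made loss implementations of this family, so in practice only the training pairs need to be supplied.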
20. Sentence Transformers
- Provides access to language models fine-tuned on 1 billion sentence pairs
- Integrated with the Hugging Face Model Hub
- Multilingual models available, support for 100+ languages
- 10+ loss functions implemented and ready to use
23. Generate Training Pairs
Choose positive pairs and negative pairs randomly. Lessons learned:
- Randomly selected negative pairs are too easy for the model.
- Random negative pairs do not contribute much to training progress.
- The model quickly converges and performance is not good enough.
24. Generate Training Pairs
Select hard-negative pairs (offline strategy), repeated every epoch:
1. Compute embeddings
2. Average the embeddings for each product cluster
3. Search for neighbors
4. Generate pairs
5. Training
Result: +6 pp
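A numpy sketch of a simplified variant of this hard-negative step: the slide averages embeddings per product cluster before searching, while here each offer directly looks up its most similar offer from a *different* cluster as its hard negative (toy data; the real pipeline repeats this each epoch at million-pair scale).

```python
import numpy as np

def mine_hard_negatives(embeddings, cluster_ids):
    """For each offer, return the index of the most similar offer that
    belongs to a different product cluster (a 'hard' negative)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    ids = np.array(cluster_ids)
    hard = []
    for i, cid in enumerate(cluster_ids):
        sims_i = sims[i].copy()
        sims_i[ids == cid] = -np.inf        # mask the offer's own cluster
        hard.append(int(np.argmax(sims_i))) # most similar outsider
    return hard

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 1.0]])
clusters = [0, 0, 1, 1]                     # product cluster per offer
negatives = mine_hard_negatives(emb, clusters)
# each offer is paired with its nearest offer from the other cluster
```

Pairs built this way stay hard enough to keep contributing gradient, which is what the random pairs on the previous slide failed to do.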
26. Building product clusters
Find the K nearest neighbors (K=10) and apply a similarity threshold.
Challenges:
- Scale to millions of vector searches
- Search quality is important
- Search time should be small
27. Faiss, built by Facebook Research
- Scales to billions of vectors ("Billion-Scale Similarity Search" paper)
- Native distributed GPU support
- Out-of-the-box optimization strategies:
  - Compressed representation using product quantization methods
  - Approximate nearest neighbor search
Performance: index size 25 GB; > 13 million vectors; NVIDIA V100 (multi-GPU); 4.3 hrs (⌀ 1.2 ms per vector)
Source: https://github.com/facebookresearch/faiss/wiki
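The product quantization named above is what makes 13+ million vectors fit in 25 GB: split each vector into sub-vectors, learn a small codebook per sub-space, and store only the code indices. A toy numpy version (a few Lloyd k-means iterations per sub-space, not Faiss's actual implementation), where 32 float32 dims (128 bytes) compress to 4 one-byte codes:

```python
import numpy as np

def train_pq(data, n_sub=4, n_codes=16, iters=5, seed=0):
    """Learn one k-means codebook per sub-space (toy product quantizer)."""
    rng = np.random.default_rng(seed)
    sub_dim = data.shape[1] // n_sub
    codebooks = []
    for s in range(n_sub):
        sub = data[:, s * sub_dim:(s + 1) * sub_dim]
        centers = sub[rng.choice(len(sub), n_codes, replace=False)]
        for _ in range(iters):  # Lloyd iterations
            assign = np.argmin(((sub[:, None] - centers) ** 2).sum(-1), axis=1)
            for c in range(n_codes):
                if np.any(assign == c):
                    centers[c] = sub[assign == c].mean(axis=0)
        codebooks.append(centers)
    return codebooks

def pq_encode(vecs, codebooks):
    """Compress each vector to one uint8 code index per sub-space."""
    sub_dim = vecs.shape[1] // len(codebooks)
    codes = np.empty((len(vecs), len(codebooks)), dtype=np.uint8)
    for s, centers in enumerate(codebooks):
        sub = vecs[:, s * sub_dim:(s + 1) * sub_dim]
        codes[:, s] = np.argmin(((sub[:, None] - centers) ** 2).sum(-1), axis=1)
    return codes

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 32)).astype(np.float32)
books = train_pq(data, n_sub=4, n_codes=16)
codes = pq_encode(data, books)
# 128 bytes per vector compressed to 4 bytes per vector (plus the codebooks)
```

Distances are then approximated against the codebook centers, trading a little recall for a large memory and speed win, which is the trade-off the talk's ANN setting also makes.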