The document discusses duplicate detection in online marketplaces with large amounts of user-generated content. It describes a two-step framework for finding duplicate listings: candidate selection to identify potentially duplicate pairs, followed by candidate scoring using machine learning to identify true duplicates. Key aspects include using category, location, seller data, and image hashes to select candidate pairs, and training ML models on text and image similarity features to classify pairs as duplicates or not. Elasticsearch is used to index hashes at scale for fuzzy matching of image duplicates.
Fighting fraud: finding duplicates at scale (Highload+ 2019)
79. Image embeddings
CNN → Dim 1k+ → SVD* → Dim 100 → LSH → 36a93c34a3abff
* TruncatedSVD works best
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
80. LSH: Random Projections
● Close in the original space ⇒ close in the projection
● Far in the original space ⇒ far in the projection
81. LSH: Random Projections
● Generate the projection vectors once and store them somewhere
● Use the vectors to reduce the dimensionality and compute the hash
● Store the hash in the database
https://www.slideshare.net/AlexeyGrigorev/duplicates-everywhere
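The random-projection steps above can be sketched in plain Python. This is a minimal illustration, not the speaker's implementation: the function names (`make_projections`, `lsh_hash`, `hamming`), the 64-bit hash length, the fixed seed, and the sign-of-dot-product rule are all assumptions chosen for the example. The 100-dimensional input stands in for a CNN embedding after SVD.

```python
import random


def make_projections(n_bits, dim, seed=0):
    """Generate n_bits random Gaussian vectors once; reuse them for every image."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_hash(vector, projections):
    """Emit one bit per projection (1 if the dot product is positive), as hex."""
    bits = 0
    for projection in projections:
        dot = sum(a * b for a, b in zip(vector, projection))
        bits = (bits << 1) | (1 if dot > 0 else 0)
    n_hex = (len(projections) + 3) // 4
    return format(bits, "0{}x".format(n_hex))


def hamming(hash_a, hash_b):
    """Number of differing bits between two hex hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Hypothetical data: a random vector stands in for a reduced image embedding.
projections = make_projections(n_bits=64, dim=100)
rng = random.Random(42)
embedding = [rng.gauss(0.0, 1.0) for _ in range(100)]
near = [x + rng.gauss(0.0, 0.01) for x in embedding]  # slightly perturbed copy
far = [-x for x in embedding]                         # opposite direction

print(lsh_hash(embedding, projections))
print(hamming(lsh_hash(embedding, projections), lsh_hash(near, projections)))
print(hamming(lsh_hash(embedding, projections), lsh_hash(far, projections)))
```

The perturbed copy is expected to differ from the original in few bits, while the opposite vector flips every bit. The hex string is what would be stored in the database.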
83. Plan
● User generated content
○ Fraud and duplicates
○ Content moderation systems
● Duplicate detection framework
○ Step 1: Selecting candidates
○ Step 2: Scoring candidates with Machine Learning
○ Image hashes
● Implementation
○ Elasticsearch
○ Image index system
84. Why Elasticsearch?
● Well-known, convenient, stable and scalable inverted index (thanks, Lucene!)
Direct index (ImageID → Hash):
1 → 00fc
2 → 12ec
3 → 00fc
4 → ebe4
5 → 7a1f
6 → 00fc
7 → 8ef4
8 → 12ec
Inverted index (Hash → ImageID):
00fc → 1, 3, 6
12ec → 2, 8
ebe4 → 4
7a1f → 5
8ef4 → 7
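As a small illustration of why the inverted index helps, the sketch below builds it from the direct index shown on the slide. The helper names `invert` and `find_duplicates` are made up for this example; Elasticsearch (via Lucene) maintains the equivalent structure internally.

```python
from collections import defaultdict

# Direct index from the slide: ImageID -> Hash
direct_index = {
    1: "00fc", 2: "12ec", 3: "00fc", 4: "ebe4",
    5: "7a1f", 6: "00fc", 7: "8ef4", 8: "12ec",
}


def invert(direct):
    """Build the inverted index: Hash -> sorted list of ImageIDs."""
    inverted = defaultdict(list)
    for image_id, image_hash in sorted(direct.items()):
        inverted[image_hash].append(image_id)
    return dict(inverted)


inverted_index = invert(direct_index)


def find_duplicates(image_id):
    """Return the other images that share this image's hash (one dictionary lookup)."""
    image_hash = direct_index[image_id]
    return [other for other in inverted_index[image_hash] if other != image_id]


print(inverted_index)
print(find_duplicates(1))
```

With only the direct index, finding duplicates of an image means scanning every entry. With the inverted index it is a single lookup by hash.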
85. Elasticsearch for hashes
Indexed document:
{
  "_id": "cafebabe",
  "_source": {
    "title": "new iphone",
    "description": "new iphone almost not used",
    "hashes": ["94088af86c038327", ... ]
  }
}
Query for the hash "94088af86c038327":
{
  "query": {
    "bool": {
      "must": [{
        "term": {
          "hashes": "94088af86c038327"
        }
      }]
    }
  }
}
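The query body can also be generated programmatically. The sketch below only builds the JSON and does not call Elasticsearch. The function name `hash_query` and the `min_matches` parameter are assumptions for illustration. Where the slide uses `must` with a single `term`, this variant uses `should` with `minimum_should_match` (both standard parts of the bool query) so that a listing with several image hashes can be matched on any of them. It assumes the `hashes` field is indexed for exact matching, for example as a keyword field.

```python
import json


def hash_query(hashes, min_matches=1):
    """Build a bool query matching documents that share at least min_matches hashes."""
    return {
        "query": {
            "bool": {
                "should": [{"term": {"hashes": h}} for h in hashes],
                "minimum_should_match": min_matches,
            }
        }
    }


# One hash reproduces the lookup from the slide.
query = hash_query(["94088af86c038327"])
print(json.dumps(query, indent=2))
```

Raising `min_matches` would make the candidate selection stricter for listings with many images.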
107. CNN+LSH
Options for deploying image models:
● Lambda
○ Tricky to do: HUGE TF binaries
○ 66 USD per 1 mln
○ Worth it when load is low (< 1 mln)
● Kubernetes
○ Easier to do: docker + existing cluster
○ More expensive for low load
○ Better for 1+ mln images per day
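The trade-off can be made concrete with some arithmetic. The sketch below reads "66 USD per 1 mln" as the Lambda cost per million images processed, which is an interpretation of the slide. The fixed daily Kubernetes cost (`k8s_usd_per_day`) is a purely hypothetical input, since the slides give no figure for it.

```python
LAMBDA_USD_PER_MILLION = 66.0  # figure from the slide, read as USD per 1 mln images


def lambda_cost(images):
    """Pay-per-use cost of processing this many images on Lambda."""
    return images / 1_000_000 * LAMBDA_USD_PER_MILLION


def cheaper_option(images_per_day, k8s_usd_per_day):
    """Compare Lambda's usage-based cost with a hypothetical fixed daily cluster cost."""
    if lambda_cost(images_per_day) < k8s_usd_per_day:
        return "lambda"
    return "kubernetes"


# Hypothetical example: the cluster share for the image model costs 50 USD per day.
for volume in (100_000, 1_000_000, 5_000_000):
    print(volume, round(lambda_cost(volume), 2), cheaper_option(volume, 50.0))
```

Under these assumed numbers the break-even sits a little below one million images per day, which matches the rule of thumb on the slide.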
114-116. ML
[Architecture diagram, built up over three slides. Components shown: the listing text ("Such description", "So much text"), the automatic moderation system, s3, ES, hashes, and the duplicate detection system, with Accept / Reject outcomes. Slides 115 and 116 add the moderation queue, the moderation panel (MP), and the moderators, who also accept or reject.]
118. Summary
● Fraud and duplicates often come together
● Use heuristics to find duplicate candidates and ML to find duplicates
● Image hashes are a good and easy way to find duplicate images
● Neural networks can be used for hashing as well
● Elasticsearch is good for finding duplicates (inverted index!)
● AWS Lambda can scale up and down with no human involvement
● Simple things (e.g. hashes) - better in AWS Lambda
● Complex heavy things (e.g. neural nets) - Kubernetes
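The two-step framework from the summary (heuristics for candidates, then scoring) can be sketched end to end. Everything here is illustrative: the listing fields, the blocking key (category and location), and the helper names are assumptions. The scoring step in particular is a hand-written stand-in (Jaccard title similarity plus a shared-hash flag) used in place of the trained ML model the talk describes.

```python
from collections import defaultdict
from itertools import combinations


def select_candidates(listings):
    """Step 1: pair listings that share a blocking key or an image hash."""
    buckets = defaultdict(set)
    for item in listings:
        buckets[("block", item["category"], item["location"])].add(item["id"])
        for image_hash in item["hashes"]:
            buckets[("hash", image_hash)].add(item["id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(text_a, text_b):
    """Token-level Jaccard similarity of two strings."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def score(pair, by_id):
    """Step 2 stand-in: fixed weights here, where the talk would use a trained model."""
    a, b = by_id[pair[0]], by_id[pair[1]]
    shared_hash = 1.0 if set(a["hashes"]) & set(b["hashes"]) else 0.0
    return 0.5 * jaccard(a["title"], b["title"]) + 0.5 * shared_hash


# Hypothetical listings
listings = [
    {"id": 1, "title": "new iphone", "category": "phones",
     "location": "berlin", "hashes": ["94088af86c038327"]},
    {"id": 2, "title": "new iphone almost not used", "category": "phones",
     "location": "berlin", "hashes": ["94088af86c038327"]},
    {"id": 3, "title": "old bicycle", "category": "bikes",
     "location": "berlin", "hashes": ["00fc"]},
    {"id": 4, "title": "iphone case", "category": "phones",
     "location": "berlin", "hashes": ["12ec"]},
]
by_id = {item["id"]: item for item in listings}
candidates = select_candidates(listings)
duplicates = sorted(p for p in candidates if score(p, by_id) >= 0.5)
print(sorted(candidates))
print(duplicates)
```

Candidate selection keeps the number of pairs to score small, so the more expensive scoring step only runs on plausible duplicates.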