Дмитрий рассказал о методе снижения размерности многомерных данных – Locality Sensitive Hashing. На примере задачи поиска похожих текстовых документов гости был подробно разобран алгоритм Minhash.
Quick introduction to community detection.
Structural properties of real world networks, definition of "communities", fundamental techniques and evaluation measures.
Quick introduction to community detection.
Structural properties of real world networks, definition of "communities", fundamental techniques and evaluation measures.
Knights tour on chessboard using backtrackingAbhishek Singh
The knight is placed o any block of an empty chess board and moving according to the rules of chess, it must visit each square exacty once. Here in the ppt the algorithm along with some visualisation and explanation is given and problem is solved using backtracking approach.
In computer science, a data structure is a particular way of organizing data in a computer so that it can be used efficiently. Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks.
breadth-first search (BFS) is a strategy for searching in a graph when search is limited to essentially two operations
Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures.
After a long period, I bring you new - fresh Presentation which gives you a brief idea on sub-problem of Dynamic Programming which is called as -"Longest Common Subsequence".I hope this presentation may help to all my viewers....
PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...Umesh Kumar
PPT On Sorting And Searching Concepts In Data Structure. In Many Programming Concepts We Use This Tricks In Algorithms....So Wacth,Learn And Enjoy Study.....Thanks
Knights tour on chessboard using backtrackingAbhishek Singh
The knight is placed o any block of an empty chess board and moving according to the rules of chess, it must visit each square exacty once. Here in the ppt the algorithm along with some visualisation and explanation is given and problem is solved using backtracking approach.
In computer science, a data structure is a particular way of organizing data in a computer so that it can be used efficiently. Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks.
breadth-first search (BFS) is a strategy for searching in a graph when search is limited to essentially two operations
Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures.
After a long period, I bring you new - fresh Presentation which gives you a brief idea on sub-problem of Dynamic Programming which is called as -"Longest Common Subsequence".I hope this presentation may help to all my viewers....
PPT On Sorting And Searching Concepts In Data Structure | In Programming Lang...Umesh Kumar
PPT On Sorting And Searching Concepts In Data Structure. In Many Programming Concepts We Use This Tricks In Algorithms....So Wacth,Learn And Enjoy Study.....Thanks
In this talk I will discuss how to deduplicate large amounts of source code using the source{d} stack, and more specifically the Apollo project. The 3 steps of the process used in Apollo will be detailed, ie: - the feature extraction step; - the hashing step; - the connected component and community detection step; I'll then go on describing some of the results found from applying Apollo to Public Git Archive, as well as the issues I faced and how these issues could have been somewhat avoided. The talk will be concluded by discussing Gemini, the production-ready sibling project to Apollo, and imagining applications that could extract value from Apollo.
After a quick introduction on the motivation behind Apollo, as said in the abstract I'll describe each step of Apollo's process. As a rule of thumb I'll first describe it formally, then go into how we did it in practice.
Feature extraction: I'll describe code representation, specifically as UASTs, then from there detail the features used. This will allow me to differentiate Apollo from it's inspiration, DejaVu, and talk about code clones taxonomy a bit. TF-IDF will also be touched upon. Hashing: I'll describe the basic Minhashing algorithm, then the improvements Sergey Ioffe's variant brought. I'll justify it's use in our case simultaneously. Connected components/Community detection: I'll describe the connected components and community notion's first (as in in graphs), then talk about the different ways we can extract them from the similarity graph.
After this I'll talk about the issues I had applying Apollo to PGA due to the amount of data, and how I went around the major issued faced. Then I'll go on talking about the results, show some of the communities, and explain in light of these results how issues could have been avoided, and the whole process improved. Finally I'll talk about Gemini, and outline some of the applications that could be imagined to Source code Deduplication.
How to make hash functions go fast inside snarks, aka a guided tour through arithmetisation friendly hash functions (useful for all cryptographic protocols where cost is dominated by multiplications -- e.g. anything using R1CS; secret sharing based multiparty computation protocols; etc)
Course : Introduction to Big Data with Apache Spark : http://ouo.io/Mqc8L5
Course : Spark Fundamentals I : http://ouo.io/eiuoV
Course : Functional Programming Principles in Scala : http://ouo.io/rh4vv
I am Blake H. I am an Algorithm Exam Expert at programmingexamhelp.com. I hold a PhD. in Programming, from Curtin University, Australia. I have been helping students with their exams for the past 10 years. You can hire me to take your exam in Algorithm.
Visit programmingexamhelp.com or email support@programmingexamhelp.com. You can also call on +1 678 648 4277 for any assistance with the Algorithm Exam.
Monads and Monoids: from daily java to Big Data analytics in Scala
Finally, after two decades of evolution, Java 8 made a step towards functional programming. What can Java learn from other mature functional languages? How to leverage obscure mathematical abstractions such as Monad or Monoid in practice? Usually people find it scary and difficult to understand. Oleksiy will explain these concepts in simple words to give a feeling of powerful tool applicable in many domains, from daily Java and Scala routines to Big Data analytics with Storm or Hadoop.
Similar to Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: Locality Sensitive Hashing (20)
Автоматизация без тест-инженеров по автоматизации, Мария Терехина и Владислав...Mail.ru Group
В рамках доклада мы поделимся примерами проектов, на которых есть автоматизация, но нет ни одного специально выделенного инженера для выполнения задач, связанных с автоматизацией тестирования. Затронем такие вопросы как:
что нас привело к такому решению (отказаться от test automation инженеров);
сложности, с которыми мы столкнулись;
бонусы, которые мы в итоге получили.
BDD для фронтенда. Автоматизация тестирования с Cucumber, Cypress и Jenkins, ...Mail.ru Group
Автоматизация тестирования UI — это всегда непростая задача, особенно в условиях активной разработки и постоянного изменения требований. Как мы решали эту проблему в mall.my.com. Как и почему пришли к BDD. Какие инструменты выбрали. И что из этого вышло.
Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail...Mail.ru Group
Использование Fiddler и Charles при тестировании фронтенда проекта pulse.mail.ru;
Свежий взгляд на Fiddler и его сравнение с Clumsy и Charles;
Небольшой обзор и сравнение функционала Fiddler и Charles.
Управление инцидентами в Почте Mail.ru, Антон ВикторовMail.ru Group
что такое инциденты и почему это важно;
как из непонятного сделать «рутину»;
про автоматизацию: OTRS, Jira, чат-боты;
про диагностику: логирование, как работает Bomgar;
про сообщество: специальная программа тестирования почты для сотрудников.
На сегодняшний день такие популярные анализаторы, как OWASP ZAP и Burp Suite, не всегда хорошо справляются с задачей автоматического сканирования приложений. Нередко они не могут найти какие-то специфические директории, автоматически отправить запрос без участия человека. И чаще данные инструменты запускаются локально. При этом, если в компании хорошо работает команда по автоматизации тестирования, их работу можно взять за основу динамического анализа и фазинга.
Как бонус, обсудим разницу Burp Suite Professional и Burp Suite Enterprise с точки зрения CI/CD и подключения автоматизированных тестов.
Почему вам стоит использовать свой велосипед и почему не стоит Александр Бел...Mail.ru Group
Почему каждый DL-инженер должен написать свою либу для обучения сеток, а потом отказаться от неё.
Расскажу про опыт написания kekas-а, и почему в своей команде мы пользуемся pytorch-lightning как более зрелым решением.
CV в пайплайне распознавания ценников товаров: трюки и хитрости Николай Масл...Mail.ru Group
Расскажу про различные полезные библиотеки и функции Python: от простых и известных, до специфичных и редких. Поделюсь тем, какие технологии мы используем при разработке, обучении и деплое наших моделей: что помогало улучшить качество, а что тормозило разработку.
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidiaMail.ru Group
Все мы знаем, что наш любимый Pandas исключительно однопоточный, а модели из scikit-learn часто учатся не очень быстро даже в несколько процессов. Поэтому в докладе я расскажу о проекте RAPIDS - наборе библиотек для анализа данных и построения предиктивных моделей с использованием NVIDIA GPU. В докладе я предложу подискутировать о том, что закон Мура больше не выполняется, рассмотрю принципы работы архитектуры CUDA. Разберу библиотеки cuDF и cuML, а также постараюсь предельно честно рассказать о том, ждать ли чуда от перехода на GPU и в каких случаях чудо неизбежно.
WebAuthn в реальной жизни, Анатолий ОстапенкоMail.ru Group
Я расскажу, как мы поддержали вход через WebAuthn в самом крупном почтовом сервисе рунета и какие сложности скрываются за красивыми презентациями о том, какой WebAuthn простой и безопасный:
как сделать WebAuthn понятным и доступным для пользователей;
как поддержать его во всех браузерах и устройствах;
как тестировать WebAuthn, в том числе автоматизированно;
куда двигаться дальше после его запуска и включения.
AMP для электронной почты, Сергей ПешковMail.ru Group
Библиотека AMP — это не только современный инструмент создания богатых функциональностью и производительных web-сайтов, адаптированных для работы на мобильных устройствах. AMP для электронной почты радикально обновляет традиционный формат электронных писем, позволяя создавать более привлекательные и полезные для пользователя рассылки.
В Почте Mail.ru очень вдохновляют новые возможности, которые может предоставить нашим пользователям и партнерам AMP для электронной почты. Этот доклад о том:
почему стандарт для по-настоящему интерактивных электронных писем не получалось создать раньше;
что из себя представляет стандарт AMP4Email, какие новые способы взаимодействия с письмом он дает;
как с его помощью повысить ценность рассылки для пользователя;
как мы реализовали поддержку AMP4Email в своих продуктах и обеспечили его безопасность;
как AMP4Email может повысить конверсию на примере внедрения AMP-рассылок в партнерстве с крупнейшим сервисом электронной коммерции в России.
Кейсы использования PWA для партнерских предложений в Delivery Club, Никита Б...Mail.ru Group
Delivery Club — крупнейшая фудтех-платформа в России, которая объединяет более 12 000 ресторанов разной ценовой категории в более чем 120 городах.
Мы разработали приложение для наших партнеров, в котором они могут управлять заказами, меню, ингредиентами, статистикой в удобном интерфейсе. В докладе пойдет речь о том, как внедрение практик PWA помогло нам улучшить пользовательский опыт, решить вопросы, связанные с работой приложения на разных платформах. И как поддержка offline-режима избавила нас от проблем с вечными перепадами сети у наших партнеров.
Этика искусственного интеллекта, Александр Кармаев (AI Journey)Mail.ru Group
AI Journey — двухдневная конференция с ведущими международными и российскими спикерами — экспертами в области искусственного интеллекта и анализа данных, а также представителями компаний — лидеров по развитию и применению технологий ИИ в бизнес-процессах.
Нейро-машинный перевод в вопросно-ответных системах, Федор Федоренко (AI Jour...Mail.ru Group
AI Journey — двухдневная конференция с ведущими международными и российскими спикерами — экспертами в области искусственного интеллекта и анализа данных, а также представителями компаний — лидеров по развитию и применению технологий ИИ в бизнес-процессах.
Конвергенция технологий как тренд развития искусственного интеллекта, Владими...Mail.ru Group
AI Journey — двухдневная конференция с ведущими международными и российскими спикерами — экспертами в области искусственного интеллекта и анализа данных, а также представителями компаний — лидеров по развитию и применению технологий ИИ в бизнес-процессах.
Обзор трендов рекомендательных систем от Пульса, Андрей Мурашев (AI Journey)Mail.ru Group
AI Journey — двухдневная конференция с ведущими международными и российскими спикерами — экспертами в области искусственного интеллекта и анализа данных, а также представителями компаний — лидеров по развитию и применению технологий ИИ в бизнес-процессах.
Мир глазами нейросетей, Данила Байгушев, Александр Сноркин ()Mail.ru Group
AI Journey — двухдневная конференция с ведущими международными и российскими спикерами — экспертами в области искусственного интеллекта и анализа данных, а также представителями компаний — лидеров по развитию и применению технологий ИИ в бизнес-процессах.
Cancer cell metabolism: special Reference to Lactate PathwayAADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Kreb's cycle. The Kreb's cycle allows cells to “burn” the pyruvates made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Kreb's - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
introduction to WARBERG PHENOMENA:
WARBURG EFFECT Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose than do normal cells from outside.
Otto Heinrich Warburg (; 8 October 1883 – 1 August 1970) In 1931 was awarded the Nobel Prize in Physiology for his "discovery of the nature and mode of action of the respiratory enzyme.
WARNBURG EFFECT : cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
What is greenhouse gasses and how many gasses are there to affect the Earth.moosaasad1975
What are greenhouse gasses how they affect the earth and its environment what is the future of the environment and earth how the weather and the climate effects.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is a branch of science dealing the study of interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflect spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light received by the analyte.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
3. Today's problem: near duplicates detection
Find all pairs of data points within distance threshold
Given: High dimensional data points
And some distance function
· ( , , . . . )x1 x2
Image
User-Item (rating, whatever) matrix
-
-
· d( , )x1 x2
Euclidean distance
Cosine distance
Jaccard distance
-
-
-
J(A, B) =
|A ∩ B|
|A ∪ B|
,xi xj d( , ) < sxi xj
/
4. Near-neighbor search
Applications
High Dimensions
High-dimensional spaces are lonely places
Duplicate detection (plagiarism, entity resolution)
Recommender systems
Clustering
·
·
·
Bag-of-words model
Large-Scale Recommender Systems (many users vs many items)
·
·
Curse of dimensionality·
almost all pairs of points are equally far away from one another-
/
5. Finding similar text documents
Examples
Challenges
Mirror websites
Similar news articles (google news, yandex news?)
·
·
Pieces of one document can appear out of order (headers, footers, etc), diffrent lengths.
Large collection of documents can not fit in RAM
Too many documents to compare all pairs - complexity
·
·
· O( )n
2
/
6. Pipeline
Pieces of one document can appear out of order (headers, footers, etc) =>
Shingling
Documents as sets
Large collection of documents can not fit in RAM =>
Minhashing
Convert large sets to short signatures, while preserving similarity
Too many documents to compare all pairs - complexity =>
Locality Sensitive Hashing
Focus on pairs of signatures likely to be from similar documents.
·
·
·
·
· O( )n
2
·
/
7. Document representation
Example phrase:
"To be, or not to be, that is the question"
Documents as sets
Bag-of-words => set of words:
k-shingles or k-gram => unordered set of k-grams:
·
{"to", "be", "or", "not", "that", "is", "the", "question"}
don't work well - need ordering
-
-
·
word level (for k = 2)
caharacter level (for k = 3):
-
{"to be", "be or", "or not", "not to", "be that", "that is", "is the", "the
question"}
-
-
{"to ", "o b", " be", "be ", "e o", " or", "or ", "r n", " no", "not", "ot
", "t t", " to", "e t", " th", "tha", "hat", "at ", "t i", " is", "is ", "s
t", "the", "he ", "e q", " qu", "que", "ues", "est", "sti", "tio", "ion"}
-
/
8. Practical notes
Optionally can compress (hash!) long shingles into 4 byte integers!
Picking
should be picked large enough that the probability of any given shingle appearing in any given
document is low
k
is OK for short documents
is OK for long documents
· k = 5
· k = 10
k
/
10. Type of rows
type s1 s2
a 1 1
b 1 0
c 0 1
d 0 0
A, B, C, D - # rows types a, b, c, d
J( , ) =s1 s2
A
(A + B + C)
/
11. Minhashing
Convert large sets to short signatures
Minhashing example
1. Random permutation of rows of the input-matrix .
2. Minhash function = # of first row in which column .
3. Use independent permutations we will end with minhash functions. => can construct signature-
matrix from input-matrix using these minhash functions.
M
h(c) c == 1
N N
/
12. Property
(h( ) = h( )) =???Pperm s1 s2
=
Why? Both
remember this result
·
J( , )s1 s2
· A
(A+B+C)
·
/
13. Implementation of Minhashing
One-pass implementation
Universal hashing
random permutation
random lookup
·
·
1. Instead of permutations pick independent hash-functions ( )
2. For column and hash-function keep slot . Init with +Inf.
3. Scan rows looking for 1
N N N = O(1/ )ϵ2
c hi Sig(i, c)
Suppose row has 1 in column
Then for each : If =>
· r c
· ki (r) < Sig(i, c)hi Sig(i, c) := (r))hi
(x) = ((ax + b) mod p)hi
- integers
- large prime:
· a, b
· p p > N
/
14. Algorithm
for each row r do begin
for each hash function hi do
compute hi (r);
for each column c
if c has 1 in row r
for each hash function hi do
if hi(r) is smaller than M(i, c) then
M(i, c) := hi(r);
end;
/
16. Job for LSH:
1. Divide M into bands, rows each
2. Hash each column in band into table with large number of buckets
3. Column become candidate if fall into same bucket for any band
b r
bi
/
17. Bands number tuning
1. The probability that the signatures agree in all rows of one particular band is
2. The probability that the signatures disagree in at least one row of a particular band is
3. The probability that the signatures disagree in at least one row of each of the bands is
4. The probability that the signatures agree in all the rows of at least one band, and therefore become a
candidate pair, is
s
r
1 − s
r
(1 − s
r
)
b
1 − (1 − s
r
)
b
/
18. Example
Goal : tune and to catch most similar pairs, but few non-similar pairs
100k documents =>
100 hash-functions => signatures of 100 integers = 40mb
find all documents, similar at least
·
·
· b = 20, r = 5
· s = 0.8
1. Probability C1, C2 identical in one particular band:
2. Probability C1, C2 are not similar in all of the 20 bands:
(0.8 = 0.328)
5
(1 − 0.328 = 0.00035)
20
b r
/
20. References
https://www.coursera.org/course/mmds
R package LSHR - https://github.com/dselivanov/LSHR
Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets. http://mmds.org/
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of
dimensionality.
Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji: Hashing for Similarity Search: A Survey
Alexandr Andoni and Piotr Indyk : Near-Optimal Hashing Algorithms for Approximate Nearest
Neighbor in High Dimensions
Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni: Locality-Sensitive Hashing Scheme
Based on p-Stable Distributions
·
·
·
·
·
/