The document summarizes a multiple sorting method called SketchSort for performing all pairs similarity search on large-scale datasets. It maps vector data to binary sketches to reduce memory usage, then applies locality sensitive hashing and multiple sorting to efficiently find all pairs of data points within a given distance threshold. The method is evaluated on large image, chemical compound, and genome sequence datasets and is shown to outperform other state-of-the-art similarity search methods.
Hash Functions, the MD5 Algorithm and the Future (SHA-3)Dylan Field
This was filmed at the Sonoma State University mathematics colloquium on November 5th, 2008. In the talk, Dylan speaks about hash functions, their applications and attacks on them. He specifically focuses on the design of the MD5 algorithm. Dylan also gives a preview of what is in store for the future of hashes- the SHA-3 competition put on by the NIST. For a video of this presentation, visit http://www.vimeo.com/2409021
The following resources come from the 2009/10 BEng in Digital Systems and Computer Engineering (course number 2ELE0065) from the University of Hertfordshire. All the mini projects are designed as level two modules of the undergraduate programmes.
The objectives of this module are to demonstrate, within an embedded development environment:
• Processor – to – processor communication
• Multiple processors to perform one computation task using parallel processing
This project requires the establishment of a communication protocol between two 68000-based microcomputer systems. Using ‘C’, students will write software to control all aspects of complex data transfer system, demonstrating knowledge of handshaking, transmission protocols, transmission overhead, bandwidth, memory addressing. Students will then demonstrate and analyse parallel processing of a mathematical problem using two processors. This project requires two students working as a team.
This is the story of how I built an all-seeing eye with Ruby, and how I use it to defend the sanctity of my suburban home.
Using a Raspberry Pi and some homemade motion detection software I've developed a home security system that can send me notifications on my phone and photograph intruders. It uses perceptual hashes to detect image changes and archives anything unusual. I can even set a custom alerting threshold and graph disturbances over time.
If you've ever had the desire to be an evil wizard with a glowing fireball of an eye this talk is perfect for you. Come play with Sauron.
David and Kanwen and Carlos implemented a speech recognition system on an FPGA development board (Altera DE2 Board) for the Design Project course at McGill (ECSE 494).
본 장에서는 C언어의 관계연산자, 논리연산자, 비트연산자에 대해 다루어 보겠습니다.
- Youtube 강의동영상
https://youtu.be/XGPVLztgiOI
- 코드는 여기에서 다운 받으세요
https://github.com/dongupak/Basic-C-Programming
Describes the math::exact module in Tcllib, which performs exact calculations over the computable reals. Numbers are represented as objects that contain programs to define the number by successive approximations.
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
Model Attribute Check Company Auto PropertyCeline George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
Normal Labour/ Stages of Labour/ Mechanism of LabourWasim Ak
Normal labor is also termed spontaneous labor, defined as the natural physiological process through which the fetus, placenta, and membranes are expelled from the uterus through the birth canal at term (37 to 42 weeks
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic
harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or
unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied
students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Synthetic Fiber Construction in lab .pptxPavel ( NSTU)
Synthetic fiber production is a fascinating and complex field that blends chemistry, engineering, and environmental science. By understanding these aspects, students can gain a comprehensive view of synthetic fiber production, its impact on society and the environment, and the potential for future innovations. Synthetic fibers play a crucial role in modern society, impacting various aspects of daily life, industry, and the environment. ynthetic fibers are integral to modern life, offering a range of benefits from cost-effectiveness and versatility to innovative applications and performance characteristics. While they pose environmental challenges, ongoing research and development aim to create more sustainable and eco-friendly alternatives. Understanding the importance of synthetic fibers helps in appreciating their role in the economy, industry, and daily life, while also emphasizing the need for sustainable practices and innovation.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
Safalta Digital marketing institute in Noida, provide complete applications that encompass a huge range of virtual advertising and marketing additives, which includes search engine optimization, virtual communication advertising, pay-per-click on marketing, content material advertising, internet analytics, and greater. These university courses are designed for students who possess a comprehensive understanding of virtual marketing strategies and attributes.Safalta Digital Marketing Institute in Noida is a first choice for young individuals or students who are looking to start their careers in the field of digital advertising. The institute gives specialized courses designed and certification.
for beginners, providing thorough training in areas such as SEO, digital communication marketing, and PPC training in Noida. After finishing the program, students receive the certifications recognised by top different universitie, setting a strong foundation for a successful career in digital marketing.
"Protectable subject matters, Protection in biotechnology, Protection of othe...
Sketch sort ochadai20101015-public
1. Seminar@ochanomizu university , October 14, 2010(Thu.)
Yasuo Tabei(JST Minato ERATO Project)
Joint work with Takeaki Uno(NII),
Masashi Sugiyama (TITECH), Koji Tsuda (AIST)
2. Motivation
- Large-scale data
- Needs for all pairs similarity search method
- Single sorting sethod, and its drawbacks
Method
- Multiple sorting method
- Locality sensitive hashing
- SketchSort
Experiments
- Comparison of other state-of-the art methods
- Use large-scale image datasets
3. Image Chemical Compounds
- 80 million tiny images - NCBI PubChem
(Torralba et al., (2008)) - 28 million chemical
- size: 32×32 pixes compounds
Genome Sequences
- NCBI Sequence Read Archive
- A large-scale genome sequences
from various organisms
5. Mapping vector to binary string (sketch)
- Conserve the distance in the original space
x=(0.3, 0.1, 0.5, 0.6, 0.7, 1.2, -0.2,…)
Mapping
s=1010010001110001010…
Advantages
- Can keep giga-scale data in main memory
- Accelerate various algorithms
6. Finding all neighbor pairs from vector data
- Given a set of data-points
- Find all pairs within a distance ,
xi , x j Δ( xi , x j ε
s.t. )
ε
7. Can build a neighborhood graph
- Vertex: a data-point
- Edge : a neighbor pair
Applications: semi-supervised learning, spectral
clustering, ROI detection in images, retrieval of
protein sequences, etc
ε
9. Need a large number of distance calcuration for
achieving reasonable accuracy
Can not derive an analytical estimation of the
fraction of missing neighbors
10. Motivation
- Large-scale data
- Needs for all pairs similarity search method
- Single sorting sethod, and its drawbacks
Method
- Multiple sorting method
- Locality sensitive hashing
- SketchSort
Experiments
- Comparison of other state-of-the art methods
- Use large-scale image datasets
11. Input: set of fixed-length strings S={s1,…,sn }
Output: all pairs of strings within a Hamming
distance d
By appling radixsort, enumerate all pairs in O(n+m)
- n: number of strings, m: output pairs
Introduce block-wise masking technique for acceleration
12. Sort strings by radixsort, divide strings into equivalence
classes O(n)
Draw edges within all strings in an equivalece class O(m)
Computational Complexity: O(n+m)
EMILY ALICE
DAVID ALICE
CHRIS BOBBY
ALICE Sort CHRIS
Equivalence
DAVID DAVID Classes
BOBBY DAVID
DAVID DAVID
ALICE EMILY
13. Mask d characters in all possible ways
l
Performe radixsort d times
Linear time to the number of strings
Time exponential to d, polynomial to the length of
strings l
Ex)d=2 7:0000 0001 0011 1110 7:0000 0001 0011 1110
4:0100 0001 1101 1100 4:0100 0001 1101 1100
8:0101 1001 0111 1000 8:0101 1001 0111 1000
10:1001 0011 1001 0111 5:1010 0010 1110 1010
5:1010 0010 1110 1010 3:1100 1000 1101 1100
1:1011 1111 0011 1110 6:1111 0011 1001 0111
2:1101 0111 0111 0001 10:1001 0011 1001 0111
3:1100 1000 1101 1100 2:1101 0111 0111 0001
9:1101 1000 1101 1110 9:1101 1000 1101 1110
6:1111 0011 1001 0111 1:1011 1111 0011 1110
18. 1 2 3 4
Step1: Make a total order among
6:1111 0011 1001 0111
blocks from left to right 10:1001 0011 1001 0111
Step2: Make a total order among
block combinations (1,2)<(1,3)<(1,4)
<(2,3)<(2,4)<(3,4)
Step3: Take the minimum among
matched block combinations
Combination 1 Combination 2
1 2 3 4 1 2 3 4
6:1111 0011 1001 0111 6:1111 0011 1001 0111
10:1001 0011 1001 0111 10:1001 0011 1001 0111
(1,2) < (1,4)
19. If the number of blocks is k-d,
Eliminate duplicate pairs, and
Calculate Hamming distance
Call function to equivalence
classes
20. Enumerates all neighbor pairs within a distance
- (xi, xj), i < j, Δ(xi,xj) ≦ε,
Basic idea
- Map vector data to sketches by LSH
- Enumerate all neighbor pairs by MSM
SketchSort with cosine LSH
- Enumerate all neighbor pairs within a consine
distance threshold ε xiT x j
- (xi, xj), i < j,
Δ( xi , x j ) 1 ≦ε
|| xi |||| x j ||
Applicable to Euclidean distance (Raginsky,10),
Jaccard-coffecients
21. Basic idea
Generate a random hyperplain centered at 0
Map vector data to ‘1 if a
data-points is above the hyperplain,
or else ‘0’
Repeat l times 10….
l
1
1
1 0….
0 0
22. Basic idea: Map vector data to sketches and apply MSM
Not good: create long sketches and apply MSM at once
Divide long sketches to Q short sketches of length l
(chunks)
Apply MSM to each chunk, obtain neighbor pairs w.r.t
Hamming distance
l l l
1100101010010101 0101010101001010 101010101010011
0101010101010101 0101000101010101 010101010010100
1001010101011110 1010101010101010 101010101010100
1000001010101010 1010101010101010 101010010101011
1111001010111110 1010101011111010 010101010101010 ......
1010101010010101 0100111101010100 001111010001011
1000010000100001 0011111101000100 010010010001111
1111000001110101 0101001010101001 001000111100101
0001010010101001 0100101011100100 101001010001000
1111000100100010 0011010100010010 010010100010001
MSM MSM MSM
Report neighbor pairs no more than a cosine distance
threshold ε
23. A neighbor pair of sketches can be detected in
several chunks within Hamming distance d
Si1100101010010101 0101010101001010 101010101010011 10101111010011 111010101001001 …
Sj0101010101010101 0101000101010101 010101010010100 11110101010011 111111111010011 …
HamDist(si1,sj1)>d HamDist(si3,sj3)>d HamDist(si5,sj5)≦d
HamDist(si2,sj2)≦d HamDist(si4,sj4)>d
Duplication!!
The same pair is outputted several times
(Duplication)
24. Step1: Order chunks from left to right.
1 2 3 4 5
1100101010010101 0101010101001010 101010101010011 10101111010011 111010101001001 …
0101010101010101 0101000101010101 010101010010100 11110101010011 111111111010011 …a
Step2: Check whether left chunks are no more than
Hamming distance d
1 2 3 4 5
1100101010010101 0101010101001010 101010101010011 10101111010011 111010101001001 …
0101010101010101 0101000101010101 010101010010100 11110101010011 111111111010011 …a
HamDist(si1,sj1)>d? HamDist(si3,sj3)>d? HamDist(si5,sj5)≦d
HamDist(si2,sj2)>d? HamDist(si4,sj4)>d?
- If such chunk is found, trash the pair,
Or else check cosine distance
26. Call function for
each chunk
Check duplication
four times
Divide sketches into
equivalence classes
Call function recursively
27. True edges E*, Our results E
Type-I error (false positive): A non-neighbor pair has
a Hamming distance within d in at least one chunk
Type II-error (false negative): A neighbor pair has a
Hamming distance larger than d in all chunks
28. Basically, type-II error is more crucial
- type-I errors are filtered out by distance calculations
Missing edge ratio (type-II error) is bounded as
where p is an upper bound of the non-collision
probability of neighbors
29. Motivation
- Large-scale data
- Needs for All Pairs Similarity Search Method
- Single Sorting Method, and its drawbacks
Method
- Multiple Sorting Method
- Cosine Locality Sensitive Hashing
- SketchSort
Experiments
- Comparison of other state-of-the art methods
- Use large-scale image datasets
30. Two image datasets
- MNIST (60,000 data, 748 dimension)
- Tiny Image (100,000 data, 960 dimension)
Use missing edge ratio as an evaluation measure
Set cosine distance threshold of 0.15π
Length of each chunk to 32bit
Hamming distance and number of blocks are set
to (2,5) and (3,6).
Make number of chunks vary from 2, 4, 6, …, 50
Compare our method to Lanczos bisection method
(JMLR, 2009)
31.
32.
33. K-nearest neighbor graph construction by
SketchSort
- Keep k-nearest neighbor pairs by priority queue
Compare SketchSort to
- Cover Tree (Beygelzimer et al., ICML 2006)
- AllKNN (Ram et al., NIPS 2009),
- Lanczos-bisection (JMLR, 2009)
34.
35.
36. Set parameters so as to keep missing edge ratio
no more than 1.0×10-6
Enable to detect similar pairs nearly exactly
Take only 4.3 hours for 1.6 million images
0.05π 0.1π 0.15π
37. Fast all pairs similarity search method
Applicable to large-scale vector data
Various applications
Software
- http://code.google.com/p/sketchsort/