Large-Scale High-Dimensional Data Query

 • zougp@instreet.cn
Application Background




[Figure: Query Image -> Image DB -> Matched result]

Point-1: [0., 8., 68., 0., 0., 0., 5., 3., 5., 16.,
66., 0., 0., 0., 4., 4., 36., 26., 13., 0., 0., 0.,
0., 5., 7., 1., 0., 0., 0., 0., 0., 0., 0., 17., …]

[Figure: Query points -> Points DB (points set) -> Matched points]
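Methods
High-dimensional data retrieval methods:

1) Linear scan:
An exhaustive sequential scan over the entire vector set.

2) Tree-based indexes:
For example KD-Tree and SR-Tree. Because of the "curse of dimensionality", however, once the
vector dimensionality exceeds about 20, a tree-based index still has to scan most of the
vector set, so it differs little from a linear scan.

3) Vector quantization:
Vector quantization based on K-means clustering (or hierarchical K-means, or approximate
K-means) maps vectors to scalars and builds a "visual vocabulary" for image features. This
approach has had some success on practical image retrieval problems. Note, however, that the
complexity of K-means depends on the number of image features and on the number of clusters
(the "visual words"); once the collection exceeds 1 million images, the time complexity of
indexing and matching is still high.

4) Hash-table-based indexes:
Like vector quantization, hash-table-based methods use a hash function to convert vectors to
scalars for matching. Their greatest advantage is that their time complexity is independent
of the hash table size. Because image feature vectors must be matched inexactly, the hash
algorithm also needs the property of mapping nearby vectors to the same scalar.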
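KNN / ε-NN / (ε, r)-NN

KNN: given a query point q, a positive integer k and a distance metric, a k-nearest-neighbor
query finds the k spatial data objects nearest to q.

ε-NN: given a query point q, a parameter ε and a distance metric, an ε-nearest-neighbor
query finds all spatial data objects whose distance to q is within (1 + ε) times the
nearest distance.

(ε, r)-NN: given a query point q, parameters ε and r, and a distance metric, an
(ε, r)-nearest-neighbor query finds all spatial data objects whose distance to q is
within (1 + ε)r.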
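The three query types can be sketched as exact linear scans. This is a minimal illustration, assuming Euclidean distance and ε as the approximation parameter; the function names and sample points are hypothetical and not taken from the slides.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(points, q, k):
    """Indices of the k points nearest to q (exact, by full scan)."""
    order = sorted(range(len(points)), key=lambda i: dist(points[i], q))
    return order[:k]

def eps_nn(points, q, eps):
    """Indices of all points within (1 + eps) times the nearest distance."""
    d_min = min(dist(p, q) for p in points)
    return [i for i, p in enumerate(points) if dist(p, q) <= (1 + eps) * d_min]

def eps_r_nn(points, q, eps, r):
    """Indices of all points within distance (1 + eps) * r of q."""
    return [i for i, p in enumerate(points) if dist(p, q) <= (1 + eps) * r]

# Illustrative data only.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
q = (0.0, 1.0)

nearest_two = knn(points, q, 2)
within_eps = eps_nn(points, q, 0.5)
within_r = eps_r_nn(points, q, 0.1, 1.0)
```

Each of these scans costs one distance computation per stored point, which is the baseline the indexing methods above try to beat.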
LSH
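LSH (Locality Sensitive Hashing)

P. Indyk and R. Motwani first introduced the concept of LSH in [Indyk & Motwani '98]:

LSH establishes a mapping rule: points in the original high-dimensional data space S are
mapped to a relatively low-dimensional space U such that points close together in S have
images in U that are, with high probability, also close together or even equal. To search,
the query point is mapped into U and the points near its image are looked up; the preimages
of the points found are then the points in S near the query point. This is why the method is
called "locality sensitive". How to construct the mapping, which space to choose, and how to
measure distance in that space are the questions LSH has to address.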
MD5: www.instreet.cn -> abcgjvoiboiiojwnrej




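As the figure above shows, after the points in space are hashed by a locality-sensitive
hash function, the r-nearest neighbors (rNN) of q are likely to land in the same bucket as q
(say, the first bucket): the probability of landing in that bucket is high, above some
threshold p1. Objects outside the (1+c)r neighborhood of q are unlikely to land in that
bucket: the probability is low, below some threshold p2. To reduce collisions as far as
possible, several hash tables can be built, each with multiple buckets.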
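The effect of the thresholds p1 and p2 and of using several hash tables can be illustrated with the standard LSH amplification formula: if each table's key concatenates k independent hash values and there are L tables, a point whose per-hash collision probability is p shares at least one bucket with q with probability 1 - (1 - p^k)^L. The numeric values below (p1 = 0.9, p2 = 0.5, k = 10, L = 20) are assumptions chosen for illustration only.

```python
def hit_probability(p, k, L):
    """Probability of sharing a bucket with the query in at least one of L
    tables, when each table key concatenates k independent hash values that
    each collide with probability p."""
    return 1.0 - (1.0 - p ** k) ** L

# Assumed example values, not taken from the slides.
near = hit_probability(0.9, k=10, L=20)   # a point with collision probability p1
far = hit_probability(0.5, k=10, L=20)    # a point with collision probability p2
```

Raising k pushes both probabilities down, which filters out far points, and raising L pushes them back up, which recovers near points; the two are tuned together.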
Point-1: [0., 8., 68., 0., 0., 0., 5., 3., 5., 16.,
66., 0., 0., 0., 4., 4., 36., 26., 13., 0., 0., 0.,
0., 5., 7., 1., 0., 0., 0., 0., 0., 0., 0., 17., …]

[Figure: Point-1 is passed through Hash function 1 ... Hash function L; hash function i fills Hash Table i, which consists of Bucket 1, Bucket 2, ...]
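LSH Implementation Based on Hamming Distance

1) Map the D-dimensional R-space vector to a C×D-dimensional H-space vector.

2) The hash function selects K dimensions of the C×D-dimensional H-space vector.

3) The L-th hash function maps the K-dimensional H-space vector to a scalar, which is inserted into a bucket of the L-th hash table.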
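The pipeline above can be sketched as follows. The slides do not say how the D-dimensional vector becomes a C×D-dimensional binary vector; this sketch assumes the classic unary coding (each integer coordinate x in [0, C] becomes x ones followed by C - x zeros) and uses the tuple of K sampled bits as the bucket key in place of a scalar. All function names and sample data are hypothetical.

```python
import random
from collections import defaultdict

def unary_encode(v, C):
    """Map a D-dim vector of integers in [0, C] to a C*D-bit vector.
    Each coordinate x becomes x ones followed by C - x zeros (assumed coding)."""
    bits = []
    for x in v:
        bits.extend([1] * x + [0] * (C - x))
    return bits

def make_bit_sampler(n_bits, K, rng):
    """Pick K random bit positions; the hash value is the selected bits."""
    positions = rng.sample(range(n_bits), K)
    return lambda bits: tuple(bits[i] for i in positions)

def build_tables(vectors, C, K, L, seed=0):
    """Build L hash tables, each keyed by its own K-bit sample."""
    rng = random.Random(seed)
    n_bits = C * len(vectors[0])
    samplers = [make_bit_sampler(n_bits, K, rng) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for point_id, v in enumerate(vectors):
        bits = unary_encode(v, C)
        for sampler, table in zip(samplers, tables):
            table[sampler(bits)].append(point_id)
    return samplers, tables

def query(q, C, samplers, tables):
    """Union of the buckets that q falls into across all tables."""
    bits = unary_encode(q, C)
    found = set()
    for sampler, table in zip(samplers, tables):
        found.update(table.get(sampler(bits), []))
    return found

# Illustrative data only.
vectors = [[0, 8, 3], [0, 8, 4], [7, 0, 1]]
samplers, tables = build_tables(vectors, C=8, K=6, L=4)
candidates = query([0, 8, 3], C=8, samplers=samplers, tables=tables)
```

With unary coding, two vectors that differ by 1 in one coordinate differ in exactly one bit, so they collide under most K-bit samples, while distant vectors rarely do.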
[Figure: Image set (Image: id 1, id 2, id 3, ...) -> Points set (point: id 1, id 2, id 3, ...) -> Hash Table 1 ... Hash Table L, each with Bucket 1, Bucket 2, ... The query image is hashed the same way; the matched points set contains the points v_p with \|v_p - v_q\|_H \le r.]
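LSH Algorithm Based on p-Stable Distributions

         For vectors in D-dimensional R space the distance metric is usually the Euclidean
         distance. To apply the Hamming-distance-based LSH algorithm, the Euclidean distance
         must first be converted to a binary Hamming distance, which increases the query time
         and complexity of the algorithm.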
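Data object: a vector v = (v_1, \dots, v_d)

a = (a_1, \dots, a_d), with each a_i drawn from a 2-stable distribution X

a \cdot v = \sum_i a_i v_i \in R   has the same distribution as   (\sum_i |v_i|^2)^{1/2} X = \|v\|_2 X

Metric estimation: a \cdot v can be used to estimate \|v\|_2

Locality sensitivity: a \cdot (v_1 - v_2) = a \cdot v_1 - a \cdot v_2 \sim \|v_1 - v_2\|_2 X

h(v) = a \cdot v, which maps v_1 and v_2 to h(v_1) and h(v_2)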
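The 2-stable property can be checked numerically. The sketch below assumes the standard Gaussian as the 2-stable distribution X and estimates the spread of a · (v_1 - v_2) over many random vectors a; the function name, trial count and sample vectors are illustrative assumptions.

```python
import math
import random

def projection_std(v, trials=20000, seed=0):
    """Sample standard deviation of a . v over random Gaussian vectors a."""
    rng = random.Random(seed)
    samples = [sum(rng.gauss(0.0, 1.0) * x for x in v) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return math.sqrt(var)

# Illustrative vectors only.
v1 = [3.0, 4.0, 0.0]
v2 = [0.0, 0.0, 0.0]
diff = [x - y for x, y in zip(v1, v2)]

true_norm = math.sqrt(sum(x * x for x in diff))  # Euclidean distance between v1 and v2
estimate = projection_std(diff)                  # should be close to true_norm
```

Because the projected difference scales with the Euclidean distance, points that are close in the original space get close values of h, which is what makes h usable as a locality-sensitive hash.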
Data object:

  Mapping:

  Function:

Hash function:   g(v) = (h_1(v), \dots, h_k(v)) = (x_1, \dots, x_k)

Hash function family:

  h_1 = ((\sum_{i=1}^{k} r_i x_i) \bmod c) \bmod table\_size

  h_2 = (\sum_{i=1}^{k} r'_i x_i) \bmod c
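A minimal sketch of the two secondary hashes h_1 and h_2, assuming that the values x_1..x_k are already integers, that c is a large prime (2^32 - 5 here), and that r_i and r'_i are random integers below c. The slides name only c and table_size; everything else in the code is a hypothetical choice.

```python
import random

def make_secondary_hashes(k, c, table_size, seed=0):
    """Return (h1, h2) for k-tuples of integers.
    h1 gives the bucket index, h2 a fingerprint stored inside the bucket."""
    rng = random.Random(seed)
    r1 = [rng.randrange(1, c) for _ in range(k)]  # coefficients r_i (assumed random)
    r2 = [rng.randrange(1, c) for _ in range(k)]  # coefficients r'_i (assumed random)

    def h1(x):
        return (sum(r * xi for r, xi in zip(r1, x)) % c) % table_size

    def h2(x):
        return sum(r * xi for r, xi in zip(r2, x)) % c

    return h1, h2

C_PRIME = 2**32 - 5  # assumption: a prime modulus; the slides only call it c
h1, h2 = make_secondary_hashes(k=4, c=C_PRIME, table_size=1024)

x = (3, -1, 0, 7)    # illustrative (x_1, ..., x_k)
index, fingerprint = h1(x), h2(x)
```

Storing the fingerprint next to each entry lets a lookup tell apart different k-tuples that happen to share a bucket index, without keeping the whole tuple.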
[Figure: point v -> LSH: g_i(v) = (h_1(v), \dots, h_k(v)) = (x_1, \dots, x_k), i = 1 \dots L -> h_1, h_2 -> (index, value) -> key-value store.
Parameters: L, K, Table size, C.
Hash Table i holds <index, value list> entries; each value in a value list carries a Point id and an Image id.]
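The indexing structure in the figure can be sketched as L dictionaries that map an index to a value list of (point id, image id) pairs. This is only an illustration under stated assumptions: each h is taken to be the usual quantized projection floor((a · v + b) / w), where the offset b and bucket width w are not in the slides, and the class and method names are hypothetical.

```python
import math
import random
from collections import defaultdict

class PStableIndex:
    """L hash tables of <index, value list>; values are (point_id, image_id)."""

    def __init__(self, dim, K, L, w=4.0, table_size=4096, c=2**32 - 5, seed=0):
        rng = random.Random(seed)
        self.w, self.table_size, self.c = w, table_size, c
        # One g_i per table: K pairs of (Gaussian vector a, offset b in [0, w)).
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(K)]
            for _ in range(L)
        ]
        self.r = [rng.randrange(1, c) for _ in range(K)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def g(self, i, v):
        """(x_1, ..., x_K) for table i: quantized random projections of v."""
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in self.funcs[i]
        )

    def index_of(self, x):
        """Bucket index of a K-tuple, in the style of h_1."""
        return (sum(r * xi for r, xi in zip(self.r, x)) % self.c) % self.table_size

    def insert(self, point_id, image_id, v):
        for i, table in enumerate(self.tables):
            table[self.index_of(self.g(i, v))].append((point_id, image_id))

    def bucket(self, i, v):
        """Value list that v falls into in table i (empty if none)."""
        return self.tables[i].get(self.index_of(self.g(i, v)), [])

# Illustrative data only.
idx = PStableIndex(dim=4, K=3, L=2)
idx.insert(1, 10, [0.0, 8.0, 68.0, 0.0])
idx.insert(2, 10, [5.0, 3.0, 5.0, 16.0])
hits = idx.bucket(0, [0.0, 8.0, 68.0, 0.0])
```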
[Figure: Image set (Image: id 1, id 2, id 3, ...) -> Points set (point: id 1, id 2, id 3, ...) -> Hash Table 1, Hash Table 2, ..., each holding <index, value list> entries whose values carry a Point id and an Image id. The query image is looked up in the same hash tables.]
[Figure: Query set: bucket set. Value list 1 ... Value list k, each with values carrying a Point id and an Image id -> values -> KNN point set with \|v_q - v_p\|_2 \le r -> Matched images set.]
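The last stage of the query in the figure can be sketched as: merge the value lists of the buckets that were hit, keep the points within Euclidean distance r of the query point, and count how many surviving points each image contributes. The function name, the vote-counting rule and the sample data are assumptions made for illustration; the slides only show the flow.

```python
import math
from collections import Counter

def match_images(query_vec, value_lists, vectors, r):
    """Rank images by the number of candidate points within distance r.

    value_lists: lists of (point_id, image_id) taken from the hit buckets.
    vectors:     dict mapping point_id to its feature vector.
    """
    seen = set()
    votes = Counter()
    for value_list in value_lists:
        for point_id, image_id in value_list:
            if point_id in seen:      # the same point may appear in several tables
                continue
            seen.add(point_id)
            p = vectors[point_id]
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(query_vec, p)))
            if d <= r:
                votes[image_id] += 1
    return votes.most_common()

# Illustrative data only.
vectors = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [10.0, 10.0]}
value_lists = [[(1, 10), (3, 20)], [(1, 10), (2, 10)]]
ranked = match_images([0.0, 0.0], value_lists, vectors, r=1.5)
```

For a query image with many feature points, the same counting would be accumulated over all of its points before ranking.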




        Pseudo-Grid
E2LSH: An LSH Implementation Based on p-Stable Distributions



Experimental data:

    <each feature is a 128-dimensional unsigned int vector>
    Image set   Feature points   Query time (s/100p)   <K,L,Table Size>   Memory consumption
    20          10,000           0.0006                14,91
    200         100,000          0.001                 16,153
    2,000       1,000,000        27.5                  18,360             100%
    20,000      12,000,000       #                     #                  #




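    1. The index is built in memory, so the whole point set must be loaded into memory to build the hashes;
    2. Some intermediate steps consume excessive memory;
    3. The complex <key, value> data structures are all kept in memory, and the excessive memory consumption slows down queries;
    4. Fast on small data sets; returns the (ε, r)-NN points directly, with high accuracy;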
LSH Implementation Based on p-Stable Distributions (Improvement 1)




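  1. Feature point data is loaded from file when the hashes are built;
  2. The number of hash tables is reduced, but the optimal number cannot be known in advance;
  3. A simple <key, value> data structure reduces memory consumption;
  4. Fast, but the (ε, r)-NN points cannot be returned directly: a second query is needed
    among the first C returned data objects; accuracy has not been tested;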



   Image set   Feature points   Query time (s/100p)   <K,L,Table Size>   Memory consumption
   20          10,000           0.0001                16,2
   200         100,000          0.003                 16,4
   2,000       1,000,000        0.05                  16,8
   20,000      12,000,000       1.2                   16,8
LSH Implementation Based on p-Stable Distributions (Improvement 2)




  1. <key, value> pairs are stored in redis, but query speed does not improve noticeably;
  2. <key, value> pairs are easy to read;




   Image set   Feature points   Query time (s/100p)   <K,L,Table Size>   Memory consumption
   20          10,000           0.003                 16,2
   200         100,000          0.5                   16,4
   2,000       1,000,000        5.5                   16,8
   20,000      12,000,000       #                     16,8
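Proposals
• Filter the feature points to reduce the size of the point set;
• Reduce the dimensionality of the feature vectors to cut the time spent on distance comparisons;
• Run a second matching pass between the query image and the images whose points fall in
  colliding buckets;
• Find a solution for storing and querying massive numbers of <key, value> pairs;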
