Introduction

局部敏感哈希(Locality Sensitive Hashing)的基本思想：让相邻的点落入同一个“桶”中，在进行最近邻搜索时，只需要在一个桶，或者相邻的几个桶内进行搜索。属于近似邻近查找（Approximate Nearest Neighbor, ANN）的一种方法.

notion

这里假设$h(x)$表示$x$哈希变化之后的值， $x$和$y$是任意两个样本，$d(x,y)$表示distance between $x$ and $y$, $d_1, d_2, p_1, p_2$均为常数，并且 $d_1 < d_2$, 以及 $p_1, p_2 \in [0,1]$, 如果 $h(x)$ 满足以下两个条件，则称哈希函数$h(x)$ is $(d_1, d_2, p_1,p_2)-sensitive$

当$d(x,y)<= d_1$, then $h(x) = h(y)$的概率至少为$p_1$
当$d(x,y)>=d_2$, then $h(x) = h(y)$ 的概率最多为$p_2$

假设有m个 $(d_1,d_2,p_1,p_2)-sensitive$哈希函数，每个hash函数有$w_i$个桶

work

（可离线）将各个样本映射到hash桶中，构建hash表索引
将查询数据映射到m个桶
使用and 或者 or汇总m个桶的样本获得候选集
在候选集中进行KNN

distance

一些常见的距离度量

欧式距离

将$v$进行映射， $r$为一个随机向量：$h(v) = v \cdot r$, 将结果分桶得到hash函数如下：

$$ \mathrm{h}^{\mathrm{r}, \mathrm{b}}(\mathrm{v})=\left\lfloor\frac{\mathrm{v} \cdot \mathrm{r}+\mathrm{b}}{\mathrm{w}}\right\rfloor $$

这里，$w$是分桶宽度， $b\in (0,w)$的随机变量，目的是避免分桶边界固定化？
余弦距离

余弦夹角的符号最为hash函数：$H(v) = sign(v \cdot r)$

references

https://blog.csdn.net/coffee_cream/article/details/109146143

Approximate nearest neighbors: towards removing the curse of dimensionality

Similarity search in high dimensions via hashing