Full-text search and the inverted index

Full-text search vs. exact-value search

1. exact value

With 2017-01-01 stored as an exact value, a search only matches if you enter exactly 2017-01-01. If you enter just 01, nothing is found.

2. full text
(1) abbreviation vs. full name: cn vs. china
(2) word forms: like, liked, likes
(3) upper vs. lower case: Tom vs. tom
(4) synonyms: like vs. love

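2017-01-01 is split into 2017, 01, 01, so searching for 2017 or for 01 finds it
china: searching for cn also finds china
likes: searching for like also finds likes
Tom: searching for tom also finds Tom
like: searching for the synonym love also finds like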

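In other words, full-text search does not just match one complete value. It matches against the words a value is split into (tokenization), and it can also match through abbreviations, tenses, letter case, synonyms, and so on.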
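The difference between the two modes can be sketched in a few lines of Python. This is only an illustration: exact_match and full_text_match are hypothetical helper names, and the tokenizer here just lowercases and splits on non-alphanumeric characters, which is far simpler than a real analyzer.

```python
import re


def exact_match(field_value, query):
    # Exact value: the whole stored value must equal the query.
    return field_value == query


def full_text_match(field_value, query):
    # Full text: split the stored value into lowercase tokens,
    # then check whether the (lowercased) query is one of them.
    tokens = re.findall(r"[a-z0-9]+", field_value.lower())
    return query.lower() in tokens


print(exact_match("2017-01-01", "01"))      # whole value differs
print(full_text_match("2017-01-01", "01"))  # "01" is one of the tokens
print(full_text_match("Tom", "tom"))        # case is ignored
```

Abbreviations and synonyms (cn vs. china, like vs. love) are not handled by this sketch; they need an extra mapping step.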

Inverted index

doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

Tokenization and building a first inverted index

word     doc1   doc2
I        *      *
really   *
liked    *      *
my       *      *
small    *
dogs     *      *
and      *
think    *
mom      *      *
also     *
them     *
He              *
never           *
any             *
so              *
hope            *
that            *
will            *
not             *
expect          *
me              *
to              *
him             *

This shows the simplest possible way an inverted index is built.
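The same construction can be sketched in Python. This is a minimal illustration, not how a search engine implements it: tokenize and build_inverted_index are hypothetical names, and the tokenizer simply pulls out runs of letters without any normalization.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}


def tokenize(text):
    # Split into words; punctuation is dropped, case is kept as-is.
    return re.findall(r"[A-Za-z]+", text)


def build_inverted_index(docs):
    # Map each word to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return dict(index)


index = build_inverted_index(DOCS)
for word, doc_ids in index.items():
    print(word, sorted(doc_ids))
```

Each printed line corresponds to one row of the word/doc1/doc2 table.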

Search

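Search for: mother like little dog

This cannot return any results. The query is split into mother, like, little, and dog, and none of these four words appears in the index as built so far.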

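Is that the search result we want? Definitely not. To us, mother and mom are synonyms. like and liked mean the same thing, one in the present tense and one in the past tense. little and small are synonyms. dog and dogs differ only in being singular vs. plural.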

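Normalization: when the inverted index is built, each word produced by tokenization goes through an extra processing step, which raises the chance that a later search finds the related documents.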

This covers tense conversion, singular/plural conversion, synonym conversion, and case conversion.

mother --> mom (the synonyms mom and mother collapse to one term)
liked --> like
small --> little
dogs --> dog

Rebuild the inverted index with normalization applied, then search for mother like little dog again. This time the search finds matches.

word     doc1   doc2   normalized from
I        *      *
really   *
like     *      *      liked --> like
my       *      *
little   *             small --> little
dog      *      *      dogs --> dog
and      *
think    *
mom      *      *
also     *
them     *
He              *
never           *
any             *
so              *
hope            *
that            *
will            *
not             *
expect          *
me              *
to              *
him             *

The query mother like little dog goes through the same tokenization and normalization:

mother --> mom
like --> like
little --> little
dog --> dog

Both doc1 and doc2 are returned.

doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
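The normalized index and the search against it can be sketched as follows. Several things here are assumptions made for illustration: the NORMALIZATION lookup table is hand-written for this one example (a real analyzer uses stemmers and synonym rules instead), the sketch lowercases every word, and search returns a document if it contains any one of the query terms.

```python
import re

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Hypothetical hand-written normalization table for this example only.
NORMALIZATION = {"mother": "mom", "liked": "like", "small": "little", "dogs": "dog"}


def analyze(text):
    # Tokenize, lowercase, then map each word to its normalized term.
    terms = []
    for word in re.findall(r"[A-Za-z]+", text):
        word = word.lower()
        terms.append(NORMALIZATION.get(word, word))
    return terms


def build_index(docs):
    # term -> set of ids of the documents containing that term
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    # The query goes through the same analysis as the documents.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


index = build_index(DOCS)
results = search(index, "mother like little dog")
print(results)
```

The key design point is that analyze is shared by indexing and searching, so both sides agree on what a term looks like.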

Underlying data structure of the inverted index

An inverted index is a structure well suited to search.

Structure of the inverted index

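(1) the list of documents that contain the term
(2) the number of documents that contain the term: the document frequency, from which IDF (inverse document frequency) is computed
(3) the number of times the term appears in each document: TF (term frequency)
(4) the position of the term within each document
(5) the length of each document: length norm
(6) the average length of all documents that contain the term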

word doc1 doc2
dog * *
hello *
you *
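One way to picture the per-term information listed above is the sketch below. Everything in it is illustrative: the two sample documents are made up, Posting and the helper functions are hypothetical names, and the IDF shown is the plain textbook form log(N / df), not the exact formula any particular search engine uses.

```python
import math
import re
from dataclasses import dataclass, field


@dataclass
class Posting:
    term_frequency: int = 0                         # TF in one document
    positions: list = field(default_factory=list)   # token positions


def build(docs):
    postings = {}      # term -> {doc_id: Posting}
    doc_lengths = {}   # doc_id -> number of tokens
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        doc_lengths[doc_id] = len(tokens)
        for position, term in enumerate(tokens):
            posting = postings.setdefault(term, {}).setdefault(doc_id, Posting())
            posting.term_frequency += 1
            posting.positions.append(position)
    return postings, doc_lengths


def document_frequency(postings, term):
    # Number of documents that contain the term.
    return len(postings.get(term, {}))


def idf(postings, doc_count, term):
    # Textbook IDF: rarer terms get a higher value.
    df = document_frequency(postings, term)
    return math.log(doc_count / df) if df else 0.0


def average_length(postings, doc_lengths, term):
    # Average length of the documents that contain the term.
    doc_ids = postings.get(term, {})
    if not doc_ids:
        return 0.0
    return sum(doc_lengths[d] for d in doc_ids) / len(doc_ids)


# Made-up sample documents for illustration.
docs = {"doc1": "hello dog dog", "doc2": "you dog"}
postings, doc_lengths = build(docs)
print(postings["dog"])
print(idf(postings, len(docs), "hello"))
```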

Benefits of an immutable inverted index

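(1) no locking is needed, which improves concurrency and avoids lock-related problems
(2) because the data never changes, it can stay in the OS cache for as long as there is enough cache memory
(3) the filter cache can stay resident in memory, because the data never changes
(4) the index can be compressed, which reduces disk I/O and the memory needed to cache it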

Drawback of an immutable inverted index: to make any change searchable, the entire index has to be rebuilt.
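The trade-off can be illustrated with a small Python sketch. It is only an analogy under stated assumptions: build_index is a hypothetical helper, the sample documents are made up, and read-only Python containers stand in for an index written once and never modified.

```python
from types import MappingProxyType


def build_index(docs):
    # Build the postings once, then freeze them so they cannot change.
    postings = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings.setdefault(term, set()).add(doc_id)
    frozen = {term: frozenset(ids) for term, ids in postings.items()}
    return MappingProxyType(frozen)  # read-only view of the mapping


docs = {"doc1": "hello dog", "doc2": "you dog"}
index = build_index(docs)

# Readers can share `index` without locks because nothing mutates it.
# Making a new document searchable means building a whole new index.
docs_v2 = {**docs, "doc3": "hello you"}
index_v2 = build_index(docs_v2)

print(sorted(index["hello"]))
print(sorted(index_v2["hello"]))
```

The old index is left untouched, so searches already running against it keep seeing a consistent view.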