寻找相关词 | Elasticsearch: 权威指南

寻找相关词 | Elasticsearch: 权威指南 | Elastic

2026-01-11

请注意:
本书基于 Elasticsearch 2.x 版本，有些内容可能已经过时。

» » »

寻找相关词编辑

短语查询和邻近查询都很好用，但仍有一个缺点。它们过于严格了：为了匹配短语查询，所有词项都必须存在，即使使用了 slop 。

用 slop 得到的单词顺序的灵活性也需要付出代价，因为失去了单词对之间的联系。即使可以识别 sue 、 alligator 和 ate 相邻出现的文档，但无法分辨是 Sue ate 还是 alligator ate 。

当单词相互结合使用的时候，表达的含义比单独使用更丰富。两个子句 I’m not happy I’m working 和 I’m happy I’m not working 包含相同的单词，也拥有相同的邻近度，但含义截然不同。

如果索引单词对而不是索引独立的单词，就能对这些单词的上下文尽可能多的保留。

对句子 Sue ate the alligator ，不仅要将每一个单词（或者 unigram ）作为词项索引

["sue", "ate", "the", "alligator"]

也要将每个单词 以及它的邻近词 作为单个词项索引：

["sue ate", "ate the", "the alligator"]

这些单词对（或者 bigrams ）被称为 shingles 。

Shingles 不限于单词对；你也可以索引三个单词（ trigrams ）：

["sue ate the", "ate the alligator"]

Trigrams 提供了更高的精度，但是也大大增加了索引中唯一词项的数量。在大多数情况下，Bigrams 就够了。

当然，只有当用户输入的查询内容和在原始文档中顺序相同时，shingles 才是有用的；对 sue alligator 的查询可能会匹配到单个单词，但是不会匹配任何 shingles 。

幸运的是，用户倾向于使用和搜索数据相似的构造来表达搜索意图。但这一点很重要：只是索引 bigrams 是不够的；我们仍然需要 unigrams ，但可以将匹配 bigrams 作为增加相关度评分的信号。

生成 Shingles编辑

Shingles 需要在索引时作为分析过程的一部分被创建。我们可以将 unigrams 和 bigrams 都索引到单个字段中，但将它们分开保存在能被独立查询的字段会更清晰。unigrams 字段将构成我们搜索的基础部分，而 bigrams 字段用来提高相关度。

首先，我们需要在创建分析器时使用 shingle 语汇单元过滤器：

DELETE /my_index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,  
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, 
                    "max_shingle_size": 2, 
                    "output_unigrams":  false   
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":             "custom",
                    "tokenizer":        "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" 
                    ]
                }
            }
        }
    }
}

	参考被破坏的相关度！。
	默认最小/最大的 shingle 大小是 `2` ，所以实际上不需要设置。
	`shingle` 语汇单元过滤器默认输出 unigrams ，但是我们想让 unigrams 和 bigrams 分开。
	`my_shingle_analyzer` 使用我们常规的 `my_shingles_filter` 语汇单元过滤器。

首先，用 analyze API 测试下分析器：

GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator

果然，我们得到了 3 个词项：

sue ate
ate the
the alligator

现在我们可以继续创建一个使用新的分析器的字段。

多字段编辑

我们曾谈到将 unigrams 和 bigrams 分开索引更清晰，所以 title 字段将创建成一个多字段（参考字符串排序与多字段）：

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type":     "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}

通过这个映射， JSON 文档中的 title 字段将会被以 unigrams (title)和 bigrams (title.shingles)被索引，这意味着可以独立地查询这些字段。

最后，我们可以索引以下示例文档:

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }

搜索 Shingles编辑

为了理解添加 shingles 字段的好处，让我们首先来看 The hungry alligator ate Sue 进行简单 match 查询的结果：

GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}

这个查询返回了所有的三个文档，但是注意文档 1 和 2 有相同的相关度评分因为他们包含了相同的单词：

{
  "hits": [
     {
        "_id": "1",
        "_score": 0.44273707, 
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "2",
        "_score": 0.44273707, 
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "3", 
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

	两个文档都包含 `the` 、 `alligator` 和 `ate` ，所以获得相同的评分。
	我们可以通过设置 `minimum_should_match` 参数排除文档 3 ，参考控制精度。

现在在查询里添加 shingles 字段。不要忘了在 shingles 字段上的匹配是充当一种信号--为了提高相关度评分--所以我们仍然需要将基本 title 字段包含到查询中：

GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}

仍然匹配到了所有的 3 个文档，但是文档 2 现在排到了第一名因为它匹配了 shingled 词项 ate sue.

{
  "hits": [
     {
        "_id": "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

即使查询包含的单词 hungry 没有在任何文档中出现，我们仍然使用单词邻近度返回了最相关的文档。

Performance性能编辑

shingles 不仅比短语查询更灵活，而且性能也更好。 shingles 查询跟一个简单的 match 查询一样高效，而不用每次搜索花费短语查询的代价。只是在索引期间因为更多词项需要被索引会付出一些小的代价，这也意味着有 shingles 的字段会占用更多的磁盘空间。然而，大多数应用写入一次而读取多次，所以在索引期间优化我们的查询速度是有意义的。

这是一个在 Elasticsearch 里会经常碰到的话题：不需要任何前期进行过多的设置，就能够在搜索的时候有很好的效果。一旦更清晰的理解了自己的需求，就能在索引时通过正确的为你的数据建模获得更好结果和性能。

« 性能优化部分匹配 »

官方地址：https://www.elastic.co/guide/cn/elasticsearch/guide/current/shingles.html

有任何技术问题请点击这里网站运营推广招聘

IT PHP 编程语言开发编程 Linux 科技 Elasticsearch HTML/CSS/XML 面试数据库网络 JAVA NoSQL C/C++ Golang 操作系统 Git 算法正则表达式 Redis 互联网 MySql 软件运维 JavaScript 国际架构设计 Mac OS TCP/IP Excel Windows Oracle Socket VR Vim MongoDB 运营 Python MemCache 商业硬件电子娱乐设计摄影 nginx WordPress 游戏 HTTP 团建数码电器 Docker 大模型

Elasticsearch集群模式知多少携程Elasticsearch数据同步实践 Elasticsearch是做什么的以及它的使用和基本原理 Elasticsearch简介与实战 elasticsearch动态映射 elasticsearch配置如何配置使用Elasticsearch的动态映射 (dynamic mapping) elasticsearch最新版安装 Elasticsearch集群高亮搜索两节点Elasticsearch集群 elasticsearch集群部署文档 elasticsearch集群分布式特性安装elasticsearch的java环境确认 elasticsearch集群新机搭建 Elasticsearch Mapping设置 Elasticsearch集群节点(角色)类型解释node.master和node.data [Elasticsearch集群分页]from-size VS scroll-scan es-ik插件安装 ElasticSearch自带的分词类型 elasticsearch[5.5]全文匹配 —— Common Terms Query常用术语查询

略微加速

Elasticsearch权威指南 - 互联网笔记

寻找相关词编辑

生成 Shingles编辑

多字段编辑

搜索 Shingles编辑

Performance性能编辑

略微加速

Elasticsearch权威指南 - 互联网笔记

寻找相关词编辑

生成 Shingles编辑

多字段编辑

搜索 Shingles编辑

Performance性能编辑

Getting Started Videos