An analyzer is the component in ES dedicated to text analysis: following a defined set of rules, it breaks a piece of text into individual terms.

An analyzer is made up of the following parts (a combined example follows the list):

  • character filters: pre-process the raw text, e.g. stripping HTML markup
  • tokenizer: splits the raw text into terms according to a set of rules
  • token filters: post-process the terms produced by the tokenizer, e.g. lowercasing, removing, or adding terms
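
To see how the three stages fit together, the _analyze API accepts a char_filter, tokenizer, and token filters inline. A minimal sketch (the sample text is chosen only for illustration) that strips HTML, tokenizes, and lowercases:

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Quick</b> Brown Foxes!"
}
'

This should produce [quick, brown, foxes].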

Built-in Analyzers

ES ships with the following analyzers:

  • standard (default)
  • simple
  • whitespace
  • stop
  • keyword
  • pattern
  • language

Standard Analyzer

The standard analyzer is the default; it is used whenever no analyzer is explicitly specified.

Description & features:

  • tokenizes on word boundaries and works for most languages
  • lowercases terms, removes most punctuation, and supports stop word removal (the stop filter is disabled by default)

The standard analyzer consists of:

  • tokenizer: Standard Tokenizer
  • token filters: Standard Token Filter, Lower Case Token Filter, Stop Token Filter (disabled by default)

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
'

This produces the following output:

[2, running, quick, brown, foxes, leap, over, lazy, dogs, in, the, summer, evening]
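
The standard analyzer also accepts a stop word list; by default none is applied. A minimal sketch, assuming a throwaway index named test_standard_stop:

curl -H 'Content-Type: application/json' -XPUT http://10.244.2.18:9200/test_standard_stop?pretty -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
'

Analyzing the same sentence with std_english would additionally drop "in" and "the".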

Simple Analyzer

Description & features:

  • splits on any non-letter character
  • lowercases terms

The simple analyzer consists of:

  • tokenizer: Lower Case Tokenizer

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
'

This produces the following output:

[running, quick, brown, foxes, leap, over, lazy, dogs, in, the, summer, evening]

Whitespace Analyzer

Description & features:

  • uses whitespace as the delimiter: the text is split into terms whenever any whitespace character is encountered (no lowercasing or punctuation handling)

The whitespace analyzer consists of:

  • tokenizer: Whitespace Tokenizer

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
'

This produces the following output:

[2, running, Quick, brown-foxes, leap, over, lazy, dogs, in, the, summer, evening.]

Stop Analyzer

Description & features:

  • removes stop words, i.e. terms such as "the", "an", "的", "这", and so on

The stop analyzer consists of:

  • tokenizer: Lower Case Tokenizer
  • token filters: Stop Token Filter

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
'

This produces the following output:

[running, quick, brown, foxes, leap, over, lazy, dogs, summer, evening]

Keyword Analyzer

Description & features:

  • performs no tokenization; the entire input is emitted as a single term

The keyword analyzer consists of:

  • tokenizer: Keyword Tokenizer

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
'

This produces the following output:

[2 running Quick brown-foxes leap over lazy dogs in the summer evening.]

Pattern Analyzer

Description & features:

  • splits terms on a regular expression that you can customize; the default is \W+, i.e. any run of non-word characters acts as the delimiter

The pattern analyzer consists of:

  • tokenizer: Pattern Tokenizer
  • token filters: Lower Case Token Filter, Stop Token Filter (disabled by default)

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
'

This produces the following output:

[2, running, quick, brown, foxes, leap, over, lazy, dogs, in, the, summer, evening]
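
Since the pattern is configurable, you can also define a pattern analyzer with your own regex. A sketch, assuming a throwaway index named test_pattern_index, that splits comma-separated values:

curl -H 'Content-Type: application/json' -XPUT http://10.244.2.18:9200/test_pattern_index?pretty -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
'

Analyzing "foo,Bar,baz" with comma_analyzer should then yield [foo, bar, baz] (the pattern analyzer lowercases by default).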

Language Analyzers

ES also provides a set of language-specific analyzers (english, french, cjk, and so on).
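
For example, the english analyzer adds stemming and English stop-word removal on top of standard tokenization; a quick check (the expected output below is an estimate, so verify it against your ES version):

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
'

This should produce roughly [2, run, quick, brown, fox, leap, over, lazi, dog, summer, even], with Porter-stemmed forms such as lazi and even.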

Custom Analysis

When the built-in analyzers do not meet your needs, you can define custom analysis, mainly by customizing the Character Filters, Tokenizer, and Token Filters.

Character Filters

Character filters process the raw text before the tokenizer runs, e.g. adding, removing, or replacing characters; this affects the position and offset information reported by the downstream tokenizer.

The built-in ones are:

  • HTML Strip: removes HTML tags and decodes HTML entities
  • Mapping: performs string replacements (see the sketch after the html_strip example below)
  • Pattern Replace: performs regex-based replacements

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
'

This produces the following output:

[I'm so happy!]
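
The mapping character filter can be tried the same way by defining it inline; the emoticon mappings below are only an illustration:

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        ":) => happy",
        ":( => sad"
      ]
    }
  ],
  "text": "I am :) today, you are :("
}
'

Because the keyword tokenizer keeps the whole input as one term, this should output [I am happy today, you are sad].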

Tokenizer

The tokenizer splits the raw text into terms according to a set of rules.

The built-in ones include:

  • standard: splits on word boundaries
  • letter: splits on non-letter characters
  • whitespace: splits on whitespace
  • UAX URL Email: splits like standard, but keeps e-mail addresses and URLs intact (see the sketch after the path_hierarchy example below)
  • NGram and Edge NGram: produce character n-grams and edge n-grams
  • Path Hierarchy: splits on file-path separators

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "tokenizer": "path_hierarchy",
  "text": "/es/data/log"
}
'

This produces the following output:

[/es, /es/data, /es/data/log]
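
For comparison, the uax_url_email tokenizer keeps e-mail addresses and URLs intact; the address below is made up:

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
'

This should produce [Email, me, at, john.smith@global-international.com], whereas the standard tokenizer would split the address into several terms.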

Token Filters

Token filters add, remove, or modify the terms output by the tokenizer.

The built-in ones include:

  • lowercase: converts all terms to lowercase
  • stop: removes stop words
  • NGram and Edge NGram: split terms into n-grams and edge n-grams
  • Synonym: adds synonyms (see the sketch after the ngram examples below)

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "text": "a Hello,world!",
  "tokenizer": "standard",
  "filter": [
    "stop",
    "lowercase",
    {
      "type": "ngram",
      "min_gram": 4,
      "max_gram": 4
    }
  ]
}
'

This produces the following output:

[hell, ello, worl, orld]

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "text": "a Hello,world!",
  "tokenizer": "standard",
  "filter": [
    "stop",
    "lowercase",
    {
      "type": "ngram",
      "min_gram": 2,
      "max_gram": 3
    }
  ]
}
'

This produces the following output:

[he, hel, el, ell, ll, llo, lo, wo, wor, or, orl, rl, rld, ld]
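
A synonym filter can likewise be defined inline; the quick/fast pair below is only an illustration:

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/_analyze?pretty -d '
{
  "text": "a quick brown fox",
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ]
}
'

This should emit quick and fast at the same position, i.e. [a, quick, fast, brown, fox].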

Custom Analyzers

Create an analyzer named my_custom_analyzer:

curl -H 'Content-Type: application/json' -XPUT http://10.244.2.18:9200/test_index_1?pretty -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
'
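
To verify it, point _analyze at the new index and name the analyzer explicitly; the sample text is just an illustration:

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/test_index_1/_analyze?pretty -d '
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
'

html_strip removes the <b> tags, the standard tokenizer splits the words, and lowercase plus asciifolding turn déjà into deja, so the expected output is [is, this, deja, vu].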

Now for a more involved analyzer:

curl -H 'Content-Type: application/json' -XPUT http://10.244.2.18:9200/test_index_2?pretty -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer2": {
          "type": "custom",
          "tokenizer": "punctuation",
          "char_filter": [
            "emoticons"
          ],
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
'

Checking the result
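
A sketch of how to exercise my_custom_analyzer2 on test_index_2 (the sample sentence is made up):

curl -H 'Content-Type: application/json' -XPOST http://10.244.2.18:9200/test_index_2/_analyze?pretty -d '
{
  "analyzer": "my_custom_analyzer2",
  "text": "Have a nice day :) and good luck :("
}
'

The emoticons mapping rewrites :) and :( to _happy_ and _sad_, the punctuation tokenizer splits on spaces and punctuation, lowercase normalizes case, and english_stop removes "a" and "and", so the expected output is [have, nice, day, _happy_, good, luck, _sad_].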