穿越之休妻翻身记免费无弹窗 | 穿越之休妻翻身记最新..."/> 2024-09-03 14:00·九派观天下"/>

腾博tengbo9885官网

穿越之休妻翻身记免费无弹窗 | 穿越之休妻翻身记最新...

穿越之休妻翻身记免费无弹窗 | 穿越之休妻翻身记最新...

《穿越之休妻翻身记免费无弹窗 | 穿越之休妻翻身记最新...》剧情简介:2024-09-03 14:00·九派观天下穿越之休妻翻身记免费无弹窗 | 穿越之休妻翻身记最新...一起基础配景

《穿越之休妻翻身记免费无弹窗 | 穿越之休妻翻身记最新...》视频说明:自己继续随着杜海峰虽然杜海峰一个大老粗很好掌握可是很顽强只要他有什么特定的想法说未必就凭证自己的性情起来到时间极有可能成事缺乏败事有余到时间他都自身难保了还能包管当初给自己的允许吗elasticsearch分词器 character filter ,tokenizer,token filter2023-09-02 18:14·孫攀龍分词器:规范化:normalization字符过滤器:character filter分词器:tokenizer令牌过滤器:token filter无论是内置的剖析器(analyzer)照旧自界说的剖析器(analyzer)都由三种构件块组成的:character filters  tokenizers  token filters内置的analyzer将这些构建块预先打包到适合差别语言和文本类型的analyzer中Character filters (字符过滤器)字符过滤器以字符流的形式吸收原始文本并可以通过添加、删除或更改字符来转换该流举例来说一个字符过滤器可以用来把阿拉伯数字(??????????)转成成Arabic-Latin的等价物(0123456789)一个剖析器可能有0个或多个字符过滤器它们按顺序应用(PS:类似Servlet中的过滤器或者阻挡器想象一下有一个过滤器链)Tokenizer (分词器)一个分词器吸收一个字符流并将其拆分成单个token (通常是单个单词)并输出一个token流例如一个whitespace分词器当它看到空缺的时间就会将文本拆分成token它会将文本Quick brown fox!转换为[Quick, brown, fox!](PS:Tokenizer 认真将文本拆分成单个token 这里token就指的就是一个一个的单词就是一段文本被支解成好几部分相当于Java中的字符串的 split )分词器还认真纪录每个term的顺序或位置以及该term所体现的原单词的最先和竣事字符偏移量(PS:文本被分词后的输出是一个term数组)一个剖析器必需只能有一个分词器Token filters (token过滤器)token过滤器吸收token流并且可能会添加、删除或更改tokens例如一个lowercase token filter可以将所有的token转成小写stop token filter可以删除常用的单词好比 the synonym token filter可以将同义词引入token流不允许token过滤器更改每个token的位置或字符偏移量一个剖析器可能有0个或多个token过滤器它们按顺序应用小结&回首analyzer(剖析器)是一个包这个包由三部分组成划分是:character filters (字符过滤器)、tokenizer(分词器)、token filters(token过滤器)一个analyzer可以有0个或多个character filters一个analyzer有且只能有一个tokenizer一个analyzer可以有0个或多个token filterscharacter filter 是做字符转换的它吸收的是文本字符流输出也是字符流tokenizer 是做分词的它吸收字符流输出token流(文本拆分后酿成一个一个单词这些单词叫token)token filter 是做token过滤的它吸收token流输出也是token流由此可见整个analyzer要做的事情就是将文本拆分成单个单词文本 ----> 字符 ----> token1 normalization:文档规范化,提高召回率停用词时态转换巨细写同义词语气词#normalizationGET _analyze{ "text": "Mr. Ma is an excellent teacher", "analyzer": "english"}2 字符过滤器(character filter):分词之前的预处置惩罚过滤无用字符HTML StripMappingPattern ReplaceHTML Strip##HTML Strip Character Filter###测试数据

I'm so happy!

DELETE my_indexPUT my_index{ "settings": { "analysis": { "char_filter": { "my_char_filter(自界说的剖析器名字)":{ "type":"html_strip", "escaped_tags":["a"] } }, "analyzer": { "my_analyzer":{ "tokenizer":"keyword", "char_filter":["my_char_filter(自界说的剖析器名字)"] } } } }}GET my_index/_analyze{ "analyzer": "my_analyzer", "text": "

I'm so happy!

"}Mapping##Mapping Character Filter DELETE my_indexPUT my_index{ "settings": { "analysis": { "char_filter": { "my_char_filter":{ "type":"mapping", "mappings":[ "滚 => *", "垃 => *", "圾 => *" ] } }, "analyzer": { "my_analyzer":{ "tokenizer":"keyword", "char_filter":["my_char_filter"] } } } }}GET my_index/_analyze{ "analyzer": "my_analyzer", "text": "你就是个垃圾滚"}Pattern Replace##Pattern Replace Character Filter #17611001200DELETE my_indexPUT my_index{ "settings": { "analysis": { "char_filter": { "my_char_filter":{ "type":"pattern_replace", "pattern":"(\\d{3})\\d{4}(\\d{4})", "replacement":"$1****$2" } }, "analyzer": { "my_analyzer":{ "tokenizer":"keyword", "char_filter":["my_char_filter"] } } } }}GET my_index/_analyze{ "analyzer": "my_analyzer", "text": "您的手机号是17611001200"}3 令牌过滤器(token filter)--停用词、时态转换、巨细写转换、同义词转换、语气词处置惩罚等好比:has=>have him=>he apples=>apple the/oh/a=>干掉巨细写时态停用词同义词语气词#token filterDELETE test_indexPUT /test_index{ "settings": { "analysis": { "filter": { "my_synonym": { "type": "synonym_graph", "synonyms_path": "analysis/synonym.txt" } }, "analyzer": { "my_analyzer": { "tokenizer": "ik_max_word", "filter": [ "my_synonym" ] } } } }}GET test_index/_analyze{ "analyzer": "my_analyzer", "text": ["蒙丢丢大G犷悍daG"]}GET test_index/_analyze{ "analyzer": "ik_max_word", "text": ["疾驰G级"]}近义词匹配DELETE test_indexPUT /test_index{ "settings": { "analysis": { "filter": { "my_synonym": { "type": "synonym", "synonyms": ["赵,钱,孙,李=>吴","周=>王"] } }, "analyzer": { "my_analyzer": { "tokenizer": "standard", "filter": [ "my_synonym" ] } } } }}GET test_index/_analyze{ "analyzer": "my_analyzer", "text": ["赵,钱,孙,李","周"]}巨细写#巨细写GET test_index/_analyze{ "tokenizer": "standard", "filter": ["lowercase"], "text": ["AASD ASDA SDASD ASDASD"]}GET test_index/_analyze{ "tokenizer": "standard", "filter": ["uppercase"], "text": ["asdasd asd asg dsfg gfhjsdf asfdg g"]}#长度小于5的转大写GET test_index/_analyze{ "tokenizer": "standard", "filter": { "type": "condition", "filter":"uppercase", "script": { "source": "token.getTerm().length() < 5" } }, "text": ["asdasd asd asg dsfg gfhjsdf asfdg g"]}转小写转大写长度小于5的转大写停用词https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-stop-tokenfilter.html#停用词DELETE test_indexPUT /test_index{ "settings": { "analysis": { "analyzer": { "my_analyzer自界说名字": { "type": "standard", "stopwords":["me","you"] } } } }}GET test_index/_analyze{ "analyzer": "my_analyzer自界说名字", "text": ["Teacher me and you in the china"]}#####返回 teacher and you in the china官计划例:官方支持的 token filterhttps://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-stop-tokenfilter.html4 分词器(tokenizer):切词默认分词器:standard(英文切割凭证空缺切割)中文分词器:ik分词https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-whitespace-tokenizer.html设置内置的剖析器内置的剖析器不必任何设置就可以直接使用虽然默认设置是可以更改的例如standard剖析器可以设置为支持阻止字列表:curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'{ "settings": { "analysis": { "analyzer": { "std_english": { "type": "standard", "stopwords": "_english_" } } } }, "mappings": { "_doc": { "properties": { "my_text": { "type": "text", "analyzer": "standard", "fields": { "english": { "type": "text", "analyzer": "std_english" } } } } } }}'在这个例子中我们基于standard剖析器来界说了一个std_englisth剖析器同时设置为删除预界说的英语阻止词列表后面的mapping中界说了my_text字段用standardmy_text.english用std_english剖析器因此下面两个的分词效果会是这样的:curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'{ "field": "my_text", "text": "The old brown cow"}'curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'{ "field": "my_text.english", "text": "The old brown cow"}'第一个由于用的standard剖析器因此分词的效果是:[ the, old, brown, cow ]第二个用std_english剖析的效果是:[ old, brown, cow ]--------------------------Standard Analyzer (默认)---------------------------若是没有特殊指定的话standard 是默认的剖析器它提供了基于语法的标记化(基于Unicode文天职割算法)适用于大大都语言例如:curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{ "analyzer": "standard", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."}'上面例子中那段文本将会输出如下terms:[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]-------------------案例3---------------------标准剖析器接受下列参数:max_token_length : 最大token长度默认255stopwords : 预界说的阻止词列表如_english_ 或 包括阻止词列表的数组默认是 _none_stopwords_path : 包括阻止词的文件路径curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'{ "settings": { "analysis": { "analyzer": { "my_english_analyzer": { "type": "standard", "max_token_length": 5, "stopwords": "_english_" } } } }}'curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'{ "analyzer": "my_english_analyzer", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."}'以上输出下列terms:[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]---------------------界说--------------------standard剖析器由下列两部分组成:TokenizerStandard TokenizerToken FiltersStandard Token FilterLower Case Token FilterStop Token Filter (默认被禁用)你还可以自界说curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'{ "settings": { "analysis": { "analyzer": { "rebuilt_standard": { "tokenizer": "standard", "filter": [ "lowercase" ] } } } }}'-------------------- Simple Analyzer---------------------------simple 剖析器当它遇到只要不是字母的字符就将文本剖析成term并且所有的term都是小写的例如:curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{ "analyzer": "simple", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."}'输入效果如下:[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]5 常见分词器:standard analyzer:默认分词器中文支持的不睬想会逐字拆分keyword分词器差池输入的text内容做热呢和处置惩罚而是将整个输入text作为一个tokenpattern tokenizer:以正则匹配脱离符把文本拆分成若干词项simple pattern tokenizer:以正则匹配词项速率比pattern tokenizer快whitespace analyzer:以空缺符脱离 Tim_cookie6 自界说分词器:custom analyzerchar_filter:内置或自界说字符过滤器 token filter:内置或自界说token filter tokenizer:内置或自界说分词器分词器(Analyzer)由0个或者多个字符过滤器(Character Filter)1个标记天生器(Tokenizer)0个或者多个标记过滤器(Token Filter)组成说白了就是将一段文本经由处置惩罚后输出成单个单个单词PUT custom_analysis{ "settings":{ "analysis":{ } }}#自界说分词器DELETE custom_analysisPUT custom_analysis{ "settings": { "analysis": {#第一步:字符过滤器 吸收原始文本并可以通过添加删除或者更改字符来转换字符串转换成可识别的的字符串 "char_filter": { "my_char_filter": { "type": "mapping", "mappings": [ "& => and", "| => or" ] }, "html_strip_char_filter":{ "type":"html_strip", "escaped_tags":["a"] } }, "filter": { #第三步:令牌(token)过滤器 吸收切割好的token流(单词term)并且会添加删除或者更改tokens 如:lowercase token fileter可以把所有token(单词)转成小写stop token filter停用词可以删除常用的单词; synonym token filter 可以将同义词引入token流 "my_stopword": { "type": "stop", "stopwords": [ "is", "in", "the", "a", "at", "for" ] } }, "tokenizer": {#第2步:分词器切割点切割成一个个单个的token(单词)并输出token流它会将文本Quick brown fox!转换为[Quick, brown, fox!]就是一段文本被支解成好几部分 "my_tokenizer": { "type": "pattern", "pattern": "[ ,.!?]" } }, "analyzer": { "my_analyzer":{ "type":"custom",#告诉 "char_filter":["my_char_filter","html_strip_char_filter"], "filter":["my_stopword","lowercase"], "tokenizer":"my_tokenizer" } } } }}GET custom_analysis/_analyze{ "analyzer": "my_analyzer", "text": ["What is ,as.df ss

in ? &

| is ! in the a at for "]}------------------------------自义定2---------------------------------------------curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'{ "settings": { "analysis": { "analyzer": { "rebuilt_simple": { "tokenizer": "lowercase", "filter": [ ] } } } }}'Whitespace Analyzerwhitespace 剖析器当它遇到空缺字符时就将文本剖析成terms示例:curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{ "analyzer": "whitespace", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."}'输出效果如下:[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]------------------------------Stop Analyzer-----------------top 剖析器 和 simple 剖析器很像唯一差别的是stop 剖析器增添了对删除阻止词的支持默认用的阻止词是 _englisht_(PS:意思是假设有一句话this is a apple并且假设this 和 is都是阻止词那么用simple的话输出会是[ this , is , a , apple ]而用stop输出的效果会是[ a , apple ]到这里就看出二者的区别了stop 不会输出阻止词也就是说它不以为阻止词是一个term)(PS:所谓的阻止词可以明确为脱离符)curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{ "analyzer": "stop", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."}'输出[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]stop 接受以下参数:stopwords : 一个预界说的阻止词列表(好比_englisht_)或者是一个包括阻止词的列表默认是 _english_stopwords_path : 包括阻止词的文件路径这个路径是相关于Elasticsearch的config目录的一个路径curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'{ "settings": { "analysis": { "analyzer": { "my_stop_analyzer": { "type": "stop", "stopwords": ["the", "over"] } } } }}'上面设置了一个stop剖析器它的阻止词有两个:the 和 overcurl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'{ "analyzer": "my_stop_analyzer", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."}'基于以上设置这个请求输入会是这样的:[ quick, brown, foxes, jumped, lazy, dog, s, bone ]Pattern Analyzercurl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{ "analyzer": "pattern", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."}'由于默认凭证非单词字符支解因此输出会是这样的:[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]pattern 剖析器接受如下参数:pattern : 一个Java正则表达式默认 \W+flags : Java正则表达式flags好比:CASE_INSENSITIVE 、COMMENTSlowercase : 是否将terms所有转成小写默认truestopwords : 一个预界说的阻止词列表或者包括阻止词的一个列表默认是 _none_stopwords_path : 阻止词文件路径curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'{ "settings": { "analysis": { "analyzer": { "my_email_analyzer": { "type": "pattern", "pattern": "\\W|_", "lowercase": true } } } }}'上面的例子中设置了凭证非单词字符或者下划线支解并且输出的term都是小写curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'{ "analyzer": "my_email_analyzer", "text": "John_Smith@foo-bar.com"}'因此基于以上设置本例输出如下:[ john, smith, foo, bar, com ]Language Analyzers支持差别语言情形下的文天职析内置(预界说)的语言有:arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai7 中文分词器:ik分词装置和安排ik下载地址:https://github.com/medcl/elasticsearch-analysis-ikGithub加速器:https://github.com/fhefh2015/Fast-GitHub建设插件文件夹 cd your-es-root/plugins/ && mkdir ik将插件解压缩到文件夹 your-es-root/plugins/ik重新启动esIK文件形貌IKAnalyzer.cfg.xml:IK分词设置文件主词库:main.dic英文停用词:stopword.dic不会建设在倒排索引中特殊词库:quantifier.dic:特殊词库:计量单位等suffix.dic:特殊词库:行政单位surname.dic:特殊词库:百家姓preposition:特殊词库:语气词自界说词库:网络词汇、盛行词、自造词等ik提供的两种analyzer:ik_max_word会将文本做最细粒度的拆分好比会将中华人民共和国国歌拆分为中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌会穷尽种种可能的组合适合 Term Query;ik_smart: 会做最粗粒度的拆分好比会将中华人民共和国国歌拆分为中华人民共和国,国歌适合 Phrase 盘问热更新远程词库文件优点:上手简朴弱点:词库的管理不利便要操作直接操作磁盘文件检索页很贫困文件的读写没有专门的优化性能欠很多多少一层接口挪用和网络传输ik会见数据库MySQL驱动版本兼容性https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-versions.htmlhttps://dev.mysql.com/doc/connector-j/5.1/en/connector-j-versions.html驱动下载地址https://mvnrepository.com/artifact/mysql/mysql-connector-java演示下载装置:扩展词库:重启es后生效=》本文来自博客园作者:孙龙-程序员转载请注明原文链接:https://www.cnblogs.com/sunlong88/p/17093708.html2025-06-06 15:48·天中微人

编剧:
更新:

2025-10-09 20:06:50

备注:
国语
评价:
穿越之休妻翻身记免费无弹窗 | 穿越之休妻翻身记最新...
首页
影戏
一连剧
综艺
动漫
APP
网站地图