Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java. This post records what I learned from this Geek Time (geektime) course.
The recommendation for learning Elasticsearch.
Getting started with elasticsearch
Course GitHub repository: https://github.com/geektime-geekbang/geektime-ELK
Elasticsearch https://www.elastic.co/
- Elasticsearch Certification https://www.elastic.co/cn/training/certification
- Elasticsearch Engineer I training https://www.elastic.co/cn/training/elasticsearch-engineer-1
- Elasticsearch Engineer II training https://www.elastic.co/cn/training/elasticsearch-engineer-2
Install Elasticsearch
For installation, refer to the official docs.
Since version 7.0, Elasticsearch ships with a bundled JDK, so you don't need to install Java yourself.
directory | config setting | description |
---|---|---|
bin | | executable scripts |
config | elasticsearch.yml | cluster configuration |
jdk | | bundled Java runtime |
data | path.data | data files |
lib | | Java libraries |
logs | path.logs | log files |
modules | | all bundled Elasticsearch modules |
plugins | | all installed plugins |
# Adjust the JVM settings in config/jvm.options
Config recommendations
- Set Xms equal to Xmx
- Set Xmx to no more than 50% of the machine's memory
- Keep Xmx no larger than 30 GB; see https://www.elastic.co/blog/a-heap-of-trouble
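The three recommendations above can be turned into a small calculation. The Python sketch below is only an illustration: `recommended_heap_gb` and `jvm_options` are hypothetical helper names, and rounding down to whole gigabytes is an assumption made here for simplicity.

```python
from typing import List


def recommended_heap_gb(total_ram_gb: float, cap_gb: float = 30.0) -> float:
    """Heap size in GB: at most half of RAM and at most cap_gb."""
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return min(total_ram_gb / 2, cap_gb)


def jvm_options(total_ram_gb: float) -> List[str]:
    """Build -Xms/-Xmx lines with identical values (Xms equal to Xmx)."""
    # Round down to whole gigabytes, but never go below 1 GB.
    heap = max(int(recommended_heap_gb(total_ram_gb)), 1)
    return ["-Xms{}g".format(heap), "-Xmx{}g".format(heap)]


if __name__ == "__main__":
    for ram in (8, 64, 128):
        print(ram, "GB RAM ->", jvm_options(ram))
```

The resulting two lines are what you would place in config/jvm.options; the 30 GB cap only takes effect on machines with more than 60 GB of memory.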
plugins
Run multiple nodes on the same machine
bin/elasticsearch -E node.name=node0 -E cluster.name=geektime -E path.data=node0_data -d
docker
Use docker-compose to start Cerebro, Kibana, and two Elasticsearch nodes
version: '2.2'
logstash
Make sure Elasticsearch, Logstash, and Kibana all run the same version.
At the time of writing the latest version is 7.6, but this course uses 7.1, so download Logstash 7.1.0 from https://www.elastic.co/downloads/past-releases/logstash-7-1-0
# Download Logstash (7.1.0); on macOS you also need to install Java
- Download the MovieLens data set: https://grouplens.org/datasets/movielens/
- Logstash download: https://www.elastic.co/cn/downloads/logstash
- Logstash docs: https://www.elastic.co/guide/en/logstash/current/index.html
Elasticsearch basics
A document is the smallest unit of data in Elasticsearch. Documents are serialized as JSON and stored in an index. Every document has a unique ID.
- Multiple types https://www.elastic.co/cn/blog/moving-from-types-to-typeless-apis-in-elasticsearch-7-0
- CAT Index API https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-indices.html
Index-related APIs
# Check index information
GET kibana_sample_data_ecommerce
# Check the total number of documents in the index
GET kibana_sample_data_ecommerce/_count
# Look at the first 10 documents to learn the document format
POST kibana_sample_data_ecommerce/_search
{
}
# _cat indices API
# List indices
GET /_cat/indices/kibana*?v&s=index
# List indices whose health is green
GET /_cat/indices?v&health=green
# Sort by document count
GET /_cat/indices?v&s=docs.count:desc
# Show specific columns
GET /_cat/indices/kibana*?pri&v&h=health,index,pri,rep,docs.count,mt
# How much memory is used per index?
GET /_cat/indices?v&h=i,tm&s=tm:desc
Node & Shard
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html
- Master-eligible node: A node that has node.master set to true (default), which makes it eligible to be elected as the master node, which controls the cluster.
- Data node: A node that has node.data set to true (default). Data nodes hold data and perform data related operations such as CRUD, search, and aggregations.
- Ingest node: A node that has node.ingest set to true (default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich the document before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes as node.ingest: false.
- Machine learning node: A node that has xpack.ml.enabled and node.ml set to true, which is the default behavior in the Elasticsearch default distribution. If you want to use machine learning features, there must be at least one machine learning node in your cluster. For more information about machine learning features, see Machine learning in the Elastic Stack.
https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html
- primary shard: Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard. By default, an index has one primary shard. You can specify more primary shards to scale the number of documents that your index can handle. You cannot change the number of primary shards in an index, once the index is created. However, an index can be split into a new index using the split API. See also routing
- replica shard: Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes:
  - increase failover: a replica shard can be promoted to a primary shard if the primary fails
  - increase performance: get and search requests can be handled by primary or replica shards.
  By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.
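One way to see why the number of primary shards is fixed once the index is created: a document is routed to a shard by hashing its routing value (the document ID by default) and taking the remainder modulo the number of primary shards. The Python sketch below is a simplified model of that idea, not Elasticsearch's implementation. `shard_for` is a hypothetical helper, and `zlib.crc32` is only a stand-in for the hash function Elasticsearch actually uses.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard number from a routing value (stand-in hash, see lead-in)."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# Route 1000 made-up document IDs with 3 primary shards, then with 4.
ids = [str(i) for i in range(1000)]
before = {doc_id: shard_for(doc_id, 3) for doc_id in ids}
after = {doc_id: shard_for(doc_id, 4) for doc_id in ids}

# Count documents whose shard number changes when the shard count changes.
moved = sum(1 for doc_id in ids if before[doc_id] != after[doc_id])
print("{} of {} documents would map to a different shard".format(moved, len(ids)))
```

Because most documents would land on a different shard after the change, lookups by ID would go to the wrong place, which is why growing an index means creating a new one (for example with the split API).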
cluster-health
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
GET _cat/nodes?v
- CAT Nodes API https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-nodes.html
- Cluster API https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster.html
- CAT Shards API https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html
CRUD
- Document API https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html
############ Create Document ############
# Create a document; the _id is generated automatically
POST users/_doc
{
"user" : "Mike",
"post_date" : "2019-04-15T14:12:12",
"message" : "trying out Kibana"
}
# Create a document with an explicit ID; fails if the ID already exists
PUT users/_doc/1?op_type=create
{
"user" : "Jack",
"post_date" : "2019-05-15T14:12:12",
"message" : "trying out Elasticsearch"
}
# Create a document with an explicit ID; fails if the ID already exists
PUT users/_create/1
{
"user" : "Jack",
"post_date" : "2019-05-15T14:12:12",
"message" : "trying out Elasticsearch"
}
### Get Document by ID
#Get the document by ID
GET users/_doc/1
### Index & Update
# Index by ID (the existing document is deleted first, then written again)
GET users/_doc/1
PUT users/_doc/1
{
"user" : "Mike"
}
#GET users/_doc/1
# Add fields to the existing document
POST users/_update/1
{
"doc":{
"post_date" : "2019-05-15T14:12:12",
"message" : "trying out Elasticsearch"
}
}
### Delete by Id
# Delete the document
DELETE users/_doc/1
### Bulk operations
# Run twice and compare the result of each run
# First run
POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test2", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
# Second run
POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test2", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
### mget operations
GET /_mget
{
"docs" : [
{
"_index" : "test",
"_id" : "1"
},
{
"_index" : "test",
"_id" : "2"
}
]
}
# Specify the index in the URI
GET /test/_mget
{
"docs" : [
{
"_id" : "1"
},
{
"_id" : "2"
}
]
}
GET /_mget
{
"docs" : [
{
"_index" : "test",
"_id" : "1",
"_source" : false
},
{
"_index" : "test",
"_id" : "2",
"_source" : ["field3", "field4"]
},
{
"_index" : "test",
"_id" : "3",
"_source" : {
"include": ["user"],
"exclude": ["user.location"]
}
}
]
}
### msearch operations
POST kibana_sample_data_ecommerce/_msearch
{}
{"query" : {"match_all" : {}},"size":1}
{"index" : "kibana_sample_data_flights"}
{"query" : {"match_all" : {}},"size":2}
### Clean up test data
# Delete data
DELETE users
DELETE test
DELETE test2
inverted-index
- https://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95
- https://www.elastic.co/guide/cn/elasticsearch/guide/current/inverted-index.html
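As a rough illustration of what the links above describe, an inverted index maps each term to the list of documents that contain it. The Python sketch below is a toy, not how Lucene stores postings: lowercasing and whitespace splitting stand in for a real analyzer, and the sample titles and the `build_inverted_index` helper are made up for this example.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each lowercased term to the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Made-up sample documents: document ID -> title.
docs = {
    1: "Mastering Elasticsearch",
    2: "Elasticsearch Server",
    3: "Elasticsearch Essentials",
}
index = build_inverted_index(docs)

# Looking up a term is now a dictionary access instead of a scan over all documents.
print(index["elasticsearch"])
print(index["server"])
```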
Demo
POST _analyze
analyzer
Demo for practice
- https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
URI Search
- https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
- https://searchenginewatch.com/sew/news/2065080/search-engines-101
- https://www.huffpost.com/entry/search-engines-101-part-i_b_1104525
- https://www.entrepreneur.com/article/176398
- https://www.searchtechnologies.com/meaning-of-relevancy
Syntax | search scope |
---|---|
/_search | all indices in the cluster |
/index1/_search | only index1 |
/index1,index2/_search | index1 and index2 |
/index*/_search | all indices whose names start with "index" (wildcard) |
#URI Query
- https://www.elastic.co/guide/en/elasticsearch/reference/current/search-uri-request.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
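parameter / syntax | function |
---|---|
q | the query, in query string syntax |
df | default field, used when the query names no field |
sort | sort order |
from, size | pagination |
profile | show how the query was executed |
A B | A OR B |
"A B" | phrase query: A followed by B, in that order |
title:(A AND B) | title contains both A and B |
# Basic query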
Request Body
- https://www.elastic.co/guide/en/elasticsearch/reference/current/search-uri-request.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
# ignore_unavailable=true suppresses the error caused by trying to access the nonexistent index "404_idx"
# Query movies with pagination
POST /movies,404_idx/_search?ignore_unavailable=true
{
"profile": true,
"query": {
"match_all": {}
}
}
POST /kibana_sample_data_ecommerce/_search
{
"from":10,
"size":20,
"query":{
"match_all": {}
}
}
# Sort by date
POST kibana_sample_data_ecommerce/_search
{
"sort":[{"order_date":"desc"}],
"query":{
"match_all": {}
}
}
#source filtering
POST kibana_sample_data_ecommerce/_search
{
"_source":["order_date"],
"query":{
"match_all": {}
}
}
# Script field (Painless script)
GET kibana_sample_data_ecommerce/_search
{
"script_fields": {
"new_field": {
"script": {
"lang": "painless",
"source": "doc['order_date'].value+'hello'"
}
}
},
"query": {
"match_all": {}
}
}
POST movies/_search
{
"query": {
"match": {
"title": "last christmas"
}
}
}
POST movies/_search
{
"query": {
"match": {
"title": {
"query": "last christmas",
"operator": "and"
}
}
}
}
POST movies/_search
{
"query": {
"match_phrase": {
"title":{
"query": "one love"
}
}
}
}
# slop relaxes the phrase match: with slop 1, one other term may sit between "one" and "love"
POST movies/_search
{
"query": {
"match_phrase": {
"title":{
"query": "one love",
"slop": 1
}
}
}
}
Simple Query String
- Does not support the AND, OR, and NOT keywords; they are treated as plain search terms
- Supports + for AND, | for OR, and - for NOT
PUT /users/_doc/1
Dynamic Mapping
- https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-field-mapping.html
dynamic setting | true | false | strict |
---|---|---|---|
documents with new fields can be indexed | YES | YES | NO |
new fields are indexed (searchable) | YES | NO | NO |
mapping is updated | YES | NO | NO |
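The table can be read as a small decision procedure. The Python sketch below is a toy model of the three `dynamic` values, written only to make the rows concrete. `index_document` is a hypothetical function, and using the Python type name as the "mapped type" is an assumption that has nothing to do with Elasticsearch's real type detection.

```python
from typing import Dict, Tuple


def index_document(mapping: Dict[str, str], doc: Dict[str, object],
                   dynamic: str) -> Tuple[bool, Dict[str, str]]:
    """Return (accepted, mapping_after) for a toy model of the dynamic setting."""
    if dynamic not in ("true", "false", "strict"):
        raise ValueError("dynamic must be 'true', 'false' or 'strict'")
    new_fields = [name for name in doc if name not in mapping]
    if dynamic == "strict" and new_fields:
        # strict: a document that carries an unmapped field is rejected.
        return False, dict(mapping)
    if dynamic == "true":
        # true: new fields are added to the mapping and become searchable.
        updated = dict(mapping)
        for name in new_fields:
            updated[name] = type(doc[name]).__name__  # crude stand-in for type detection
        return True, updated
    # false (or strict without new fields): accepted, mapping unchanged;
    # any new field is kept in the stored document but is not searchable.
    return True, dict(mapping)


mapping = {"user": "str"}
doc = {"user": "Mike", "age": 30}
for setting in ("true", "false", "strict"):
    print(setting, index_document(mapping, doc, setting))
```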
# Index a document, then check the mapping
Mapping Setting
- Mapping Parameters https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-params.html
index_options value | what is recorded |
---|---|
docs | doc id |
freqs | doc id, term frequencies |
positions | doc id, term frequencies, term positions |
offsets | doc id, term frequencies, term positions, character offsets |
# Set index to false
Analyzer
- character filters: HTML strip, Mapping, Pattern replace
- tokenizer: whitespace / standard / uax_url_email / pattern / keyword / path hierarchy
- token filter: Lowercase / stop / synonym
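The three stages listed above run in order: character filters first, then the tokenizer, then token filters. The Python sketch below imitates that pipeline with one toy component per stage (an HTML strip, a whitespace tokenizer, and lowercase plus stop filters). All function names are hypothetical, the regex is a crude approximation of HTML stripping, and `STOP_WORDS` is a small illustrative subset rather than Elasticsearch's stop word list.

```python
import re
from typing import List

STOP_WORDS = {"a", "an", "the", "of", "in"}  # illustrative subset only


def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text: str) -> List[str]:
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens: List[str]) -> List[str]:
    """Token filter: drop stop words."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text: str) -> List[str]:
    """Run the stages in analyzer order: char filter, tokenizer, token filters."""
    tokens = whitespace_tokenize(strip_html(text))
    return stop_filter(lowercase_filter(tokens))


print(analyze("<b>The</b> Quick Brown fox in the yard"))
```

The order of the token filters matters in this sketch: lowercasing runs before the stop filter, otherwise "The" would not match the lowercase stop word.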
PUT logs/_doc/1
Dynamic Template
- Index Templates https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html
- Dynamic Template https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html
# A numeric string is mapped to text; a date string is mapped to date
PUT ttemplate/_doc/1
{
"someNumber":"1",
"someDate":"2019/01/01"
}
GET ttemplate/_mapping
#Create a default template
PUT _template/template_default
{
"index_patterns": ["*"],
"order" : 0,
"version": 1,
"settings": {
"number_of_shards": 1,
"number_of_replicas":1
}
}
PUT /_template/template_test
{
"index_patterns" : ["test*"],
"order" : 1,
"settings" : {
"number_of_shards": 1,
"number_of_replicas" : 2
},
"mappings" : {
"date_detection": false,
"numeric_detection": true
}
}
# View the template information
GET /_template/template_default
GET /_template/temp*
# Write new data; the index name starts with "test"
PUT testtemplate/_doc/1
{
"someNumber":"1",
"someDate":"2019/01/01"
}
GET testtemplate/_mapping
GET testtemplate/_settings
PUT testmy
{
"settings":{
"number_of_replicas":5
}
}
PUT testmy/_doc/1
{
"key":"value"
}
GET testmy/_settings
DELETE testmy
DELETE /_template/template_default
DELETE /_template/template_test
# Dynamic mapping based on type and field name
DELETE my_index
PUT my_index/_doc/1
{
"firstName":"Ruan",
"isVIP":"true"
}
GET my_index/_mapping
DELETE my_index
PUT my_index
{
"mappings": {
"dynamic_templates": [
{
"strings_as_boolean": {
"match_mapping_type": "string",
"match":"is*",
"mapping": {
"type": "boolean"
}
}
},
{
"strings_as_keywords": {
"match_mapping_type": "string",
"mapping": {
"type": "keyword"
}
}
}
]
}
}
DELETE my_index
# Combined with path matching
PUT my_index
{
"mappings": {
"dynamic_templates": [
{
"full_name": {
"path_match": "name.*",
"path_unmatch": "*.middle",
"mapping": {
"type": "text",
"copy_to": "full_name"
}
}
}
]
}
}
PUT my_index/_doc/1
{
"name": {
"first": "John",
"middle": "Winston",
"last": "Lennon"
}
}
GET my_index/_search?q=full_name:John
aggregations
- Bucket Aggregation: A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, all the buckets criteria are evaluated on every document in the context and when a criterion matches, the document is considered to “fall in” the relevant bucket. By the end of the aggregation process, we’ll end up with a list of buckets - each one with a set of documents that “belong” to it.
- Metric Aggregation: Aggregations that keep track and compute metrics over a set of documents.
- Pipeline Aggregation: Aggregations that aggregate the output of other aggregations and their associated metrics
- Matrix Aggregation: A family of aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields. Unlike metric and bucket aggregations, this aggregation family does not yet support scripting.
# Bucket and count by destination
Chapter 3
Term queries & full-text queries
- https://www.elastic.co/guide/en/elasticsearch/reference/current/term-level-queries.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
DELETE products
structured search
- https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/term-level-queries.html
# Structured search, exact match
relevance
- TF = term frequency
- IDF = inverse document frequency
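To make the two terms concrete, here is a textbook TF-IDF calculation in Python. It is a simplified sketch: the formulas are the classic ones, not Lucene's exact scoring (recent Elasticsearch versions score with BM25 by default), and the tiny corpus and the function names are made up for this example.

```python
import math
from typing import List


def tf(term: str, doc_tokens: List[str]) -> float:
    """Term frequency: share of the document's tokens that equal the term."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term: str, corpus: List[List[str]]) -> float:
    """Inverse document frequency: log(N / df); 0.0 if no document has the term."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return math.log(len(corpus) / df)


def tf_idf(term: str, doc_tokens: List[str], corpus: List[List[str]]) -> float:
    """Score of a term in one document: frequent here, rare elsewhere scores high."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Made-up corpus of already-tokenized documents.
corpus = [
    ["quick", "brown", "fox"],
    ["quick", "dog"],
    ["lazy", "dog", "dog"],
]

# "fox" appears in one document, "quick" in two, so "fox" is the more telling term.
print(tf_idf("fox", corpus[0], corpus))
print(tf_idf("quick", corpus[0], corpus))
```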
PUT testscore
Query&Filtering
- https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-boosting-query.html
Demo
POST /products/_bulk
Disjunction max query
Demo
PUT /blogs/_doc/1
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html
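As described in the dis_max documentation linked above, the final score is the best matching clause's score plus `tie_breaker` times the scores of the other matching clauses. The Python sketch below reproduces only that arithmetic. `dis_max_score` is a hypothetical helper and the clause scores are made-up numbers, not real Elasticsearch output.

```python
from typing import Iterable


def dis_max_score(clause_scores: Iterable[float], tie_breaker: float = 0.0) -> float:
    """Best clause score plus tie_breaker times the other matching clauses."""
    matching = [score for score in clause_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Made-up per-field scores for one document: title scored 1.0, body scored 0.5.
print(dis_max_score([1.0, 0.5]))                   # only the best field counts
print(dis_max_score([1.0, 0.5], tie_breaker=0.2))  # the other field contributes a little
```

With `tie_breaker` at 0 the query behaves as pure "best field wins"; at 1 it becomes a plain sum of all matching clauses.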
Best Fields
POST blogs/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
],
"tie_breaker": 0.2
}
}
}
POST blogs/_search
{
"query": {
"multi_match": {
"type": "best_fields",
"query": "Quick pets",
"fields": ["title","body"],
"tie_breaker": 0.2,
"minimum_should_match": "20%"
}
}
}
POST books/_search
{
"query": {
"multi_match": {
"query": "Quick brown fox",
"fields": "*_title"
}
}
}
POST books/_search
{
"query": {
"multi_match": {
"query": "Quick brown fox",
"fields": [ "*_title", "chapter_title^2" ]
}
}
}
DELETE /titles
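# Legacy mapping syntax (a mapping type plus "string" fields) from older releases; Elasticsearch 7.x rejects it, so use the typeless mapping in the following request instead
PUT /titles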
{
"settings": { "number_of_shards": 1 },
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "string",
"analyzer": "english",
"fields": {
"std": {
"type": "string",
"analyzer": "standard"
}
}
}
}
}
}
}
PUT /titles
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "english"
}
}
}
}
POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }
GET titles/_search
{
"query": {
"match": {
"title": "barking dogs"
}
}
}
DELETE /titles
PUT /titles
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "english",
"fields": {"std": {"type": "text","analyzer": "standard"}}
}
}
}
}
POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }
GET /titles/_search
{
"query": {
"multi_match": {
"query": "barking dogs",
"type": "most_fields",
"fields": [ "title", "title.std" ]
}
}
}
GET /titles/_search
{
"query": {
"multi_match": {
"query": "barking dogs",
"type": "most_fields",
"fields": [ "title^10", "title.std" ]
}
}
}
multi-language
- Elasticsearch IK analysis plugin https://github.com/medcl/elasticsearch-analysis-ik/releases
- Elasticsearch HanLP analysis plugin https://github.com/KennFalcon/elasticsearch-analysis-hanlp
- Survey of word segmentation algorithms https://zhuanlan.zhihu.com/p/50444885
- NLPIR (Institute of Computing Technology, Chinese Academy of Sciences) http://ictclas.nlpir.org/nlpir/
- ansj tokenizer https://github.com/NLPchina/ansj_seg
- LTP (Harbin Institute of Technology) https://github.com/HIT-SCIR/ltp
- THULAC (Tsinghua University) https://github.com/thunlp/THULAC
- Stanford Word Segmenter https://nlp.stanford.edu/software/segmenter.shtml
- HanLP tokenizer https://github.com/hankcs/HanLP
- Jieba segmentation (cppjieba) https://github.com/yanyiwu/cppjieba
- KCWS tokenizer (character embeddings + Bi-LSTM + CRF) https://github.com/koth/kcws
- ZPar https://github.com/frcchang/zpar/releases
- IKAnalyzer https://github.com/wks/ik-analyzer
# Stop words
tmdb practice
Prerequisites
- Python 2.7.15
- requests
cd tmdb-search
Demo
POST tmdb/_search
Search Template & Index Alias
POST _scripts/tmdb
Function Score Query
Demo
DELETE blogs
Term & Phrase Suggester
DELETE articles
Auto Complete
Practice in Dev Tools
DELETE articles
Cross Cluster Search
- Using cross-cluster search in Kibana https://kelonsoftware.com/cross-cluster-search-kibana/
# Start 3 clusters
split-brain
# Start 3 nodes on one host
node type | config | default |
---|---|---|
master eligible | node.master | true |
data | node.data | true |
ingest | node.ingest | true |
coordinating only | NA | set the three settings above to false |
machine learning | node.ml | true (requires X-Pack) |
query fetch
DELETE message
Doc Values & Fielddata
Demo
# Single-field sort
From, Size, Search_after & Scroll API
- from: offset of the first hit to return
- size: number of hits to return
type | function |
---|---|
Regular | real-time query for the top documents |
Scroll | when you need all documents |
Pagination | from and size; for deep paging, use Search After |
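The difference between from/size and Search After can be shown without a cluster. With Search After, each request passes the sort values of the last hit of the previous page, and the next page starts strictly after them. The Python sketch below simulates that on an in-memory list. The `search` function and the sample data are hypothetical, and the second tuple element plays the role of a unique tiebreaker field.

```python
from typing import List, Optional, Tuple

Hit = Tuple[int, str]  # (sort value, unique ID used as tiebreaker)


def search(docs: List[Hit], size: int,
           search_after: Optional[Hit] = None) -> List[Hit]:
    """Return the next `size` hits in sort order, strictly after `search_after`."""
    ordered = sorted(docs)
    if search_after is not None:
        ordered = [hit for hit in ordered if hit > search_after]
    return ordered[:size]


# Made-up documents; two of them share the sort value 20.
docs = [(30, "a"), (10, "b"), (20, "c"), (20, "d"), (40, "e")]

page1 = search(docs, size=2)
page2 = search(docs, size=2, search_after=page1[-1])
page3 = search(docs, size=2, search_after=page2[-1])
print(page1, page2, page3)
```

The unique tiebreaker matters: without it, the two documents with sort value 20 could be skipped or repeated across pages. Unlike from/size, no request has to collect and discard all the hits before the requested page.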
# Result window is too large (from + size exceeds 10000)
Concurrent Control
- Optimistic locking
DELETE products
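Optimistic locking means a write succeeds only if the document has not changed since it was read. In Elasticsearch 7.x this is expressed with the `if_seq_no` and `if_primary_term` request parameters. The Python sketch below models that check with an in-memory store. `ToyIndex` and `VersionConflict` are hypothetical names, and the fixed primary term of 1 is a simplification.

```python
from typing import Dict, Optional


class VersionConflict(Exception):
    """Raised when the caller's expected sequence number or term is stale."""


class ToyIndex:
    """In-memory stand-in for an index that tracks _seq_no and _primary_term."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}
        self._next_seq_no = 0
        self._primary_term = 1  # simplification: the primary never changes

    def index(self, doc_id: str, source: dict,
              if_seq_no: Optional[int] = None,
              if_primary_term: Optional[int] = None) -> dict:
        current = self._docs.get(doc_id)
        if if_seq_no is not None:
            # Conditional write: reject unless the stored document still has
            # exactly the sequence number and term the caller last saw.
            if (current is None
                    or current["_seq_no"] != if_seq_no
                    or current["_primary_term"] != if_primary_term):
                raise VersionConflict("document {} was modified".format(doc_id))
        meta = {"_seq_no": self._next_seq_no, "_primary_term": self._primary_term}
        self._next_seq_no += 1
        self._docs[doc_id] = dict(meta, _source=dict(source))
        return meta


idx = ToyIndex()
first = idx.index("1", {"title": "iphone", "count": 100})
# Two clients read the same version; the first conditional write wins.
idx.index("1", {"title": "iphone", "count": 99},
          if_seq_no=first["_seq_no"], if_primary_term=first["_primary_term"])
try:
    idx.index("1", {"title": "iphone", "count": 98},
              if_seq_no=first["_seq_no"], if_primary_term=first["_primary_term"])
except VersionConflict as err:
    print("conflict:", err)
```

The losing client is expected to re-read the document, reapply its change, and retry with the new sequence number.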