12. Aggregation

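The aggregation option lets you compute statistics over the documents that match a query.

Before describing the aggregation options and how to use them, let's load some test data.

Attachment: orders-bulk.json (download)

POST /order/default/_bulk

<paste the contents of orders-bulk.json here>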
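If you want to generate your own test data instead of pasting the attachment, the following Python sketch shows the general shape of a _bulk request body: one action line followed by one source line per document, with a trailing newline. The sample orders and the helper name build_bulk_body are made up for illustration; the actual contents of orders-bulk.json may differ.

```python
import json

def build_bulk_body(docs):
    """Build a _bulk body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        # Empty action metadata: index and type are taken from the request URL.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical sample orders (not taken from orders-bulk.json).
sample_orders = [
    {"status": "processed", "total_amount": 120.5, "sales_channel": "store"},
    {"status": "pending", "total_amount": 35.0, "sales_channel": "web"},
]

bulk_body = build_bulk_body(sample_orders)
print(bulk_body)
```

The printed text can be pasted under POST /order/default/_bulk in the same way as the attachment.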

GET /order/default/_mapping

{
  "order" : {
    "mappings" : {
      "default" : {
        "properties" : {
          "lines" : {
            "properties" : {
              "amount" : { "type" : "float" },
              "product_id" : { "type" : "long" },
              "quantity" : { "type" : "long" }
            }
          },
          "purchased_at" : { "type" : "date" },
          "sales_channel" : {
            "type" : "text",
            "fields" : {
              "keyword" : { "type" : "keyword", "ignore_above" : 256 }
            }
          },
          "salesman" : {
            "properties" : {
              "id" : { "type" : "long" },
              "name" : {
                "type" : "text",
                "fields" : {
                  "keyword" : { "type" : "keyword", "ignore_above" : 256 }
                }
              }
            }
          },
          "status" : {
            "type" : "text",
            "fields" : {
              "keyword" : { "type" : "keyword", "ignore_above" : 256 }
            }
          },
          "total_amount" : { "type" : "float" }
        }
      }
    }
  }
}

Metric Aggregations

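Metric aggregations compute arithmetic results over a field, such as the maximum, minimum, or average.

For example, the sum, avg, min, and max of the total_amount field can be requested as follows. Each result appears in the response under the aggregation name you chose (total_sales, avg_sale, and so on).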

GET /order/default/_search
{
  "size": 0,
  "aggs": {
    "total_sales": {
      "sum": { "field": "total_amount" }
    },
    "avg_sale": {
      "avg": { "field": "total_amount" }
    },
    "min_sale": {
      "min": { "field": "total_amount" }
    },
    "max_sale": {
      "max": { "field": "total_amount" }
    }
  }
}

{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "max_sale" : { "value" : 281.7699890136719 },
    "avg_sale" : { "value" : 109.20960997009277 },
    "min_sale" : { "value" : 10.270000457763672 },
    "total_sales" : { "value" : 109209.60997009277 }
  }
}
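The long decimals in the response (for example 10.270000457763672) come from total_amount being mapped as a 32-bit float: the stored value is the nearest single-precision number, and the aggregation reports it widened to a double. The Python sketch below imitates that storage step and then computes the same four metrics. The input amounts are hypothetical, and as_float32 is a helper invented for this illustration.

```python
import struct

def as_float32(x: float) -> float:
    """Round-trip a value through 32-bit storage, as a `float` field would."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Hypothetical total_amount values (not taken from orders-bulk.json).
amounts = [as_float32(v) for v in (10.27, 281.77, 35.5)]

# Same aggregation names as in the query above.
metrics = {
    "total_sales": sum(amounts),
    "avg_sale": sum(amounts) / len(amounts),
    "min_sale": min(amounts),
    "max_sale": max(amounts),
}

for name, value in metrics.items():
    print(name, value)
```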

Bucket Aggregations

With bucket aggregations, Elasticsearch creates buckets, which are groups of documents.

In the figure, two buckets are shown; in SQL terms this is the same idea as GROUP BY.

As a simple example, let's create buckets split by the status value.

Running the query below creates five buckets, one per status value (processed, pending, ...), and reports how many documents were counted in each.

size is set to 0 so that only the aggregation results are returned.

GET /order/default/_search
{
  "size": 0,
  "aggs": {
    "status_terms": {
      "terms": {
        "field": "status.keyword"
      }
    }
  }
}

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "status_terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "processed", "doc_count" : 209 },
        { "key" : "completed", "doc_count" : 204 },
        { "key" : "pending", "doc_count" : 199 },
        { "key" : "cancelled", "doc_count" : 196 },
        { "key" : "confirmed", "doc_count" : 192 }
      ]
    }
  }
}
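To make the GROUP BY analogy concrete, here is a small Python sketch that groups values and counts them the way the terms aggregation above does, most frequent first. The status list is hypothetical, and breaking ties by key is simply a choice made for this sketch.

```python
from collections import Counter

# Hypothetical status values, one per order document.
statuses = ["processed", "pending", "processed", "completed", "pending", "processed"]

def terms_buckets(values):
    """Group equal values and count them, like GROUP BY value with COUNT(*)."""
    counts = Counter(values)
    # Highest doc_count first; ties are ordered by key in this sketch.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered]

buckets = terms_buckets(statuses)
for bucket in buckets:
    print(bucket)
```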

sum_other_doc_count

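sum_other_doc_count is the number of documents that did not fall into any of the returned buckets (a terms aggregation returns only the top 10 buckets by default).

In the example above every document landed in a returned bucket, so the count is 0.

If you bucket by total_amount instead, the values are so varied that most documents fall outside the returned buckets, as the result below shows.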

GET /order/default/_search
{
  "size": 0,
  "aggs": {
    "status_terms": {
      "terms": {
        "field": "total_amount"
      }
    }
  }
}

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "status_terms" : {
      "doc_count_error_upper_bound" : 5,
      "sum_other_doc_count" : 986,
      "buckets" : [
        {
          "key" : 23.770000457763672,
          "doc_count" : 2
        },
        ...
      ]
    }
  }
}
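The relationship between the returned buckets and sum_other_doc_count can be sketched in Python as follows. This is a single-shard simplification with made-up values; terms_with_other is a hypothetical helper, and its size parameter plays the role of the terms aggregation's size (10 by default).

```python
from collections import Counter

def terms_with_other(values, size=10):
    """Return the top `size` buckets plus the count of documents left out of them."""
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ordered[:size]
    returned = sum(n for _, n in top)
    return {
        "sum_other_doc_count": len(values) - returned,
        "buckets": [{"key": key, "doc_count": n} for key, n in top],
    }

# Few distinct values: everything fits in the returned buckets.
few_statuses = ["processed", "pending", "processed"]
# Many distinct values: most documents end up outside the top 10 buckets.
many_amounts = [float(i) for i in range(50)] + [23.77, 23.77]

print(terms_with_other(few_statuses)["sum_other_doc_count"])
print(terms_with_other(many_amounts)["sum_other_doc_count"])
```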

sort

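To sort the buckets, use the order option. The example below sorts by bucket key in ascending order. (_term is deprecated since Elasticsearch 6.0 in favor of _key.)

GET /order/default/_search
{
  "size": 0,
  "aggs": {
    "status_terms": {
      "terms": {
        "field": "status.keyword",
        "order": {
          "_term": "asc"
        }
      }
    }
  }
}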
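As a rough picture of what ordering by key does, the Python sketch below counts hypothetical status values and then orders the buckets by key rather than by doc_count. The helper terms_sorted_by_key and its direction argument are invented for this illustration.

```python
from collections import Counter

# Hypothetical status values, one per order document.
statuses = ["processed", "pending", "cancelled", "processed", "completed"]

def terms_sorted_by_key(values, direction="asc"):
    """Count values, then order the buckets by key instead of by doc_count."""
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda kv: kv[0], reverse=(direction == "desc"))
    return [{"key": key, "doc_count": n} for key, n in ordered]

keys = [bucket["key"] for bucket in terms_sorted_by_key(statuses)]
print(keys)
```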

doc_count_error_upper_bound

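Bucket counts are not always exact.

For example, suppose you want a bucket aggregation over products. The query would probably look like this.

GET /_search
{
  "aggs" : {
    "products" : {
      "terms" : {
        "field" : "product",
        "size" : 3
      }
    }
  }
}

Because of the size setting, this query looks only at the top 3 products on each shard.

Suppose the full set of products is as shown below.

Of these, only the top 3 products on each shard are counted, so the result of the query is as shown below.

As a result, the counts for Product B and Product C differ from what you would get by counting across all documents.

doc_count_error_upper_bound reports an upper bound on this discrepancy: it examines the last row of the data that was not included and returns the total.

In the case above that is row 5 of each shard, so it returns 10 + 10 + 10 = 30. (The Elastic reference linked below describes the value as the sum of the doc counts of the last term returned by each shard.)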
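The per-shard behaviour can be simulated in a few lines of Python. This sketch follows the definition in the Elastic reference (sum, over shards, of the doc count of the last term each shard returned) and assumes each shard returns exactly size terms, whereas Elasticsearch by default asks each shard for more. The shard contents are hypothetical and are not the numbers from the figures in this post.

```python
from collections import Counter

def shard_top_terms(counts, shard_size):
    """Top `shard_size` (term, count) pairs of one shard, highest count first."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:shard_size]

def merge_terms(shards, size):
    """Merge per-shard top terms and compute the worst-case count error."""
    merged = Counter()
    error_upper_bound = 0
    for counts in shards:
        top = shard_top_terms(counts, size)
        for term, n in top:
            merged[term] += n
        # A term the shard left out has at most as many docs as its last returned term.
        if len(counts) > size:
            error_upper_bound += top[-1][1]
    buckets = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return buckets, error_upper_bound

# Hypothetical per-shard product counts.
shards = [
    {"A": 25, "B": 18, "C": 6, "D": 3},
    {"A": 30, "C": 25, "B": 17, "D": 16},
    {"A": 45, "B": 44, "D": 36, "C": 30},
]

buckets, bound = merge_terms(shards, size=3)
true_d = sum(shard.get("D", 0) for shard in shards)
print(buckets, bound, true_d)
```

Here Product D is reported with 36 documents although 55 exist across the shards, and the difference stays within the computed bound.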

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

Nested Aggregation

Metric aggregations and bucket aggregations can be combined in a single query.

For example, with a nested aggregation like the one below, each bucket produced by the bucket aggregation carries its own metric aggregation result.

GET /order/default/_search
{
  "size": 0,
  "aggs": {
    "status_terms": {
      "terms": {
        "field": "status.keyword"
      },
      "aggs": {
        "status_stats": {
          "stats": {
            "field": "total_amount"
          }
        }
      }
    }
  }
}

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "status_terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "processed",
          "doc_count" : 209,
          "status_stats" : {
            "count" : 209,
            "min" : 10.270000457763672,
            "max" : 281.7699890136719,
            "avg" : 109.30703350231408,
            "sum" : 22845.170001983643
          }
        },
        {
          "key" : "completed",
          "doc_count" : 204,
          "status_stats" : {
            "count" : 204,
            "min" : 10.930000305175781,
            "max" : 260.5899963378906,
            "avg" : 113.54058812178818,
            "sum" : 23162.279976844788
          }
        },
        ...
      ]
    }
  }
}

This time, let's run a nested aggregation together with a query.

This is generally the most commonly used form.

With the range query applied, total hits drops from 1000 to 489.

GET /order/default/_search
{
  "size": 0,
  "query": {
    "range": {
      "total_amount": {
        "gte": 100
      }
    }
  },
  "aggs": {
    "status_terms": {
      "terms": {
        "field": "status.keyword"
      },
      "aggs": {
        "status_stats": {
          "stats": {
            "field": "total_amount"
          }
        }
      }
    }
  }
}

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 489,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "status_terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "pending",
          "doc_count" : 110,
          "status_stats" : {
            "count" : 110,
            "min" : 100.06999969482422,
            "max" : 260.0299987792969,
            "avg" : 159.29090909090908,
            "sum" : 17522.0
          }
        },
        {
          "key" : "completed",
          "doc_count" : 103,
          "status_stats" : {
            "count" : 103,
            "min" : 103.37999725341797,
            "max" : 260.5899963378906,
            "avg" : 162.43087338938295,
            "sum" : 16730.379959106445
          }
        },
        {
          "key" : "processed",
          "doc_count" : 103,
          "status_stats" : {
            "count" : 103,
            "min" : 100.83000183105469,
            "max" : 281.7699890136719,
            "avg" : 155.72310690277988,
            "sum" : 16039.480010986328
          }
        },
        {
          "key" : "cancelled",
          "doc_count" : 96,
          "status_stats" : {
            "count" : 96,
            "min" : 100.05000305175781,
            "max" : 272.8999938964844,
            "avg" : 152.56229201952615,
            "sum" : 14645.980033874512
          }
        },
        {
          "key" : "confirmed",
          "doc_count" : 77,
          "status_stats" : {
            "count" : 77,
            "min" : 100.9800033569336,
            "max" : 246.88999938964844,
            "avg" : 155.78025946679054,
            "sum" : 11995.079978942871
          }
        }
      ]
    }
  }
}
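The order of operations in this request (query first, then buckets, then per-bucket stats) can be sketched in Python as below. The orders are hypothetical, and stats_by_status with its gte argument is a helper invented to mirror the range query and the stats sub-aggregation.

```python
from collections import defaultdict

# Hypothetical orders (not taken from orders-bulk.json).
orders = [
    {"status": "pending", "total_amount": 120.0},
    {"status": "pending", "total_amount": 40.0},
    {"status": "completed", "total_amount": 200.0},
    {"status": "completed", "total_amount": 100.0},
]

def stats_by_status(docs, gte=None):
    """Apply the query, bucket the hits by status, then compute stats per bucket."""
    hits = [d for d in docs if gte is None or d["total_amount"] >= gte]
    groups = defaultdict(list)
    for doc in hits:
        groups[doc["status"]].append(doc["total_amount"])
    buckets = {}
    for status, amounts in groups.items():
        buckets[status] = {
            "count": len(amounts),
            "min": min(amounts),
            "max": max(amounts),
            "avg": sum(amounts) / len(amounts),
            "sum": sum(amounts),
        }
    return len(hits), buckets

total_all, all_buckets = stats_by_status(orders)
total_hits, buckets = stats_by_status(orders, gte=100)
print(total_all, total_hits, buckets)
```

Because the query runs first, both the hit total and every bucket's statistics shrink to the matching documents.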

Filter Aggregation

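Instead of aggregating the results of a query as above, you can also go the other way and filter inside the aggregation itself. The search still matches every document (total hits stays at 1000), and only the documents that pass the filter are aggregated.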

GET /order/default/_search
{
  "size": 0,
  "aggs": {
    "low_value": {
      "filter": {
        "range": {
          "total_amount": {
            "lte": 50
          }
        }
      },
      "aggs": {
        "avg_amount": {
          "avg": {
            "field": "total_amount"
          }
        }
      }
    }
  }
}

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "low_value" : {
      "doc_count" : 164,
      "avg_amount" : {
        "value" : 32.59371952894257
      }
    }
  }
}
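A minimal Python sketch of the same idea: the filter narrows only what the aggregation sees, while the hit total still covers all documents. The orders are hypothetical, filter_avg is an invented helper, and returning None for an empty bucket is a choice made for this sketch.

```python
# Hypothetical orders (not taken from orders-bulk.json).
orders = [
    {"total_amount": 20.0},
    {"total_amount": 40.0},
    {"total_amount": 50.0},
    {"total_amount": 180.0},
]

def filter_avg(docs, lte):
    """Average total_amount over the documents that pass the filter only."""
    matched = [d["total_amount"] for d in docs if d["total_amount"] <= lte]
    avg = sum(matched) / len(matched) if matched else None
    return {
        "hits_total": len(docs),  # unaffected by the filter aggregation
        "low_value": {"doc_count": len(matched), "avg_amount": avg},
    }

result = filter_avg(orders, lte=50)
print(result)
```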

Range Aggregation

To aggregate over specific ranges, use the range aggregation.

GET /order/default/_search
{
  "size": 0,
  "aggs": {
    "amount_distribution": {
      "range": {
        "field": "total_amount",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100 }
        ]
      }
    }
  }
}

{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "amount_distribution" : {
      "buckets" : [
        {
          "key" : "*-50.0",
          "to" : 50.0,
          "doc_count" : 164
        },
        {
          "key" : "50.0-100.0",
          "from" : 50.0,
          "to" : 100.0,
          "doc_count" : 347
        },
        {
          "key" : "100.0-*",
          "from" : 100.0,
          "doc_count" : 489
        }
      ]
    }
  }
}
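One detail worth spelling out is the boundary handling: in a range aggregation each range includes its from value and excludes its to value, so a document with exactly 50 lands in the 50-100 bucket. The Python sketch below reproduces that rule on hypothetical amounts; range_buckets is an invented helper.

```python
def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = []
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        count = sum(
            1 for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        )
        lo_label = "*" if lo is None else str(float(lo))
        hi_label = "*" if hi is None else str(float(hi))
        buckets.append({"key": f"{lo_label}-{hi_label}", "doc_count": count})
    return buckets

# Hypothetical total_amount values.
amounts = [10.0, 49.99, 50.0, 99.0, 100.0, 250.0]
ranges = [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]

buckets = range_buckets(amounts, ranges)
for bucket in buckets:
    print(bucket)
```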

Alternatively, the histogram aggregation builds the ranges for you at a fixed interval.

It can be used simply as follows.

GET /order/default/_search
{
  "size": 0,
  "aggs": {
    "amount_distribution": {
      "histogram": {
        "field": "total_amount",
        "interval": 50
      }
    }
  }
}

{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "amount_distribution" : {
      "buckets" : [
        { "key" : 0.0, "doc_count" : 164 },
        { "key" : 50.0, "doc_count" : 347 },
        { "key" : 100.0, "doc_count" : 249 },
        { "key" : 150.0, "doc_count" : 160 },
        { "key" : 200.0, "doc_count" : 75 },
        { "key" : 250.0, "doc_count" : 5 }
      ]
    }
  }
}
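Each histogram bucket key is the value rounded down to a multiple of the interval, that is floor(value / interval) * interval. The Python sketch below applies that formula to hypothetical amounts; histogram_buckets is an invented helper, and empty intervals are simply omitted in this sketch.

```python
import math
from collections import Counter

def histogram_buckets(values, interval):
    """Assign each value to the bucket keyed by floor(value / interval) * interval."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return [{"key": float(key), "doc_count": n} for key, n in sorted(counts.items())]

# Hypothetical total_amount values.
amounts = [10.27, 49.0, 50.0, 120.5, 149.99, 281.77]

buckets = histogram_buckets(amounts, 50)
for bucket in buckets:
    print(bucket)
```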

For other options, see the reference below.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html

License: Attribution, NonCommercial, NoDerivatives
