Significant terms aggregation with non tokenized text

Mike
I just tried the significant_terms aggregation on two text fields and noticed that it doesn't seem to work on the non-tokenized one. On my field analyzed with the keyword tokenizer, bg_count is 0 for every bucket, and the results look the same as a regular terms aggregation with slightly different counts. On my regular tokenized field, the results differ from the terms aggregation and bg_count is populated. Why is this?

Here are my 2 fields and analyzer:

"properties":{
    "query"      : {                                                              
        "type" : "multi_field",                                                   
        "fields" : {                                                              
            "query"          : { "type" : "string" },                             
            "queryUntouched" : { "type" : "string", "analyzer" : "myLowercaseAnalyzer" }      
        }                                                                         
    }
}                                                                        

"analyzer" : {                                                                    
    "myLowercaseAnalyzer" : {                                                     
        "tokenizer" : "keyword",                                                  
        "filter" : ["lowercase"]                                                  
    }                                                                            
}
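
To make the difference between the two sub-fields concrete, here is a rough Python simulation (not Elasticsearch code) of what each analyzer would index for a single value. The whitespace split is only a crude approximation of the default analyzer, and the function names and sample value are made up for illustration.

```python
def keyword_lowercase(text):
    """Mimic myLowercaseAnalyzer: keyword tokenizer plus lowercase filter.

    The keyword tokenizer emits the whole input as a single token.
    """
    return [text.lower()]


def default_like(text):
    """Crude stand-in for the default analyzer on "query":
    lowercase the input and split it on whitespace."""
    return text.lower().split()


sample = "Yield Curve"  # hypothetical field value
print(keyword_lowercase(sample))  # ['yield curve']
print(default_like(sample))       # ['yield', 'curve']
```

This is why the buckets on queryUntouched carry whole phrases such as "yield curve", while the buckets on query carry single words such as "bank".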

When I send the significant terms aggregation against queryUntouched it looks the same as a regular terms agg, with bg_count set to 0:

"aggs": {
    "pop": {
      "terms": {
        "field": "queryUntouched",
        "size": 3
      }
    },
    "sig": {
      "significant_terms": {
        "field": "queryUntouched",
        "size": 3
      }
    }
}


"aggregations": {
  "pop": {
    "buckets": [
      { "key": "yield curve", "doc_count": 102 },
      { "key": "gdp", "doc_count": 70 }
    ]
  },
  "sig": {
    "doc_count": 62804,
    "buckets": [
      { "key": "yield curve", "doc_count": 102, "score": 7.200895615143776, "bg_count": 0 },
      { "key": "gdp", "doc_count": 81, "score": 4.540783692447051, "bg_count": 0 }
    ]
  }
}

When I use the tokenized field, I get results that I would expect:
"aggs": {
    "pop": {
      "terms": {
        "field": "query",
        "size": 2
      }
    },
    "sig": {
      "significant_terms": {
        "field": "query",
        "size": 2
      }
    }
  }


"aggregations": {
  "pop": {
    "buckets": [
      { "key": "bank", "doc_count": 1423 },
      { "key": "of", "doc_count": 641 }
    ]
  },
  "sig": {
    "doc_count": 62804,
    "buckets": [
      { "key": "bank", "doc_count": 1423, "score": 0.03191767117787348, "bg_count": 25686 },
      { "key": "id", "doc_count": 715, "score": 0.017449718916743313, "bg_count": 12274 }
    ]
  }
}






--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e7a41870-bb42-46f5-9161-dbeb6c847ad2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: Significant terms aggregation with non tokenized text

Mark Harwood-2
Unlike the terms agg, which only accesses the content loaded into RAM (aka FieldData), the significant_terms agg also has to go to disk to check the frequency of terms in the index for the background count. Because the two data sources are looked up differently, the field naming conventions can sometimes differ. Can you try prefixing the field name used by the significant_terms agg with "query", e.g. "field": "query.queryUntouched"?
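
Spelled out, the suggested change might look like the sketch below. It only builds and prints the request body; the helper name sig_terms_body is hypothetical, and whether the full path actually fixes bg_count is something to verify against your own index.

```python
import json


def sig_terms_body(parent, sub_field, size=3):
    """Build an aggs body where significant_terms uses the full field path.

    The terms agg keeps the short name from the original request;
    only the significant_terms agg gets the "parent.sub_field" form.
    """
    full_path = f"{parent}.{sub_field}"
    return {
        "aggs": {
            "pop": {"terms": {"field": sub_field, "size": size}},
            "sig": {"significant_terms": {"field": full_path, "size": size}},
        }
    }


body = sig_terms_body("query", "queryUntouched")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the same search request as before. Note that field names are case-sensitive, so the sub-field has to be spelled exactly as in the mapping.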

On Friday, September 26, 2014 2:57:10 AM UTC+1, Mike wrote:
I just tried the significant_terms aggregation on two text fields and noticed that it doesn't seem to work on the non-tokenized one. On my field analyzed with the keyword tokenizer, bg_count is 0 for every bucket, and the results look the same as a regular terms aggregation with slightly different counts. On my regular tokenized field, the results differ from the terms aggregation and bg_count is populated. Why is this?
