Ignore a field in the scoring

RogerCF
Hello

Our documents have metadata indexed with them, but we don't want the metadata to interfere with the scoring.

After a user searches for documents, they can bookmark them (which means we add more metadata to the document). In the next search with the same query, the bookmarked document appears in a lower (worse) position.

Is there a way to completely ignore one or more specific fields in the scoring of every query, for example at indexing time?

Note that we are not using the metadata field in the query, yet it lowers the score for every query.

We cannot set the "index" attribute of this field to "no" because we are going to use it in other queries.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAJp2533Rjjec4SwXe_p-0eHYkkyEegFyP9DUMGQfHhua8ZyMWQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: Ignore a field in the scoring

Ivan Brusic

Use the field in a filter and not as part of the query. Is this field free text?

Ivan
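
A rough sketch of the filter suggestion above, assuming an Elasticsearch 1.x cluster (where the `filtered` query is available) and the `name` and `bookmarked_by` fields that come up later in this thread. The helper name `build_filtered_query` and the profile id are made up for illustration. The point is only that a filter restricts which documents match and contributes nothing to `_score`.

```python
import json


def build_filtered_query(text, profile_id):
    """Build an Elasticsearch 1.x style search body (hypothetical helper).

    The query_string part is scored as usual; the term filter on
    bookmarked_by only includes or excludes documents.
    """
    return {
        "query": {
            "filtered": {
                # Scored part: full-text search on the name field.
                "query": {
                    "query_string": {"fields": ["name"], "query": text}
                },
                # Unscored part: keep only documents bookmarked by this profile.
                "filter": {"term": {"bookmarked_by": profile_id}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_filtered_query("roger", 1), indent=2))
```

The resulting body would be sent to the `_search` endpoint like any other query. Newer Elasticsearch versions express the same idea differently, so treat the exact shape as version-dependent.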

Re: Ignore a field in the scoring

Doug Turnbull
Are you querying the _all field? How are you doing your searches?
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html

The _all field receives a copy of every field you index, so adding data here could impact scores regardless of the source field.

Otherwise, fields are scored independently before being put together by other queries like boolean queries or dismax. Are you using boolean/dismax/etc over multiple fields?

-Doug

--
Doug Turnbull
Search & Big Data Architect

Re: Ignore a field in the scoring

RogerCF
The added field is an array of integers, but we are not using it in the query at all.

We are not querying the _all field; it is disabled in our type mapping.

Our query is something like this:

{
  "query": {
    "query_string": {
      "fields": [
        "name"
      ],
      "query": "roger"
    }
  }
}

I ran this query. To the first result, I added a new field called "bookmarked_by" with a numeric value. Then I ran the same query again. The document to which I added the new field is no longer the first result.

Re: Ignore a field in the scoring

RogerCF
Now I ran the query with explain = true. The results are the following:


Explain before the update:
 
          "details": [
            {
              "value": 5.752348,
              "description": "fieldWeight in 424, product of:",
              "details": [
                {
                  "value": 1,
                  "description": "tf(freq=1.0), with freq of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0"
                    }
                  ]
                },
                {
                  "value": 9.203756,
                  "description": "idf(docFreq=201, maxDocs=738240)"
                },
                {
                  "value": 0.625,
                  "description": "fieldNorm(doc=424)"
                }
              ]
            }
          ]


Update script (scriptLang = groovy, profileId = 1):

if (ctx._source.bookmarked_by == null) {
    ctx._source.bookmarked_by = [profileId]
} else if (ctx._source.bookmarked_by.contains(profileId)) {
    ctx.op = "none"
} else {
    ctx._source.bookmarked_by += profileId
}


Explain after the update:

          "details": [
            {
              "value": 5.749262,
              "description": "fieldWeight in 0, product of:",
              "details": [
                {
                  "value": 1,
                  "description": "tf(freq=1.0), with freq of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0"
                    }
                  ]
                },
                {
                  "value": 9.198819,
                  "description": "idf(docFreq=202, maxDocs=738241)"
                },
                {
                  "value": 0.625,
                  "description": "fieldNorm(doc=0)"
                }
              ]
            }
          ] 


Query used with the explain:

{
  "query": {
    "query_string": {
      "fields": [
        "name"
      ],
      "query": "roger"
    }
  }
}




The inverse document frequency (idf) changed after adding a new field that is not used in the query. Also, "fieldWeight in 424" and "fieldNorm(doc=424)" became "fieldWeight in 0" and "fieldNorm(doc=0)" (I don't know whether that matters).

Can someone help me keep the document's score unchanged after running the update? Note that the update creates the new field if it was not found (== null), but this field is not used in the query.

Re: Ignore a field in the scoring

Masaru Hasegawa
In reply to this post by RogerCF
Hi,

An update is a delete followed by an add. That is, instead of modifying the existing document in place, Elasticsearch deletes it and adds it again as a new document.
The deleted document is only marked as deleted; it isn’t actually removed from the index until a segment merge.

IDF doesn’t account for those deleted-but-not-yet-removed documents being gone (it still counts them).
That’s why you see a different IDF value (both maxDocs and docFreq were incremented by one).

Regarding 424 vs. 0: the document had ID 424 (Lucene’s internal ID). When it was updated (delete + add), it got the new ID 0 in a new segment.

So I think it’s not possible to keep the score stable when you update documents.
You could run optimize with max_num_segments=1 every time you update documents, but that’s not practical (and until the optimize finishes, you still see a different score).


Masaru
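
To make the arithmetic concrete, here is a small sketch that recomputes the two explain outputs from earlier in the thread. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)), and takes tf and fieldNorm straight from the explain output. The function names are just for illustration.

```python
import math


def classic_idf(doc_freq, max_docs):
    """idf in the style of Lucene's classic TF-IDF similarity (assumed formula)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def field_weight(tf, idf, field_norm):
    """fieldWeight is the product of tf, idf and fieldNorm, as in the explain output."""
    return tf * idf * field_norm


# Before the update: docFreq=201, maxDocs=738240.
before = field_weight(1.0, classic_idf(201, 738240), 0.625)
# After the update: the old copy is still counted, so both numbers grow by one.
after = field_weight(1.0, classic_idf(202, 738241), 0.625)

print(before, after)
```

The two results should land close to the 5.752348 and 5.749262 reported by explain, which suggests the whole score change comes from the two counters moving, not from the new field itself.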



Re: Ignore a field in the scoring

RogerCF
Thank you for your explanation

Do you know if it is a bug or intended behavior?

I don't think deleted (marked as deleted) docs should be used at all

Re: Ignore a field in the scoring

Masaru Hasegawa
Hi,

It says:
--
Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because also TermStatistics.docFreq() is used, and when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute
--

Masaru
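
A toy model of the behaviour described in that note, purely as an illustration: `ToySegmentIndex` is a made-up class, not Lucene's real data structure. It only shows why a maxDoc-style count and a docFreq-style count both include a document that has been marked as deleted until a merge drops it, while a numDocs-style count does not.

```python
class ToySegmentIndex:
    """Very rough stand-in for an index with delete markers (hypothetical)."""

    def __init__(self):
        # Each entry is [set_of_terms, deleted_flag].
        self.docs = []

    def add(self, terms):
        self.docs.append([set(terms), False])
        return len(self.docs) - 1

    def update(self, doc_id, terms):
        # An update marks the old copy as deleted and appends a new copy.
        self.docs[doc_id][1] = True
        return self.add(terms)

    def merge(self):
        # A merge physically drops the copies marked as deleted.
        self.docs = [d for d in self.docs if not d[1]]

    def max_doc(self):
        # Counts every stored copy, deleted or not.
        return len(self.docs)

    def num_docs(self):
        # Counts only live documents.
        return sum(1 for d in self.docs if not d[1])

    def doc_freq(self, term):
        # Like max_doc, still counts copies marked as deleted.
        return sum(1 for terms, _ in self.docs if term in terms)


index = ToySegmentIndex()
first = index.add(["roger"])
index.add(["other"])
index.update(first, ["roger"])
print(index.max_doc(), index.num_docs(), index.doc_freq("roger"))  # 3 2 2
index.merge()
print(index.max_doc(), index.num_docs(), index.doc_freq("roger"))  # 2 2 1
```

Both inflated counts move in the same direction, which is the argument the quoted note makes for pairing maxDoc with docFreq.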

Re: Ignore a field in the scoring

RogerCF
Thank you very much
