how to lower the significance of a certain phrase

how to lower the significance of a certain phrase

Yehosef Shapiro
People using our search often type "how to <something>", e.g. "how to paint my kitchen". This can return results such as "tips to paint my kitchen" or "how to paint my bathroom". The phrase "how to" is generic, and I would like to minimize its significance. I don't want to remove it completely, because I would still like a post called "how to paint my kitchen cabinets" to rank higher than "should I wallpaper or paint my kitchen".

I don't want it to be a stopword because it still has value (as in the example).

The Common Terms query might work, but I don't necessarily want to apply its rules to all other common phrases. That might be a good idea, but this is a specific common search term that I know people search for, and I would like to solve it specifically for this case if possible.

I don't think a negative boost is what I want, because I don't want documents to be penalized for containing the words "how to"; those words should just earn a much smaller boost.

Any suggestions on how to approach this? For the record, I'm using the BM25 similarity algorithm.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/acd86fb2-ae69-40be-a772-c65d008f2415%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: how to lower the significance of a certain phrase

joergprante@gmail.com
You cannot penalize terms; you can only reward them. The trick is to reward important terms, so that all other (unwanted and unknown) terms are effectively penalized. One method is to analyze sentences for grammar (part-of-speech tagging), reward nouns or other keywords with boost values, and use an extended similarity algorithm.
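
As a rough illustration of "reward the important terms", here is a minimal Python sketch that builds an Elasticsearch bool query with one boosted match clause per token. It uses plain query-time boosts, not payloads or a custom similarity, so it is a simpler stand-in for the approach described above. The tag names, the boost values, the `title` field, and the `boosted_query` helper are all assumptions made for the example, and the tags are hard-coded where a real setup would call a POS tagger.

```python
import json

# Hypothetical boost per part-of-speech tag; the values are illustrative only.
POS_BOOSTS = {"NOUN": 3.0, "VERB": 2.0}
# Small reward for everything else, e.g. "how", "to", "my".
DEFAULT_BOOST = 0.2


def boosted_query(tagged_tokens, field="title"):
    """Build a bool query with one boosted match clause per (token, tag) pair."""
    should = []
    for token, tag in tagged_tokens:
        boost = POS_BOOSTS.get(tag, DEFAULT_BOOST)
        should.append({"match": {field: {"query": token, "boost": boost}}})
    return {"query": {"bool": {"should": should}}}


# Tags are hard-coded here; in practice they would come from a POS tagger.
tagged = [
    ("how", "ADV"),
    ("to", "PART"),
    ("paint", "VERB"),
    ("my", "PRON"),
    ("kitchen", "NOUN"),
]
query = boosted_query(tagged)
print(json.dumps(query, indent=2))
```

Nothing is penalized here: "how" and "to" still contribute to the score, just with a much smaller weight than "paint" and "kitchen".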

You can use UIMA, OpenNLP, or Stanford NLP for POS tagging and try to implement payload-based scoring, something like this demo code:

https://github.com/jprante/elasticsearch-payload

My demo code does not work; I'm not sure where I made a mistake.

Jörg


Re: how to lower the significance of a certain phrase

Doug Turnbull
In reply to this post by Yehosef Shapiro
Yehosef, this sounds very similar to some title search work I've done. Title fields are odd because TF is often meaningless, and IDF can also be quite skewed. If only a few titles have "how" in the text, then you'll get very odd results.

Read more here:
http://opensourceconnections.com/blog/2014/12/08/title-search-when-relevancy-is-only-skin-deep/
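
To make the IDF point concrete, here is a small sketch that computes the inverse document frequency the way Lucene's BM25 similarity defines it, ln(1 + (N - n + 0.5) / (n + 0.5)). The document count and the per-term document frequencies are made-up numbers for illustration, not measurements from any real index.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """IDF as defined by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index of 100,000 titles; document frequencies are invented.
doc_count = 100_000
for term, doc_freq in [("how", 50), ("paint", 2_000), ("kitchen", 5_000)]:
    idf = bm25_idf(doc_freq, doc_count)
    print(f"{term:8s} df={doc_freq:5d} idf={idf:.2f}")
```

With numbers like these, a rare "how" ends up with a higher IDF than "paint" or "kitchen", so a generic word can outweigh the terms that actually describe the topic.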



--
Doug Turnbull Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com 
Author: Taming Search from Manning Publications

Re: how to lower the significance of a certain phrase

Yehosef Shapiro
In reply to this post by joergprante@gmail.com
Thanks for this. So I could basically strip out the unwanted terms, then do the search with two clauses: one with the original search phrase at a lower weight and another with the "cleaned" search phrase at a higher weight.
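
A minimal sketch of that two-clause idea in Python, assuming a `title` field and illustrative boost values. The `GENERIC_PHRASES` list and the helper names `clean_query` and `two_clause_query` are hypothetical; only "how to" comes from this thread.

```python
import json
import re

# Phrases to down-weight; only "how to" comes from the thread.
GENERIC_PHRASES = ["how to"]


def clean_query(text):
    """Remove generic phrases from the query and collapse whitespace."""
    cleaned = text.lower()
    for phrase in GENERIC_PHRASES:
        cleaned = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", cleaned)
    return " ".join(cleaned.split())


def two_clause_query(text, field="title", original_boost=0.2, cleaned_boost=2.0):
    """Combine a low-boost clause on the full query with a high-boost cleaned one."""
    should = [{"match": {field: {"query": text, "boost": original_boost}}}]
    cleaned = clean_query(text)
    if cleaned:  # skip the second clause if nothing is left after cleaning
        should.append({"match": {field: {"query": cleaned, "boost": cleaned_boost}}})
    return {"query": {"bool": {"should": should}}}


print(json.dumps(two_clause_query("how to paint my kitchen"), indent=2))
```

The original phrase still contributes, so "how to paint my kitchen cabinets" can edge out "should I wallpaper or paint my kitchen", but most of the score comes from the cleaned clause.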


Re: how to lower the significance of a certain phrase

Yehosef Shapiro
In reply to this post by Doug Turnbull
Because we're using BM25, I think this is less of a concern in general (there is a good chart in http://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html).
We also disable norms on title fields (http://stackoverflow.com/questions/20222652/elasticsearch-when-to-set-omit-norms-option-as-false), FWIW.
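
Assuming the chart in question is the term-frequency saturation one, the effect can be sketched numerically. The snippet below compares the classic TF/IDF term-frequency factor, sqrt(tf), with the textbook BM25 factor tf * (k1 + 1) / (tf + k1). With norms disabled there is no field-length normalization, which corresponds to b = 0 in the BM25 formula. The default k1 = 1.2 and the function names are assumptions for the example.

```python
import math


def classic_tf(freq):
    """Term-frequency factor of classic Lucene TF/IDF: keeps growing with freq."""
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2):
    """BM25 term-frequency factor without length normalization (b = 0)."""
    return freq * (k1 + 1) / (freq + k1)


for freq in (1, 2, 5, 10, 50):
    print(f"tf={freq:2d} classic={classic_tf(freq):.2f} bm25={bm25_tf(freq):.2f}")
```

The BM25 factor never exceeds k1 + 1, so repeating a term many times in a field adds very little beyond the first few occurrences.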

Thanks for the link, good info. I'm leaning toward something like the keepWordFilter you recommend, but doing it at query time instead of index time. It doesn't seem like I need to use the memory to store both "Socrates and Plato on Metaphysics" and "Socrates Plato Metaphysics". It seems better to make the distinction at query time, and the performance should be the same because I need two search clauses anyway.

