How to use Elasticsearch to find Collocations and Statistically Improbable Phrases


Mike
  1. What is the best way to find collocations of terms?  Is it by indexing documents with the shingle token filter and then running a terms facet on that field?

  2. How can I get a list of statistically improbable phrases for a document?  As I understand it, unlike the list of highly frequent phrases in the previous question, this list would also need to take the IDF score into account.  Is there any way to extract that from Elasticsearch?  This is roughly the reverse of what a query returns: a list of phrases that score highly for a given document, rather than a list of documents that score highly for a given phrase.
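
To make the second question concrete, here is a minimal sketch of the scoring described above, done outside Elasticsearch in plain Python. It assumes whitespace tokenization, two-word shingles, and a smoothed tf * idf weight; the function names and the smoothing are illustrative choices, not anything Elasticsearch exposes.

```python
import math
from collections import Counter

def shingles(text, size=2):
    """Return word n-grams (shingles) of the given size from lowercased text."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

def improbable_phrases(doc, corpus, size=2, top_n=5):
    """Rank the phrases of `doc` by tf * idf against `corpus` (a list of texts)."""
    # Document frequency: in how many corpus documents each phrase appears.
    df = Counter()
    for text in corpus:
        df.update(set(shingles(text, size)))
    # Term frequency: how often each phrase appears in the target document.
    tf = Counter(shingles(doc, size))
    n_docs = len(corpus)
    scores = {
        phrase: count * math.log((1 + n_docs) / (1 + df[phrase]))
        for phrase, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

A phrase that appears in every document of the corpus gets a score of zero here, while a phrase that is frequent in one document and rare elsewhere rises to the top, which is the "reverse query" behavior asked about.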

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Re: How to use Elasticsearch to find Collocations and Statistically Improbable Phrases

Otis Gospodnetic
Hi Mike,

Are you looking to get this via ES because it looks like the best/easiest route, or for some other reason?  A long time ago I ran a social bookmarking service where I took the top N matches (with tags) and post-processed them.  It worked OK with the top 200 hits.

I wouldn't shingle at index time.  You would be shingling all of your content, and the index will balloon.
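
As a rough illustration of the post-processing route, here is a sketch that finds collocations client-side from the text of the top N hits instead of from index-time shingles. It assumes you have already pulled the hit texts into a list of strings, and it scores adjacent word pairs by pointwise mutual information (PMI), which is one common choice and not necessarily what I used back then.

```python
import math
from collections import Counter

def collocations(texts, min_count=2, top_n=10):
    """Score adjacent word pairs from `texts` by pointwise mutual information."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        words = text.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())
    scored = []
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_pairs
        p_a = unigrams[a] / total_words
        p_b = unigrams[b] / total_words
        scored.append((f"{a} {b}", math.log2(p_pair / (p_a * p_b))))
    # Highest PMI first; ties are broken alphabetically for stable output.
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

With a couple of hundred hits this runs in memory without touching the index; the `min_count` cutoff is an assumption you would tune for your data.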

Have a look at http://sematext.com/products/key-phrase-extractor/index.html - it's a Java lib you can use for getting Collocations and SIPs (and their hybrids).  It's unrelated to (Elastic)Search, but people sometimes use it in document processing pipelines to tag documents before indexing them (so you don't have to shingle everything in a doc, just add key terms/phrases as "tags").  You could then facet on those tags, index them in a separate highly weighted field, use them to detect trending topics in a stream of data, etc.
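
The tagging step in such a pipeline could look roughly like the sketch below. Everything in it is hypothetical: `toy_extractor` is only a stand-in for whatever key phrase extractor you plug in (it is not the API of the Java lib above), and the `body` and `tags` field names are assumptions.

```python
from collections import Counter

def toy_extractor(text):
    """Stand-in extractor (hypothetical): return word pairs that occur more than once."""
    words = text.lower().split()
    counts = Counter(zip(words, words[1:]))
    return [" ".join(pair) for pair, n in counts.items() if n > 1]

def tag_document(doc, extract_key_phrases, max_tags=5):
    """Return a copy of `doc` with a `tags` list built from its `body` text."""
    phrases = extract_key_phrases(doc["body"])
    tagged = dict(doc)  # shallow copy, so the original document is left untouched
    tagged["tags"] = list(dict.fromkeys(phrases))[:max_tags]  # dedupe, keep order
    return tagged

doc = {"id": 1, "body": "solar panels cut costs and solar panels last long"}
tagged = tag_document(doc, toy_extractor)
```

You would then index `tagged` instead of `doc`, so only the short `tags` field carries phrases rather than every shingle in the body.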

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Friday, July 5, 2013 9:31:42 PM UTC-4, Mike wrote:
  1. What is the best way to find collocations of terms?  Is it by indexing documents using the shingles filter, and then performing a terms facet on that field?

  2. How can I get a list of statistically improbable phrases for a document?  From what I understand, instead of a list of highly frequent phrases like in the previous one, it would need to be a list based on the IDF score as well.  Is there any way for me to extract that from elastic search?  This is kind of the reverse of what a query returns, a list of phrases that would have a high score for a given document instead of a list of documents with a high score for a given phrase.
