Performance on indices for each language

Felipe Hummel
Hi, we currently have a single index in production with around 50M documents. The documents are mainly in three different languages, and each one has a created_at field.

Our most common use case is searching documents in any language, but we also let users restrict a search to one specific language. Likewise for time intervals: the common case is to search the entire dataset, but we also have a use case where we only need to search documents from the last ~1 year.

Some of our queries currently pay the price of searching a 50M-document index when they only need to search fewer than 10M documents. Filters can help with this, but they only skip the scoring of a document; query processing still has to go through the whole gigantic posting list, which is potentially on disk. (For example, the posting list for "the" in Spanish documents is tiny compared to the one in English documents.)
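
The cost described above can be illustrated with a toy inverted index. This is only a sketch of the mental model in this post, not of how Lucene actually executes a filtered search (real postings carry skip data and other optimizations). The documents, the `lang` and `text` fields, and the function names are all made up for the illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the list of doc ids containing it (a posting list)."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term in set(doc["text"].split()):
            postings[term].append(doc_id)
    return postings

def search_with_filter(docs, postings, term, lang):
    """Walk the full posting list, then drop documents in other languages."""
    visited = postings.get(term, [])
    hits = [d for d in visited if docs[d]["lang"] == lang]
    return hits, len(visited)

# Hypothetical corpus: 8 English documents and 2 Spanish ones sharing a term.
docs = (
    [{"lang": "en", "text": "the cat"}] * 8
    + [{"lang": "es", "text": "the gato"}] * 2
)

# One combined index: the filter is applied to every posting of "the".
combined = build_index(docs)
hits, visited = search_with_filter(docs, combined, "the", "es")

# A Spanish-only index: the posting list itself is already small.
spanish_only = build_index([d for d in docs if d["lang"] == "es"])
visited_split = len(spanish_only["the"])

print(visited, visited_split)  # prints: 10 2
```

In this toy model the combined index touches all 10 postings to return 2 hits, while the per-language index touches only 2.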

My questions are about performance. Our main concern right now is query latency. Our system issues many queries with many OR clauses, which easily makes query latency a pain: depending on the terms, a query can take up to dozens of seconds.

- First, if I search multiple indices, will the searches run in "parallel"? For example, I have "alias1" over two indices, "index1" and "index2". When searching "alias1", will the search in "index1" run in parallel with the search in "index2", or will they be executed in sequence?
- Second, what are the implications of "cutting" the dataset into 3 indices, one for each language? What will be the performance difference between searching 3 indices and searching 1 index with all documents?
- Third, separating indices by language and then searching all indices together would mess up the scoring, right? IDFs for words can vary greatly between languages (would we need to change the search type?).
- Finally, would it be a good idea to "cut" the index by time interval (for example, one index per year's worth of documents, for each language)?
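
To make the third question concrete, here is a small sketch of how far apart IDF values can be when term statistics are computed separately for each index versus over all documents together. It uses the classic Lucene TF-IDF formula, idf = 1 + ln(N / (df + 1)). The index names and all document counts below are hypothetical, chosen only to show the effect.

```python
import math

def idf(num_docs, doc_freq):
    """Classic Lucene TF-IDF inverse document frequency: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics for one term that is common in English
# documents and rare in Spanish ones.
stats = {
    "index_en": {"num_docs": 30_000_000, "doc_freq": 3_000_000},
    "index_es": {"num_docs": 10_000_000, "doc_freq": 1_000},
}

# IDF computed from each index's own statistics.
per_index = {name: idf(s["num_docs"], s["doc_freq"]) for name, s in stats.items()}

# IDF computed from the statistics of all documents together.
total_docs = sum(s["num_docs"] for s in stats.values())
total_df = sum(s["doc_freq"] for s in stats.values())
global_idf = idf(total_docs, total_df)

for name, value in per_index.items():
    print(f"{name}: {value:.2f}")
print(f"global: {global_idf:.2f}")
```

With local statistics the same term weighs far more in the Spanish index than in the English one, so scores from the two indices are not directly comparable. As far as I understand it, the `dfs_query_then_fetch` search type exists for this situation: it first collects term statistics from the shards involved and then scores with the combined values, at the cost of an extra round trip.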

This all assumes the same number of machines/shards/replicas. We currently have 16 shards on 8 m1.large EC2 instances.

Thanks!

Felipe Hummel

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Re: Performance on indices for each language

Andrew Gaydenko
Do you use filtered queries? http://www.elasticsearch.org/guide/reference/query-dsl/filtered-query/

Re: Performance on indices for each language

Felipe Hummel
Not really, we normally use request-level filters (http://www.elasticsearch.org/guide/reference/api/search/filter/).
We only use filtered queries when we need a facet to be calculated with the filters taken into account.

I assume there's no performance difference.
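
For readers comparing the two forms discussed here, this is a sketch of the two request bodies as they are described in the linked documentation of that era. The `text` and `lang` field names and the helper function names are hypothetical; only the body shapes come from the docs.

```python
import json

def request_level_filter(query, filter_):
    """Search body with a top-level filter: applied to the hits, not to facets."""
    return {"query": query, "filter": filter_}

def filtered_query(query, filter_):
    """Search body with a filtered query: the filter restricts what the query sees."""
    return {"query": {"filtered": {"query": query, "filter": filter_}}}

query = {"match": {"text": "example"}}   # hypothetical field name
lang_filter = {"term": {"lang": "es"}}   # hypothetical field name

print(json.dumps(request_level_filter(query, lang_filter), indent=2))
print(json.dumps(filtered_query(query, lang_filter), indent=2))
```

Because the two forms apply the filter at different stages, whether they perform the same is probably worth measuring on real data rather than assuming.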

In reply to Andrew Gaydenko's post of Sunday, June 2, 2013 5:18:30 AM UTC-4

Re: Performance on indices for each language

Felipe Hummel
Does anyone have an opinion?

In reply to Felipe Hummel's post of Sunday, June 2, 2013 12:27:17 PM UTC-4

Re: Performance on indices for each language

Otis Gospodnetic
In reply to this post by Felipe Hummel
Hi,

One element that people often don't talk about when discussing search over documents in multiple languages is the UI/UX. What does it need to look like, and how much flexibility do you have there? I understand some queries need to search all languages, but that doesn't necessarily mean the UI needs to show a single result set. If the presentation layer allows separation by language, I would go with an index-per-language model, which is cleaner and simpler.
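
As a rough sketch of what an index-per-language setup could look like, the snippet below builds the body for a `POST /_aliases` request that groups per-language indices under one alias, so that "all languages" searches can still target a single name. The index and alias names are hypothetical (the thread does not name all three languages), and the body is only built here, not sent.

```python
import json

def alias_actions(alias, indices):
    """Build an _aliases request body that adds every index to one alias."""
    return {"actions": [{"add": {"index": index, "alias": alias}} for index in indices]}

# Hypothetical names: one index per language, plus an alias covering all of them.
body = alias_actions("docs_all", ["docs_en", "docs_es", "docs_pt"])
print(json.dumps(body, indent=2))
```

A single-language search would then go to one index (for example `docs_es`), while a search across languages would go to the alias.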

The second part of your email is about searching all content vs. a subset of content depending on the user's time range selection, or something along those lines. For this you could consider having multiple indices for different time frames and a search client that knows which index holds documents in which time range and issues queries only to the relevant indices. Alternatively, this might be doable with routing on a field that contains a date or a part of it.
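
The client-side part of that idea could look something like the sketch below: given a time range, work out which yearly indices to query. The `docs-<lang>-<year>` naming scheme and the function name are assumptions made for this example, not something Elasticsearch prescribes.

```python
from datetime import date

def indices_for_range(languages, start, end, prefix="docs"):
    """Return index names like 'docs-es-2013' for every year touched by [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    return [
        f"{prefix}-{lang}-{year}"
        for lang in languages
        for year in range(start.year, end.year + 1)
    ]

# "Last ~1 year" of Spanish documents, as of the date of this thread.
names = indices_for_range(["es"], date(2012, 6, 2), date(2013, 6, 2))
print(",".join(names))  # prints: docs-es-2012,docs-es-2013
```

The comma-joined list can be used in place of a single index name in the search URL. A created_at range filter is still needed on top, since a yearly index holds more than the exact interval requested.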

Otis
--
Solr & ElasticSearch Support - http://sematext.com/
Performance Monitoring - http://sematext.com/spm/index.html


In reply to Felipe Hummel's post of Sunday, June 2, 2013 5:06:16 AM UTC-4