Performance killed when faceting on high cardinality fields


Performance killed when faceting on high cardinality fields

Otis Gospodnetic
Hi,

We're doing some ES performance testing with a relatively small index. All is peachy until we want to facet on a field with relatively high cardinality - in this case a "tags" field that, as you can imagine, has a high number of distinct values across all documents in the index.
When we include faceting on tags in our queries, performance sinks from over 400 QPS to 20-30 QPS, and average latency jumps from 40 ms to 500 ms.

Is there anything in ES that one can use to improve performance in such cases? 

In Solr land there are 2 faceting methods, one of which is designed for "situations where the number of indexed values for the field is high, but the number of values per document is low":

Field Cache: If facet.method=fc then a field-cache approach will be used. This is currently implemented using either the Lucene FieldCache or (starting in Solr 1.4) an UnInvertedField if the field either is multi-valued or is tokenized (according to FieldType.isTokened()). Each document is looked up in the cache to see what terms/values it contains, and a tally is incremented for each value. This is excellent for situations where the number of indexed values for the field is high, but the number of values per document is low. For multi-valued fields, a hybrid approach is used that uses term filters from the filterCache for terms that match many documents.
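For comparison, this is roughly how that method is requested in Solr (core name and field are illustrative):

curl 'http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=tags&facet.method=fc'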


I didn't see anything like this in ES docs and I'm wondering if there is room for improvement in ES faceting or....?

Thanks,
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html


Re: Performance killed when faceting on high cardinality fields

Alex at Ikanow
Otis,

I think this is similar to, or the same as, the issue I raised here:


Interesting that the effect you saw was increased latency; in 0.18 the construction would always be fast but would run out of memory (unless you used "soft" caching, in which case it would just "swap" data in/out a lot, which would obviously increase the latency; but caches are by default "hard"). 
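For reference, a rough sketch of switching the field cache to "soft" references at that time; the setting names below are as I recall them from the 0.18/0.19 docs, so double-check them against your version:

# elasticsearch.yml
index.cache.field.type: soft
# optionally bound the cache instead of (or in addition to) using soft references
index.cache.field.max_size: 50000
index.cache.field.expire: 10m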

Workarounds: nesting helps a lot, provided you don't need to do custom sorts. Where you do need to sort, creating separate indexes for the "worst" offenders (e.g. the highest 1% of array sizes) worked very well.
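A minimal sketch of the nesting workaround, assuming the tags are remodeled as nested objects (index, type, and field names are purely illustrative):

curl -XPUT 'localhost:9200/myindex' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "tags": {
          "type": "nested",
          "properties": {
            "name": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}'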

I think Shay mentioned during the discussion that improving the field cache to handle large field cardinalities was on his todo list.

Alex
Ikanow: agile intelligence through open analytics 


Re: Performance killed when faceting on high cardinality fields

Otis Gospodnetic
Hi,

On Tuesday, May 22, 2012 5:56:21 PM UTC-4, Alex at Ikanow wrote:
Otis,

I think this is similar/the same to the issue I raised here:


Interesting that the effect you saw was increased latency; in 0.18 the construction would always be fast but would run out of memory (unless you used "soft" caching, in which case it would just "swap" data in/out a lot, which would obviously increase the latency; but caches are by default "hard"). 

Right.  We're about to do a round of performance tests and use SPM for ES to look at all ES cache stats.
 
Workarounds: nesting helps a lot, provided you don't need to do custom sorts. Where you do need to sort, creating separate indexes for the "worst" offenders (eg highest 1% of array sizes) worked very well.

I think Shay mentioned during the discussion that improving the field cache to handle large field cardinalities was on his todo list

Uh, that would be great!
Shay, is there an issue we should watch, and do you know if this will be in 0.20?

Thanks!
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

 

Re: Performance killed when faceting on high cardinality fields

Otis Gospodnetic
Hello,


Shay, for what it's worth, I did some thread dumping while running a performance test with faceting on a high-cardinality field, and identified what looks like a hotspot:

"elasticsearch[search]-pool-6-thread-22" daemon prio=10 tid=0x00002ab2e8183800 nid=0x3681 runnable [0x0000000049f03000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:239)
        at org.apache.lucene.util.PriorityQueue.updateTop(PriorityQueue.java:202)
        at org.elasticsearch.search.facet.terms.strings.TermsStringOrdinalsFacetCollector.facet(TermsStringOrdinalsFacetCollector.java:168)
        at org.elasticsearch.search.facet.FacetPhase.execute(FacetPhase.java:138)
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:203)

I have not looked at that code yet, but you probably know what's on line 168 by heart. Is there any chance that something could be optimized there?
And should I open an issue with the above, or is this a known thing with an issue open already?

Also, I see *strings* in the package name.
Do you think performance would be any better if we somehow replaced string tokens with, say, int tokens?

Thanks,
Otis
--
Scalable Performance Monitoring - http://sematext.com/spm/index.html
 

Re: Performance killed when faceting on high cardinality fields

Andy Wick
I switched from string tags to short tags and increased the amount of data I could load in memory before hitting OOM.

See https://groups.google.com/d/msg/elasticsearch/xsMmFDuSVCM/gPzXVBzLlBkJ   for all the things I've done so far.

This week I finally got my new machines with more memory, and that obviously has helped the most :)

Thanks,
Andy

Re: Performance killed when faceting on high cardinality fields

Otis Gospodnetic
Hi Andy,

On Thursday, May 24, 2012 8:15:16 AM UTC-4, Andy Wick wrote:
I switched from string tags to short tags and increased the amount of data I could load in memory before hitting OOM.


OK, so these are the relevant points for us from that thread: 
* I switched from string tags to short tags, with a separate index/type that holds the string-to-short conversion. I create the conversion on the fly using the _version technique with yet another index/type (see http://blogs.perl.org/users/clinton_gormley/2011/10/elasticsearchsequence---a-blazing-fast-ticket-server.html).
* Since I'm using multi-valued shorts, I reduced the max number of tags per document, which really helped.
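A rough sketch of what the first point could look like in the main index mapping (index, type, and field names are illustrative, not Andy's actual setup):

curl -XPUT 'localhost:9200/main/doc/_mapping' -d '{
  "doc": {
    "properties": {
      "tags": { "type": "short" }
    }
  }
}'
# tags are then indexed as the numeric ids handed out by the separate conversion index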

In our case the problem is not OOM -- we have servers with > 90 GB RAM.
Our problem is speed - query latency.

So, would you happen to know if either of the above changes had a positive effect on query speed?

Thanks,
Otis
--
Scalable Performance Monitoring - http://sematext.com/spm/index.html


Re: Performance killed when faceting on high cardinality fields

Otis Gospodnetic
In reply to this post by Andy Wick
Hi,

On Thursday, May 24, 2012 8:15:16 AM UTC-4, Andy Wick wrote:
I switched from string tags to short tags and increased the amount of data I could load in memory before hitting OOM.


Re that _version trick.  Is the following what you did?

Create 2 indices:
  1) the main index
  2) the tags-to-sequence-number-generator-via-version trick index

Index 2) is sent each tag of the doc to be indexed and returns a distinct number for each new tag via _version.
This converts tags to numbers and lets you index tags as numbers in the main index.

Something like this:
doc1:
  tags: a b c
doc 2:
  tags: a foo bar

At index time this happens for doc 1:
* send tag a to index 2) and get some int back, say 1
* send tag b to index 2) and get back 2
* send tag c to index 2) and get back 3

index this doc in main index

Then for doc 2:
* send tag a to index 2) and get back 1 (again!)
* send tag foo to index 2) and get back 4
* send tag bar to index 2) and get back 5

index this doc in main index.

Then, at search time, you facet on the tags field that is now multi-valued and numeric (and not multi-valued and string).

So ES could return facet (count) as follows:

1 (2)
2 (1)
3 (1)
4 (1)
5 (1)

And then you use index 2) to look up 1, 2, 3, 4, and 5 and get back the original string values of those tags, thus allowing you to show this to the end user:

a (2)
b (1)
c (1)
foo (1)
bar (1)

Something like that?

If so, doesn't this considerably slow down your indexing?
And doesn't it actually add search latency?

In your case speed may not matter as much as memory footprint.  In our case we need to index a few thousand documents a second and handle > 100 QPS with 90th percentile latency < 100 ms.

Thanks,
Otis
--
Scalable Performance Monitoring - http://sematext.com/spm/index.html


Re: Performance killed when faceting on high cardinality fields

Andy Wick
As for the speed increase from switching from strings to shorts: I didn't measure it. Subjectively it felt faster, but that could also have been from reduced loading from disk. My deployment is billions of documents, 6k new docs a second, but EXTREMELY low QPS (basically 0) with no real response-time requirements.

Currently I actually have 3 types in 2 indexes (although you could do it differently)
1) elements/element
2) tags/tag
3) tags/sequence  (only has 1 document currently, and in theory could live in your tag type, but I kept it separate)

I have about 15k tags right now. I bulk index everything, and my indexer stays running, so on start it loads all the tags and caches them. It only hits ES if it doesn't have the tag in its cache, in which case it does a GET to make sure the tag wasn't already added by another indexer; if it isn't there, it gets a new sequence number and then does a POST with op_type=create to handle the possible race condition of multiple indexers creating the same tag. I do the same caching on the viewer side, where I map the numbers back to strings before returning to the user.
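A minimal sketch of that flow against the HTTP API, assuming the index/type names from the list above and a made-up tag "foo":

# cache miss: has another indexer already registered the tag?
curl -XGET 'localhost:9200/tags/tag/foo'

# not found: claim a new sequence number via the _version trick
# (re-indexing the single sequence document bumps its _version; use that as the numeric id)
curl -XPOST 'localhost:9200/tags/sequence/1' -d '{}'

# register the string->number mapping; op_type=create turns the race into a conflict error
curl -XPUT 'localhost:9200/tags/tag/foo?op_type=create' -d '{"id": 42}'
# on a create conflict, another indexer won the race: GET the tag again and use its id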

If you don't have long-running indexers and viewers, or you have an uncachable number of tags, then I agree the extra lookups will hurt performance.


Another suggestion that might be easier: if your tags can be split up into categories and you don't need to facet on all the categories each time, then splitting them into multiple fields should help. The maximum number of tags per document really seems to affect memory (and, I'm assuming, performance). Of course, if you need to facet on everything, then I doubt it will help, and it will probably hurt performance.
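A rough sketch of that split in mapping form; the category field names here are made up purely for illustration:

curl -XPUT 'localhost:9200/elements/element/_mapping' -d '{
  "element": {
    "properties": {
      "tags_topic":  { "type": "short" },
      "tags_source": { "type": "short" },
      "tags_misc":   { "type": "short" }
    }
  }
}'
# each query then facets only on the category field(s) it actually needs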


Thanks,
Andy

Re: Performance killed when faceting on high cardinality fields

kimchy
Administrator
Hey,

  Otis, are you using any special configuration on the field cache? Are you also indexing data while searching?

   The profiling section that you saw is known; it's basically the time it takes to sort through the per-segment aggregations. As was mentioned, I have some ideas on how to improve that, and I'm working hard on getting them into 0.20.

-shay.banon


Re: Performance killed when faceting on high cardinality fields

Otis Gospodnetic
Hey Shay,

Great to hear that's a known hotspot and looking forward to any improvements there!  So when is 0.20 coming?  Just kidding.

Maybe that hotspot was being hit because of new segments: yes, we had indexing going on while searching (we have to; documents are streaming in all the time, so we can't stop indexing). Maybe we could increase our index refresh interval and see if that improves the performance of queries with facets...
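A minimal sketch of bumping the refresh interval through the update-settings API (the index name and the 30s value are just examples):

curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index": { "refresh_interval": "30s" }
}'
# set it back to "1s" (the default) after the test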

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html



Re: Performance killed when faceting on high cardinality fields

kimchy
Administrator
I think the warmup option is the best one you have. The current state of 0.20 is that it's basically 0.19 plus warmers, and it's being used in production by several users (who needed the warmup option).
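A minimal sketch of registering a warmer that pre-loads the tags field data, assuming the 0.20 warmer API and illustrative index/field names:

curl -XPUT 'localhost:9200/myindex/_warmer/warm_tags_facet' -d '{
  "query": { "match_all": {} },
  "facets": {
    "tags": { "terms": { "field": "tags", "size": 10 } }
  }
}'
# the warmer runs this search on new segments before they become searchable,
# so real facet queries no longer pay the field-cache loading cost up front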


Re: Performance killed when faceting on high cardinality fields

Leonardo Menezes
Hey there,
    Any updates on this problem? I have the exact same thing happening. Comparing with Solr (comparable schema and same data) using facet.method=enum, I get about the same response time for Solr and ElasticSearch, but when using facet.method=fc, Solr is around 4x faster.
    Is there any work going on that addresses this? Otis, did you manage to find a workaround?

 Thanks,

Leo


Re: Performance killed when faceting on high cardinality fields

Ivan Brusic
Looking at the latest commits, there seem to have been numerous changes to the field data and the facets that use it. Hopefully they address this issue and something will be released soon.

Lucene 4.1 was just released. The next version of ElasticSearch is supposed to support Lucene 4, so things might be sidetracked in order to catch up. There are many new features under the hood; perhaps ElasticSearch will make use of them.

Cheers,

Ivan



Re: Performance killed when faceting on high cardinality fields

Otis Gospodnetic
In reply to this post by Leonardo Menezes
Hi Leo,

I have not looked into this further and have not noticed any changes that would improve this particular issue in ES.
In Lucene land the devs are going crazy improving faceting performance, but ES has its own faceting implementation, and Solr its own as well.

See:

Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html



Re: Performance killed when faceting on high cardinality fields

Ivan Brusic
Have you seen some of the latest commits?


There are no issues attached to these commits, so there is no telling what version they belong to.

-- 
Ivan



Re: Performance killed when faceting on high cardinality fields

Drew Raines-2
Ivan Brusic wrote:

> Have you seen some of the latest commits?
>
> https://github.com/elasticsearch/elasticsearch/commit/346422b74751f498f037daff34ea136a131fca89
>
> There are no issues attached to these commits, so there is no
> telling what version they belong to.

The goal is for the fielddata refactoring and Lucene 4.1 integration
to appear in 0.21.0.  Much of the work is already in master.

-Drew


Re: Performance killed when faceting on high cardinality fields

Leonardo Menezes
So... just to give an update on this. Reading the source code last night, we found a parameter that doesn't seem to be documented anywhere and that controls which faceting method is used for a given field. The parameter is called execution_hint and is used like this:

   "facets" : {
       "company" : {
           "terms" : {
               "field" : "current_company",
               "size" : 15,
"execution_hint":"map"
           }
       }
   }
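For reference, a minimal sketch of the full request this snippet would be embedded in (index name and query are illustrative):

curl -XPOST 'localhost:9200/myindex/_search' -d '{
  "query": { "match_all": {} },
  "facets": {
    "company": {
      "terms": {
        "field": "current_company",
        "size": 15,
        "execution_hint": "map"
      }
    }
  }
}'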

The process of choosing the faceting method happens in TermsFacetProcessor and is a bit different for strings than for other types. Anyway, after running some tests with this setting, our response time improved a LOT. Some numbers:

Index: 12MM documents
Field: string, multi-valued, with about 400k unique values
Document: has between 1 and 10 values for this field



Query #1 (matches 5000k documents)
- using "execution_hint": "map" - roughly 50 ms avg.
- not using it - roughly 600 ms avg.

Query #2 (match all, so 12MM documents)
- using "execution_hint": "map" - roughly 1.9 s avg.
- not using it - roughly 800 ms avg.



So, since our query pattern is really close to query #1, that made a big difference in our results. Hope that might be of some help for someone else.


Leonardo Menezes
(+34) 688907766



Re: Performance killed when faceting on high cardinality fields

Itamar Syn-Hershko
Is this against master or a previous version? And what about memory usage?



Re: Performance killed when faceting on high cardinality fields

Leonardo Menezes
We are running 0.20.1. Memory usage actually dropped, as did CPU usage; not really sure why that is. As mentioned before, depending on your query pattern, this setting may actually be counterproductive.

Also, without this option we were not really able to keep the cluster running for long; at some point things would slow down too much and the cluster would become unstable. We have only been running the cluster with live traffic since last night (before that it couldn't handle it), so if anything odd comes up I will update this, but at the moment everything looks good.

Leonardo Menezes
(+34) 688907766



Re: Performance killed when faceting on high cardinality fields

joergprante@gmail.com
In reply to this post by Leonardo Menezes
Hi,

Interesting... it looks like your system can fit the 5000k documents into the cache with "execution_hint": "map" without being hit seriously by GC. Without execution_hint: map, do you use soft references by any chance? That would explain the 600 ms: the extra time could come from your cache elements being invalidated.

Jörg
