Estimating field cache size for facets in advance


Andrew Clegg
I want to do some planning around how much cache memory it will take to facet over potentially a lot of records (millions, eventually billions).

These are mainly date histograms and term facets.

So, I have a few questions.

1. Is it correct to say that running a facet on a field causes every shard to load *all* the values for that field into memory? Before any facet filters are applied?

2. What factors affect the memory consumed when this happens? Is it: number of documents in the shard, number of distinct values in that field, something else?

3. Is there a formula for calculating/estimating the overall usage? (FieldDataLoader is a bit opaque if you're not a Lucene specialist.)

4. Is the document type taken into account anywhere in this process? Or is the data loading done across all types in the index?

Let me go into 4 in a little more detail. Our index contains a large number of different types (around a hundred, I think) which have most of the same fields in common. If someone runs a facet on one type, will the data for that field across all types get loaded?

If that's the case, are we perhaps better off having a separate *index* for each type?

Thanks in advance,

A.


Re: Estimating field cache size for facets in advance

joergprante@gmail.com
Would love to see answers to these questions too.

An important feature for ES would be graceful rejection of faceting over a field, by precomputing the memory consumption, to prevent OOMs. Right now ES throws an OOM if faceting fails, but will not automatically recover the index from that state (only a manual cluster restart helps).

Jörg


Re: Estimating field cache size for facets in advance

Andrej
Yeah, that would be interesting, especially since OOM is a real problem for us too at the moment. So knowing whether changing the cache type or heap size would help would definitely be a benefit (at least as an estimate, maybe?).

One interesting thing came up while playing with cache settings. I can set the expiration time using curl and everything is fine:

curl -XPUT host:port/_settings -d '{ "index" : { "cache.field.expire" : "10m" } }'

But after trying to set the default value again (curl -XPUT host:port/_settings -d '{ "index" : { "cache.field.expire" : "-1" } }') I get an error that includes the following message:

Caused by: java.lang.IllegalArgumentException: duration cannot be negative: -1000000 NANOSECONDS

Is there a bug in parsing the argument, or am I doing something wrong?



Re: Estimating field cache size for facets in advance

Andrew Clegg
In reply to this post by joergprante@gmail.com
This is the third time I've tried to write this reply, thanks Google Groups.

I think you'd have to iterate over the field data twice, once to construct the estimate, and once again to load the data, so it might slow things down. And really it'd be meaningless unless you ran a GC first, as there's no way to know how much memory is *potentially* available until after a GC. So you'd have to have a user-specified limit.

Would this be a really silly idea:

Wrap the whole FieldDataLoader#load method in a try/catch for OutOfMemoryError.

Then if you get one, do an immediate GC (in the catch block so all the local variables are out of scope).

Then throw an IOException: "Unable to load field data: out of heap space" instead.

Is that crazy? It kinda sounds crazy, but no worse than being able to take down a node with a single bad facet.
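
Something along these lines is roughly what I'm picturing -- purely a sketch, and the class name and load() signature below are made up for illustration rather than being the real FieldDataLoader internals:

import java.io.IOException;

public class GuardedFieldDataLoader {

    public Object loadFieldData(String field) throws IOException {
        try {
            return load(field); // whatever the real per-segment loading does
        } catch (OutOfMemoryError e) {
            // The catch block holds no references to the half-built arrays,
            // so an immediate GC can reclaim them before we report failure.
            System.gc();
            throw new IOException("Unable to load field data for [" + field
                    + "]: out of heap space");
        }
    }

    private Object load(String field) {
        // Stand-in for the real field data loading logic.
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}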

In answer to my own questions 1 and 4: I'm now 99% sure that filters and document type are irrelevant when loading field data into the cache, so faceting really will cause you to load all the field values across all the types in your index.

(Can anyone confirm/deny please?)



Re: Estimating field cache size for facets in advance

revdev
Would love to see an answer for this. Thanks for the detailed question.



Re: Estimating field cache size for facets in advance

Andrew Clegg
We discussed this briefly after the ES training course in London a
couple of months ago.

If I understood Shay correctly, here's the rough memory usage (in bytes).

For single-valued fields:

4m + (4n * avg(term length in chars)) [string fields]

4m + (n * term size in bytes) [numeric fields]

For multi-valued fields:

(4m * max(num terms in doc)) + (4n * avg(term length in chars)) [string fields]

(4m * max(num terms in doc)) + (n * term size in bytes) [numeric fields]

Where m is the number of documents and n is the number of distinct terms in the field.
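
To make that concrete, here's a quick back-of-envelope calculator -- my own sketch of the formulas above, not anything taken from the ES code, and the example numbers are invented:

public class FieldCacheEstimate {

    // Single-valued string field: 4m + (4n * avg term length in chars)
    static long singleValuedString(long docs, long terms, double avgTermChars) {
        return 4L * docs + Math.round(4L * terms * avgTermChars);
    }

    // Single-valued numeric field: 4m + (n * term size in bytes)
    static long singleValuedNumeric(long docs, long terms, int termBytes) {
        return 4L * docs + terms * (long) termBytes;
    }

    public static void main(String[] args) {
        // Example: 100M docs and a date field truncated to the minute over ~2 years,
        // i.e. 2 * 365 * 24 * 60 = 1,051,200 distinct terms stored as 8-byte longs.
        long bytes = singleValuedNumeric(100000000L, 1051200L, 8);
        System.out.printf("~%.1f MB%n", bytes / (1024.0 * 1024.0));
        // Prints roughly 389.5 MB, dominated by the 4m term (4 bytes per document).
    }
}

Treat it as an order-of-magnitude guide only, since it's based on my recollection of Shay's explanation.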


Re: Estimating field cache size for facets in advance

revdev
Thanks for the formula!

So here "n" is the number of unique values that term can take? 
If thats so, then I would imagine that storing second level resolution for a date field would take lot of memory when building field cache. Do you have any experience with performance and capacity for storing dates?

Vinay


Re: Estimating field cache size for facets in advance

Andrew Clegg
Sorry, catching up on backlog here...

n is indeed the number of unique terms in the field you're caching.

And yes, you wouldn't want to load second-level resolution into the field data cache if you can avoid it (e.g. for sorting or faceting).

If we're planning to facet on a datetime field, we truncate it to the minute before indexing. (There's no reason you can't index two copies of the field, one for sorting/faceting and one for querying.)
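
For example, something like this on the indexing side -- a minimal sketch, and the field names ("timestamp", "timestamp_minute") are just placeholders I've made up:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateTruncation {
    public static void main(String[] args) {
        Date now = new Date();

        SimpleDateFormat full = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
        SimpleDateFormat toMinute = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm");
        full.setTimeZone(TimeZone.getTimeZone("UTC"));
        toMinute.setTimeZone(TimeZone.getTimeZone("UTC"));

        // "timestamp" keeps full resolution for querying;
        // "timestamp_minute" is the low-cardinality copy to sort/facet on.
        System.out.println("timestamp:        " + full.format(now));
        System.out.println("timestamp_minute: " + toMinute.format(now) + ":00");
    }
}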


Re: Estimating field cache size for facets in advance

Otis Gospodnetic
Hi,

Hm, and I thought this was the trick from the days before trie-based date/time fields.
I just quickly grepped the ES code and didn't see them. Maybe ES doesn't support them yet?

Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

 

Re: Estimating field cache size for facets in advance

Curt Kohler
In reply to this post by Andrew Clegg
I was curious whether anyone knows if this formula is still valid for ElasticSearch 0.90.x?




Re: Estimating field cache size for facets in advance

Meghan Mahoney
To answer the original question 1: each shard will load the field you want to facet on into memory. See the memory considerations section of the terms facet documentation:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html#_memory_considerations_2