Mapping Size limitations

Mapping Size limitations

revdev
Hi,

I am using dynamic templates and am dealing with a couple of thousand such dynamic fields. Each of these dynamic fields is an object with 2 to 3 subfields of type "byte". My question is: is there any performance penalty to having a large mapping? Right now I have a couple of thousand fields in the mapping, but in the future it could grow to maybe 10k-20k fields. Will I see performance degradation with a large mapping, and if so, what will be affected? FYI, I am planning to use facets over these fields.
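For reference, the kind of dynamic template I am using looks roughly like this (the template and field names here are made up for illustration, not my real mapping):

{
  "mappings": {
    "doc": {
      "dynamic_templates": [
        {
          "byte_subfields": {
            "path_match": "dynamic_fields.*",
            "mapping": {
              "type": "object",
              "properties": {
                "value_a": { "type": "byte" },
                "value_b": { "type": "byte" }
              }
            }
          }
        }
      ]
    }
  }
}

Every new field name that matches the template adds another entry like this to the mapping, which is how it ends up with thousands of fields.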

Thanks!

Re: Mapping Size limitations

Chris Male
Hi,

The first thing that jumps out at me is: do you really need that many fields? Can you give us a bit of information about what you're doing, and maybe we can come up with a way to do it without 20,000 fields.


Re: Mapping Size limitations

joergprante@gmail.com
In reply to this post by revdev
It's nearly impossible to manage 20k fields. The reasons: each field consumes on the order of a few MB of resident memory in Lucene, and each field mapping creation causes cluster blocks and propagation of the mapping settings throughout the cluster nodes. Even if you manage to get 20k fields created, constructing a query over 20k fields, plus the lookup time for each field's settings and mapping, will eat up your performance.

Rule of thumb: facets are designed to perform well over a small number of fields with a high cardinality of values. They do not perform well over a high number of fields with low cardinality.
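To illustrate the access pattern facets are built for: one field holding many distinct values, covered by a single facet, for example a terms_stats facet (field names here are only illustrative):

{
  "query": { "match_all": {} },
  "facets": {
    "avg_rating_per_aspect": {
      "terms_stats": {
        "key_field": "aspect_name",
        "value_field": "rating"
      }
    }
  }
}

One field with 20k distinct values is far cheaper than 20k fields with a handful of values each.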

I would also be curious to learn about the scenario in which such a high number of fields is required.

Jörg


Re: Mapping Size limitations

Vinay-2
Thanks Jörg and Chris for the quick replies. Let me explain the situation, which should make clear what I am trying to accomplish. I am not giving the exact domain example, but a similar example in a different domain.

Scenario: Let's say I am storing a list of restaurant reviews from all over the web. Each review document can have the following fields:

review_id (long)
review_ratings (object)
          aspect_1_name (string) : rating (float)
          aspect_2_name (string) : rating (float)
          aspect_3_name (string) : rating (float)
         ...
date (datetime)

Requirement: The goal is to be able to calculate facets on "review aspects" and get the average rating given for each aspect across all reviews within a given period. In this case, aspects can be things like "visual appeal of dish", "taste", "smell", etc. Hypothetically, let's assume the number of aspects can grow to 20k. Note that a single review may have only a dozen aspects defined, but millions of reviews over a period might collectively have thousands of different aspects. So the mapping becomes huge because of the "review_ratings" object.

Now, to achieve this, I can use a histogram facet with the date field as the key and an aspect's rating field as the value. To get facets over, say, 100 aspects, I can create 100 facets, one for each aspect. So I will only be querying around 100 aspects at a time to get their average ratings.
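To be concrete, I was thinking of one date_histogram facet per aspect, roughly like this (untested sketch, the aspect field name is just an example):

{
  "query": { "match_all": {} },
  "facets": {
    "aspect_1": {
      "date_histogram": {
        "key_field": "date",
        "value_field": "review_ratings.aspect_1_name",
        "interval": "week"
      }
    }
  }
}

and then ~100 such facets in one request, one per aspect.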

Now that you know a sample scenario, can you tell me whether my approach is correct, or whether I am doing something fundamentally wrong?

Thanks a lot again for the help, guys!
Vinay

Re: Mapping Size limitations

es_learner
Do you already have some histogram data exhibiting the 20k spread? I can see the possibility of 20k aspects, and I can also see a very long tail. Can the aspects be sub-grouped into doc_types, thereby reducing the set of aspects for each sub-group? E.g. 'jasmine flavor' won't apply to fries or steaks but is fine for tea.

Re: Mapping Size limitations

revdev
You mean create a separate index type for each subgroup? I can do that, but it's the last resort I want to take, since it will require quite a lot of code to make things look seamless to outside clients for both querying and indexing.

Thanks for pitching that idea tho.



Re: Mapping Size limitations

Chris Male
Practically, I can't see how you're going to be able to support so many fields. Jörg is right: memory consumption is going to be immense, and things will get unmanageable.

What if, along the lines of what es_learner suggested, you have a document per aspect? Each document could consist of review_id, aspect_name, rating, and date. You could then filter on aspect_name to choose which aspects you want to facet on.
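As a rough sketch (field names are just placeholders), each document would look something like:

{ "review_id": 123, "aspect_name": "taste", "rating": 4, "date": "2012-11-10" }

and the average per aspect over time could then come from a date_histogram facet with a facet_filter on aspect_name:

{
  "query": { "match_all": {} },
  "facets": {
    "taste_over_time": {
      "date_histogram": {
        "key_field": "date",
        "value_field": "rating",
        "interval": "week"
      },
      "facet_filter": { "term": { "aspect_name": "taste" } }
    }
  }
}

That keeps the mapping to a handful of fields no matter how many aspects appear.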


Re: Mapping Size limitations

revdev
Hmm, let me tell you one thing I tried that did not work.

I changed my document from this:

review_id (long)
review_ratings (object)
          aspect_1_name (string) : rating (float)
          aspect_2_name (string) : rating (float)
          aspect_3_name (string) : rating (float)
         ...
date (datetime)

to

review_id (long)
review_ratings (Array)
          [
             {"name" : aspect_1_name (string), "rating": aspect_rating (float) },
             {"name" : aspect_2_name (string), "rating": aspect_rating (float) },
         ...
          ]
date (datetime)

Then I applied a facet filter to select aspects by their name, i.e. "facet_filter" : { "term" : { "name" : "aspect_1_name" } }, but the result was computed by taking the mean over all elements in the "review_ratings" array for any review that had "aspect_1_name" anywhere in that array.

For example, take this rating_array:
{
  "rating_array" : [
    {"name" : "a", "rating" : 3},
    {"name" : "b", "rating" : 4},
    {"name" : "c", "rating" : 5}
  ]
}
If I filter by "name" : "b", ES calculates the total as 12 and the mean as 12/3, rather than just returning a total of 4 and a mean of 4.

Can I do something that will compute the result from only the array elements whose name is "aspect_1_name"? Or am I doing something wrong? :)
Thanks!

Re: Mapping Size limitations

es_learner
Is your (new) review_ratings array a nested type?

http://www.elasticsearch.org/guide/reference/mapping/nested-type.html
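If it isn't, the mapping would need something along these lines (untested sketch, field names taken from your example):

{
  "review": {
    "properties": {
      "review_ratings": {
        "type": "nested",
        "properties": {
          "name": { "type": "string", "index": "not_analyzed" },
          "rating": { "type": "float" }
        }
      }
    }
  }
}

and then, if I'm reading the docs right, the facet can be run in the nested scope so the filter only matches the individual nested doc rather than the whole review, something like:

{
  "query": { "match_all": {} },
  "facets": {
    "aspect_1_avg": {
      "statistical": { "field": "review_ratings.rating" },
      "nested": "review_ratings",
      "facet_filter": { "term": { "review_ratings.name": "aspect_1_name" } }
    }
  }
}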

Re: Mapping Size limitations

revdev
Not now, but that might be the solution I was looking for! :)
Let me experiment with nested types.
Thanks a lot, es_learner!
