Elasticsearch index MUCH larger than similar Lucene index

shlomivaknin
Hey,

We have some old Java code that uses Lucene and Grizzly to serve queries over text. We have two fields, a string field and a numeric (long) field. The indexing code is pretty straightforward.

I was trying to migrate this to Elasticsearch with a pretty simple configuration, and indexed the same data.

The Java-based implementation took about 6 GB, while Elasticsearch took 17 GB.

Does this make sense? What could I do about this?

Thanks!


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Itamar Syn-Hershko
Yes, because ES stores the entire source document (_source) by default.


shlomivaknin
Yes, so I tried excluding the source, but then queries didn't return anything besides the ID. In any case, even with the source disabled I still got a large index.

Is there any way to tell it to store just the fields?

Michael Sick

Do you have replication on?

Itamar Syn-Hershko
In reply to this post by shlomivaknin
You can disable storing the source, but then you have no stored fields unless you explicitly ask for them in the mapping. Also, with the source disabled, loading several fields from the store costs more.


Matt Weber-2
In reply to this post by shlomivaknin
Don't forget about the _all field. Also, if you don't store the source, you need to explicitly set "store" to "yes" in your field mappings so the fields can be returned in the results.
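
As a rough illustration of this advice, the snippet below builds a mapping with `_source` and `_all` disabled and every field marked as stored. It is only a sketch: it uses the legacy (0.90-era) mapping syntax discussed in this thread, the type name `test` and the fields `gram` and `freq` are borrowed from the mapping posted elsewhere in the thread, and `build_mapping` is a hypothetical helper, not part of any client library.

```python
import json


def build_mapping(type_name, fields):
    """Build a legacy-style mapping with _source and _all disabled.

    `fields` maps field name -> field type. Every field is marked as
    stored so it can still be returned once _source is gone.
    """
    properties = {
        name: {"type": field_type, "store": "yes"}
        for name, field_type in fields.items()
    }
    return {
        type_name: {
            "_source": {"enabled": False},
            "_all": {"enabled": False},
            "properties": properties,
        }
    }


# Hypothetical type and field names, taken from the mapping in this thread.
mapping = build_mapping("test", {"gram": "string", "freq": "long"})
print(json.dumps(mapping, indent=2))
```

The mapping has to be in place before any documents are indexed, otherwise the defaults apply to the data that is already there.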


shlomivaknin
Hey, 

Thanks all, let me reply:

Michael - no, I set replicas to 0 (if that's what you meant).

Itamar & Matt - I disabled _all and _source, and explicitly set "store" to "yes" for both fields (I don't care about performance for now). With this setting I still got a much larger index, and I was still unable to see the fields through queries (I only got IDs back), even though I set store to yes.

shlomivaknin
Here is a fraction of the mapping I have (I use Clojure, so it's a bit different from JSON, but it's essentially the same):

    {:test {:_source    {:enabled "false"}
            :_all       {:enabled "false"}
            :properties {:gram {:type "string" :store "yes" :analyzer :ngram-index :compress "true"}
                         :freq {:type "long" :store "yes"}}}}

shlomivaknin
Does ES store its numeric fields as strings?

Can someone confirm that if you disable _source and keep each field stored and indexed, your fields become invisible (although queryable)? Or am I doing something totally wrong?

Thanks

Michael Sick
Not sure about the storage. You might try Luke (https://code.google.com/p/luke/) and Skywalker (https://github.com/jprante/elasticsearch-skywalker#readme) to look inside your indices. I have not used either, but had them bookmarked for just such an occasion.


shlomivaknin
Hey Michael

I managed to find my fields (I had to ask for them explicitly). The question remains, though: why is the database so big? I'll give Skywalker a chance and see whether it sheds some light on the situation.

The weird thing is that even though I disabled _source and _all, the size remained the same: 17 GB instead of 7 GB. That's a lot of wasted space.

If anyone has any more ideas why Elasticsearch is so ridiculously large compared to straightforward Lucene, I am very interested to hear them.
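
For readers following along, asking for stored fields explicitly could look roughly like the request body below. This is a sketch under assumptions: it relies on the `fields` list that the search API accepted in Elasticsearch versions of this era (later releases changed how stored fields are requested, so check the documentation for your version), the field names `gram` and `freq` come from the mapping in this thread, and `search_body` is a hypothetical helper.

```python
import json


def search_body(query, fields):
    """Build a search request body that asks for specific stored fields.

    Without _source, hits carry only metadata such as the ID unless the
    request names the stored fields it wants back.
    """
    return {"query": query, "fields": list(fields)}


# match_all is used only to keep the example small.
body = search_body({"match_all": {}}, ["gram", "freq"])
print(json.dumps(body))
```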

Mark Harwood
How many shards do you have?

A multi-sharded ES index will cost you more than a single Lucene index, due to duplication of index terms and worse postings compression when the same content is split across many Lucene indexes.
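
The term duplication can be seen in a toy model. The sketch below is not how Lucene stores its term dictionary; it only counts distinct terms per shard, assuming whitespace tokenization and round-robin routing of documents (real Elasticsearch routes by a hash of the document ID). The sample documents and the function name are made up for illustration.

```python
def term_dictionary_entries(docs, num_shards):
    """Count term dictionary entries summed over all shards.

    Each shard is modelled as a set of distinct terms. A term that occurs
    in documents on several shards is counted once per shard.
    """
    shards = [set() for _ in range(num_shards)]
    for doc_id, text in enumerate(docs):
        # Round-robin routing: an assumption made for this sketch.
        shards[doc_id % num_shards].update(text.split())
    return sum(len(terms) for terms in shards)


docs = ["red fox", "red dog", "blue fox", "blue dog", "red cat"]
single = term_dictionary_entries(docs, 1)   # one Lucene index
sharded = term_dictionary_entries(docs, 5)  # five shards
print(single, sharded)  # prints "5 10"
```

With five shards, each of the five documents lands in its own shard, so the common terms are stored again in every shard that sees them.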


David G Ortega
In reply to this post by shlomivaknin
I had the same size issue, and the cause was exactly what the others have pointed out: _all and _source were enabled, which makes sense. The only other thing I can think of, and it is so silly that I am almost ashamed to ask, is: "Have you deleted the index before applying the mapping?"

I also see that you are using ngrams. I suppose you use them in the vanilla Lucene index as well?


Re: elasticsearch index MUCH larger then similar lucene index

joergprante@gmail.com
In reply to this post by shlomivaknin
Please note, Skywalker needs an update for 0.90 - it is still on 0.20.
The update is in progress.

Jörg

On 22.05.13 at 15:49, Shlomi wrote:
> I managed to find my fields (had to manually ask for them). The
> question remains though, why is the database so big. ill give
> skywalker a chance, see maybe it will shed some light on this situation..


Re: elasticsearch index MUCH larger then similar lucene index

joergprante@gmail.com
In reply to this post by shlomivaknin
You are using the ngram tokenizer, which explodes index size. If you use
the ES default sharding, you have 5 shards (and therefore 5 Lucene
indexes). With ngram, your tokens are scattered over all shards, and the
total converges to 5x the space of a single shard.

Also, store = yes on each field is kind of clumsy. You have to enable
each field to get it returned by a query (only _source is returned by
default). I don't see much sense in storing an ngram-analyzed field.
Can you elaborate?

Jörg
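
To illustrate the point about stored fields: with _source disabled, a search returns only ids unless the request names the stored fields it wants. Below is a minimal Python sketch of such a request body. It assumes a 0.90-era release, where the request-body key is "fields", and it uses the field names "gram" and "freq" from the mapping posted earlier in the thread. The helper name and the query text are made up for illustration.

```python
import json


def build_search_body(query_text, stored_fields):
    """Build a search body that asks for stored fields by name.

    Assumption: a 0.90-era cluster, where the request-body key for
    stored fields is "fields". The helper itself is hypothetical.
    """
    return {
        "query": {"match": {"gram": query_text}},
        # Without this list (and with _source disabled), hits carry only _id.
        "fields": list(stored_fields),
    }


body = build_search_body("hello world", ["gram", "freq"])
print(json.dumps(body, indent=2, sort_keys=True))
```

The printed JSON can be sent as the body of a search request against the index; each hit should then carry the requested stored values next to its id.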

On 22.05.13 at 11:08, Shlomi wrote:

> does ES store its numeric fields as strings?
>
> can someone confirm that if you disable _source and keep each field as
> stored and indexed, your fields becomes invisible (although
> queriable)? or am i doing something totally wrong?..
>
> Thanks
>
> On Tuesday, May 21, 2013 7:10:07 PM UTC+3, Shlomi wrote:
>
>     here is a fraction of the mapping i have (i use clojure so its a
>     bit different from json, but its essentially the same):
>
>                {:test  {
>                          :_source {:enabled "false" }
>                          :_all    {:enabled "false" }
>                          :properties {:gram  {:type "string" :store "yes" :analyzer :ngram-index :compress "true"}
>                                       :freq  {:type "long" :store "yes"} }}}]
>
>     On Tuesday, May 21, 2013 7:07:44 PM UTC+3, Shlomi wrote:
>
>         Hey,
>
>         thanks all, let me reply:
>
>         Michael - no, i set replicas to 0 (if that what you meant..)
>
>         Itamar & Matt - i disabled _all and _source, and explicitly
>         set "store" to "yes" for both fields (i dont care about perf
>         for now..) - with this setting i still got a much larger size
>         and was still unable to see the fields (although i set store
>         to yes) through queries (only got id's back)

Re: elasticsearch index MUCH larger then similar lucene index

shlomivaknin
Hey,

Thanks for replying. "ngram" is the name of the field, and its content is pre-computed:

Jörg - I think I might have misled you. I am not using the ngram tokenizer; ":ngram-index" is a custom analyzer that uses the "lowercase" tokenizer and a list of stopwords.

David - Thanks for the suggestion, but yes: my code fails if the index already exists before it runs, so I am sure the index was in fact deleted.

Mark - I tried with both a single shard and the default 5 shards. There was no difference in size (surprisingly).

Thanks for all your responses, but we have to keep thinking. :)
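
For readers following along, an analyzer like the one described (the "lowercase" tokenizer plus a stopword list) could be declared in the index settings roughly as sketched below in Python. This is not the actual configuration from this thread: the stopword list and the filter name "gram_stop" are hypothetical, and the shard and replica counts simply mirror the single-shard, zero-replica setup mentioned above.

```python
import json

# Hypothetical stopword list; the real one was not posted in the thread.
STOPWORDS = ["a", "an", "the"]

index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "analysis": {
                "filter": {
                    # "gram_stop" is an illustrative name for a stop filter.
                    "gram_stop": {"type": "stop", "stopwords": STOPWORDS},
                },
                "analyzer": {
                    # Same analyzer name that the mapping refers to.
                    "ngram-index": {
                        "type": "custom",
                        "tokenizer": "lowercase",
                        "filter": ["gram_stop"],
                    },
                },
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

Posting both this kind of settings document and the equivalent Lucene analyzer code would make it possible to check that the two sides really tokenize the same way.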


Re: elasticsearch index MUCH larger then similar lucene index

Matt Weber-2
Really, we are just shooting in the dark here because of the lack of information:

What version of ES? What version of Lucene? What do your Lucene index settings (tokenizer, analyzers, etc.) look like? Have you configured an ES mapping identical to what you use in Lucene? How are you measuring your index size? Have you tried indexing a single document in Lucene and ES and comparing the resulting index sizes?

Gist us your mapping (not the Clojure version), custom analyzer settings, index settings, etc., and we might be able to figure this out for you.

Thanks,
Matt Weber
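
As a rough idea of what the JSON form might look like, here is a Python sketch that transliterates the Clojure mapping fragment posted earlier in the thread and prints it as JSON. Only that fragment is known, so this is not the complete mapping. Two assumptions are made: the "false" strings are written as JSON booleans, and the "compress" option is copied as posted without checking that it is a valid field option.

```python
import json

# Transliteration of the Clojure fragment from the thread; the rest of
# the mapping (if any) was not posted, so this is only a sketch.
mapping = {
    "test": {
        "_source": {"enabled": False},
        "_all": {"enabled": False},
        "properties": {
            "gram": {
                "type": "string",
                "store": "yes",
                "analyzer": "ngram-index",
                # Copied as posted; not verified as a valid field option.
                "compress": "true",
            },
            "freq": {"type": "long", "store": "yes"},
        },
    }
}

print(json.dumps(mapping, indent=2))
```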
 



Re: elasticsearch index MUCH larger then similar lucene index

simonw-2
I suggest you provide your Lucene FieldTypes and your mapping, then run your indexing against Lucene and against a single-shard, no-replica Elasticsearch instance. Then optimize the index and provide the output of ls -al on the index directory. It would also be interesting to know what exactly "much larger" means.

simon
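
The size comparison suggested above can also be scripted. The sketch below, in Python, sums the file sizes under each index directory, which is roughly the total that ls -al would let you add up by hand. The function names are made up for illustration, and the directory paths you pass in depend entirely on your own installation.

```python
import os


def index_size_bytes(index_dir):
    """Return the total size in bytes of all regular files under index_dir."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def size_ratio(lucene_dir, es_dir):
    """Return how many times larger the ES directory is than the Lucene one."""
    lucene = index_size_bytes(lucene_dir)
    es = index_size_bytes(es_dir)
    return es / lucene if lucene else float("inf")
```

Calling size_ratio with the plain Lucene index directory and the single shard's index directory, after optimizing both, gives one number to report instead of "much larger".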


Re: elasticsearch index MUCH larger then similar lucene index

Ivan Brusic
Just wanted to add that I have always encountered the same issue with ElasticSearch. Indices are almost twice as big despite aggressive trimming. I have simply come to accept the issue as a fact and moved on. :)

-- 
Ivan


On Wed, May 22, 2013 at 12:35 PM, simonw <[hidden email]> wrote:
I suggest you provide your lucene FieldTypes and your mapping, run your indexing against lucene and a single shard no-replica Elasticsearch instance. Then optimize the index and provide the output of ls -al on the index directory. it would also be interesting what exactly is "much larger". 

simon


On Wednesday, May 22, 2013 8:27:05 PM UTC+2, Matt Weber wrote:
Really we are just shooting in the dark here because of lack of information:

What version of ES?  What version of lucene?  What does your lucene index settings (tokenizer, analyzers, etc) look like?  Have you configured an ES mapping identical to what you use in lucene?  How are you measuring your index size?  Have your tried indexing a single document in lucene and ES and comparing the resulting index size?

Gist us your mapping (not the clojure version) , custom analyzer settings, index settings, etc and we might be able to figure this out for you.

Thanks,
Matt Weber
 


On Wed, May 22, 2013 at 10:44 AM, Shlomi <[hidden email]> wrote:
Hey, 

Thanks for replying, ngram is the name of the field, and is pre-computed:

Jörg - I think i might have misled you, i am not using the ngram tokenizer, ":ngram-index" is a custom tokenizer that uses "lowercase" tokenizer, and a list of stopwords.

David - Thanks for the suggestion, but yeah, my code fails if the index exists before it runs, this way i am sure the index was in fact deleted..

Mark - I tried with both a single shard and the default 5 shards. there was no different in size (surprisingly.. )

thanks for all your responses, but we have to keep thinking.. :)

On Wednesday, May 22, 2013 5:22:53 PM UTC+3, Jörg Prante wrote:
You are using ngram tokenizer which explodes index size. If you use ES
default sharding, you have 5 shards (and therefore, 5 Lucene indexes).
With ngram, you have scattered tokens over all shards, and this
converges to 5x the space compared to 1 shard.

Also, store = yes for each field is kind of clumsy. You have to enable
each field to get them returned for a query (only _source is returned by
default). I don't see much sense in making an ngram analyzed field
stored. Can you elaborate?

Jörg

Am 22.05.13 11:08, schrieb Shlomi:

> does ES store its numeric fields as strings?
>
> can someone confirm that if you disable _source and keep each field as
> stored and indexed, your fields becomes invisible (although
> queriable)? or am i doing something totally wrong?..
>
> Thanks
>
> On Tuesday, May 21, 2013 7:10:07 PM UTC+3, Shlomi wrote:
>
>     here is a fraction of the mapping i have (i use clojure so its a
>     bit different from json, but its essentially the same):
>
>                {:test  {
>                          :_source {:enabled "false" }
>                          :_all    {:enabled "false" }
>                          :properties {:gram  {:type "string" :store
>     "yes" :analyzer :ngram-index :compress "true"}
>                                           :freq    {:type "long"
>     :store "yes"} }}}]
>
>     On Tuesday, May 21, 2013 7:07:44 PM UTC+3, Shlomi wrote:
>
>         Hey,
>
>         thanks all, let me reply:
>
>         Michael - no, i set replicas to 0 (if that what you meant..)
>
>         Itamar & Matt - i disabled _all and _source, and explicitly
>         set "store" to "yes" for both fields (i dont care about perf
>         for now..) - with this setting i still got a much larger size
>         and was still unable to see the fields (although i set store
>         to yes) through queries (only got id's back)
>
>         On Tuesday, May 21, 2013 7:03:19 PM UTC+3, Matt Weber wrote:
>
>             Don't forget about the _all field.  Also, if you don't
>             store the source, you need to explicitly set "store" to
>             yes on your field mappings so you can have them returned
>             in the results.
>
>
>             On Tue, May 21, 2013 at 8:59 AM, Shlomi
>             <[hidden email]> wrote:
>
>                 yes, so i was trying to exclude source, but then
>                 queries didn't return anything besides the id. but in any
>                 case, even disabling source still gave me a large index..
>
>                 any way to tell it to save just the fields?

Re: elasticsearch index MUCH larger then similar lucene index

Jérôme Gagnon
+1 on that, we couldn't do much about it, we just hope that this doesn't affect the disk IO performance...

On Thursday, May 23, 2013 10:34:38 AM UTC-4, Ivan Brusic wrote:
Just wanted to add that I always encountered the same issue with ElasticSearch. Indices are almost twice as big despite aggressive trimming. I have simply come to accept the issue as a fact and moved on. :)

-- 
Ivan


On Wed, May 22, 2013 at 12:35 PM, simonw <simon.w...@elasticsearch.com> wrote:
I suggest you provide your Lucene FieldTypes and your mapping, then run your indexing against Lucene and against a single-shard, no-replica Elasticsearch instance. Then optimize the index and provide the output of ls -al on the index directory. It would also be interesting to know what exactly "much larger" means.

simon
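
A rough sketch of one way to make that comparison repeatable, as an alternative to eyeballing `ls -al`: sum the file sizes in each index directory and group them by extension. The function names are hypothetical and the directories to pass in depend on your setup. Grouping by extension is only a convenience; for example, Lucene keeps stored field data in `.fdt`/`.fdx` files, so a large total there would point at stored fields rather than the inverted index.

```python
import os
from collections import defaultdict


def size_by_extension(index_dir):
    """Sum file sizes (in bytes) under index_dir, grouped by file extension."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            # Files without an extension (e.g. segments_N) go under "(none)"
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def report(index_dir):
    """Print per-extension totals, largest first, followed by the grand total."""
    totals = size_by_extension(index_dir)
    for ext, size in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{ext:10s} {size / 1024 ** 2:10.1f} MB")
    print(f"{'total':10s} {sum(totals.values()) / 1024 ** 2:10.1f} MB")
```

Running `report()` once on the plain Lucene index directory and once on the Elasticsearch shard's index directory, both after optimizing, should show which file types account for the difference.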


On Wednesday, May 22, 2013 8:27:05 PM UTC+2, Matt Weber wrote:
Really we are just shooting in the dark here because of lack of information:

What version of ES?  What version of Lucene?  What do your Lucene index settings (tokenizer, analyzers, etc.) look like?  Have you configured an ES mapping identical to what you use in Lucene?  How are you measuring your index size?  Have you tried indexing a single document in Lucene and ES and comparing the resulting index size?

Gist us your mapping (not the Clojure version), custom analyzer settings, index settings, etc. and we might be able to figure this out for you.

Thanks,
Matt Weber
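
As an illustration of the kind of JSON being asked for, here is a sketch that rebuilds the Clojure mapping fragment quoted in this thread as a create-index body. Several parts are assumptions rather than the poster's actual configuration: the stopword list is made up, the `ngram-index` analyzer definition is guessed from the description "lowercase tokenizer plus a list of stopwords", the filter name `gram_stop` is hypothetical, and the `:compress` option from the original fragment is left out because it is unclear whether it applies to a string field. The single-shard, zero-replica settings follow the suggestion made elsewhere in the thread.

```python
import json

# Hypothetical stopword list; the thread does not show the real one.
STOPWORDS = ["a", "an", "the"]


def build_index_body(stopwords):
    """Build a create-index body mirroring the mapping fragment from the thread."""
    return {
        "settings": {
            "number_of_shards": 1,      # single shard for a fair size comparison
            "number_of_replicas": 0,
            "analysis": {
                "filter": {
                    # Assumed stop filter carrying the custom stopword list
                    "gram_stop": {"type": "stop", "stopwords": list(stopwords)}
                },
                "analyzer": {
                    # Assumed definition of the custom ":ngram-index" analyzer
                    "ngram-index": {
                        "type": "custom",
                        "tokenizer": "lowercase",
                        "filter": ["gram_stop"],
                    }
                },
            },
        },
        "mappings": {
            "test": {
                "_source": {"enabled": False},
                "_all": {"enabled": False},
                "properties": {
                    "gram": {"type": "string", "store": "yes", "analyzer": "ngram-index"},
                    "freq": {"type": "long", "store": "yes"},
                },
            }
        },
    }


index_body = build_index_body(STOPWORDS)
print(json.dumps(index_body, indent=2))
```

Posting the printed JSON next to the Lucene FieldType definitions would let others compare the two configurations field by field.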
 


On Wed, May 22, 2013 at 10:44 AM, Shlomi <[hidden email]> wrote:
Hey, 

Thanks for replying, ngram is the name of the field, and is pre-computed:

Jörg - I think i might have misled you, i am not using the ngram tokenizer, ":ngram-index" is a custom analyzer that uses the "lowercase" tokenizer and a list of stopwords.

David - Thanks for the suggestion, but yeah, my code fails if the index exists before it runs, this way i am sure the index was in fact deleted..

Mark - I tried with both a single shard and the default 5 shards. there was no difference in size (surprisingly.. )

thanks for all your responses, but we have to keep thinking.. :)

