Indices size


Jérôme Gagnon
Good Morning,

In my renewed attempt to switch from Sphinx to ElasticSearch, I have found that the on-disk size of our ElasticSearch indices is about 4x that of our Sphinx files. The actual size is 82 GB for 166M documents, or about 2000 docs/MB; in Sphinx we were able to store about 8000 docs/MB. I'm a little worried about the I/O load such large files will put on my nodes' disks. I also found this (rather old) article, http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/ , which says that Lucene indices are smaller than Sphinx ones. Does anyone have any idea why ElasticSearch indices would be that much bigger than Sphinx (and Lucene) ones?
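As a quick sanity check on the figures above, the arithmetic can be sketched as follows. It assumes that "gb" and "mb" are binary units (1 GB = 1024 MB), which the post does not state.

```python
# Figures quoted in the post; binary units (1 GB = 1024 MB) are an assumption.
index_size_gb = 82
doc_count = 166_000_000
sphinx_docs_per_mb = 8000

index_size_mb = index_size_gb * 1024
es_docs_per_mb = doc_count / index_size_mb

# Average on-disk footprint of one document in each engine.
es_bytes_per_doc = index_size_mb * 1024 * 1024 / doc_count
sphinx_bytes_per_doc = 1024 * 1024 / sphinx_docs_per_mb

# How many times denser the Sphinx files are.
ratio = sphinx_docs_per_mb / es_docs_per_mb

print(f"ElasticSearch: {es_docs_per_mb:.0f} docs/MB, {es_bytes_per_doc:.0f} bytes/doc")
print(f"Sphinx:        {sphinx_docs_per_mb} docs/MB, {sphinx_bytes_per_doc:.0f} bytes/doc")
print(f"Sphinx is about {ratio:.1f}x denser")
```

Under that assumption the numbers agree with the post: roughly 2000 docs/MB and a factor of about 4.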

I already have
"_all" : {"enabled" : false}
"_source" : {"enabled" : false}
and I'm storing only 4 fields: 3 longs and 1 integer.
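For readers who want to see where those two settings sit, here is a minimal sketch of an index-creation body with them in place. The type name "doc" and the four field names are hypothetical (the post gives neither), and the "store": "yes" spelling is an assumption about the mapping syntax of that Elasticsearch generation, so treat this as an illustration rather than the poster's actual mapping.

```python
import json

# Hypothetical type name and field names; the post only says
# "3 long and 1 integer" and that _all and _source are disabled.
mapping = {
    "mappings": {
        "doc": {
            "_all": {"enabled": False},
            "_source": {"enabled": False},
            "properties": {
                "field_a": {"type": "long", "store": "yes"},
                "field_b": {"type": "long", "store": "yes"},
                "field_c": {"type": "long", "store": "yes"},
                "field_d": {"type": "integer", "store": "yes"},
            },
        }
    }
}

# This JSON would be the body of the index-creation request.
print(json.dumps(mapping, indent=2))
```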

The biggest files are .frq and .tis, as this listing of part of my index shows:

 453M -rw-r--r-- 1 root root  453M 2012-10-19 11:28 _nc.fdt
  52M -rw-r--r-- 1 root root   52M 2012-10-19 11:28 _nc.fdx
 4.0K -rw-r--r-- 1 root root   204 2012-10-19 11:28 _nc.fnm
 1.8G -rw-r--r-- 1 root root  1.8G 2012-10-19 11:39 _nc.frq
  32M -rw-r--r-- 1 root root   32M 2012-10-19 11:39 _nc.nrm
 300M -rw-r--r-- 1 root root  300M 2012-10-19 11:39 _nc.prx
 820K -rw-r--r-- 1 root root  818K 2012-10-19 16:30 _nc_t4.del
 7.3M -rw-r--r-- 1 root root  7.3M 2012-10-19 11:39 _nc.tii
 598M -rw-r--r-- 1 root root  598M 2012-10-19 11:39 _nc.tis
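To make the listing easier to read, the sketch below ranks the files by size and labels each extension with its role in the Lucene 3.x index format. The sizes are copied from the listing; the helper names are made up for this example.

```python
# Apparent sizes copied from the listing above (second size column).
FILES = {
    "_nc.fdt": "453M",
    "_nc.fdx": "52M",
    "_nc.fnm": "204",
    "_nc.frq": "1.8G",
    "_nc.nrm": "32M",
    "_nc.prx": "300M",
    "_nc_t4.del": "818K",
    "_nc.tii": "7.3M",
    "_nc.tis": "598M",
}

# Role of each extension in the Lucene 3.x file format.
ROLES = {
    "fdt": "stored field data",
    "fdx": "stored field index",
    "fnm": "field names",
    "frq": "postings (doc IDs and term frequencies)",
    "nrm": "norms",
    "prx": "term positions",
    "del": "deleted documents",
    "tii": "term dictionary index",
    "tis": "term dictionary",
}

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert an `ls -h` style size such as '1.8G' or '204' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


sizes = {name.rsplit(".", 1)[1]: parse_size(size) for name, size in FILES.items()}
total = sum(sizes.values())
ranked = sorted(sizes, key=sizes.get, reverse=True)

for ext in ranked:
    share = 100 * sizes[ext] / total
    print(f".{ext:<4}{sizes[ext] / UNITS['M']:9.1f} MB {share:5.1f}%  {ROLES[ext]}")
```

On these figures the postings (.frq) and the term dictionary (.tis) together make up roughly three quarters of the listed files, while the stored fields (.fdt) are a much smaller share.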







Re: Indices size

Igor Motov-3
Have you deleted a lot of documents from elasticsearch and reindexed them? Could you run

curl -XPOST 'http://localhost:9200/your-index/_optimize?only_expunge_deletes=true'

on your index and see whether it reduces the index size?
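The same call can be prepared from a script. This is only a sketch: the helper name is made up, the host and index name are the placeholders from the curl command, and the request is built but not sent, so it runs without a live node.

```python
from urllib.request import Request


def build_expunge_request(base_url, index):
    """Build (but do not send) the POST request from the curl command above."""
    url = f"{base_url}/{index}/_optimize?only_expunge_deletes=true"
    # An empty body plus an explicit method mirrors `curl -XPOST`.
    return Request(url, data=b"", method="POST")


req = build_expunge_request("http://localhost:9200", "your-index")
print(req.get_method(), req.full_url)

# To actually send it against a running node:
#   from urllib.request import urlopen
#   print(urlopen(req).read().decode("utf-8"))
```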

Which version of elasticsearch are you using? 

On Monday, October 22, 2012 10:24:22 AM UTC-4, Jérôme Gagnon wrote:

Re: Indices size

Stéphane Raux
Hi,

When you index a numeric field, Lucene actually indexes each value at
several precisions in order to speed up range queries:
http://lucene.apache.org/core/old_versioned_docs/versions/2_9_0/api/all/org/apache/lucene/document/NumericField.html

You can reduce this overhead, or effectively disable it, with the
'precision_step' parameter in your mapping: a larger step means fewer
extra terms per value, at the cost of slower range queries.
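A rough model of that scheme: for a value of a given bit width, one term is indexed for each shift 0, step, 2*step, ... below the bit width, each term keeping only the bits above the shift. The sketch below is a simplified illustration for non-negative values, not Lucene's actual encoding (which also handles the sign bit and prefix-codes each term); the function names are made up. Lucene's NumericField defaults to a precision step of 4.

```python
def trie_shifts(bits, precision_step):
    """Shifts at which a value of the given bit width is indexed."""
    return list(range(0, bits, precision_step))


def trie_prefixes(value, bits, precision_step):
    """(shift, prefix) pairs for a non-negative value; simplified model."""
    return [(shift, value >> shift) for shift in trie_shifts(bits, precision_step)]


# Terms indexed per long (64-bit) and per integer (32-bit) value.
for step in (4, 8, 16, 64):
    per_long = len(trie_shifts(64, step))
    per_int = len(trie_shifts(32, step))
    print(f"precision_step={step:>2}: {per_long:>2} terms per long, {per_int} per integer")

# The prefixes generated for one sample value at step 16.
print(trie_prefixes(1000, 64, 16))
```

In this model a step of 4 gives 16 terms for every long value, while a step at or above the bit width gives a single term per value, which is why the parameter affects the size of the term dictionary and postings.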

Hope that helps,

Stéphane

2012/10/22 Igor Motov:


Re: Indices size

Jérôme Gagnon
I tried the optimize call; it did next to nothing to the size. I then upgraded to 0.20.RC1, removed frequencies on some fields, and played with precision_step on low-cardinality fields, but I think there are only peanuts to be won with precision_step.
