Index size improvements in 0.90?

Index size improvements in 0.90?

Ivan Brusic
I finally finished a grueling upgrade of my local code from Lucene 3.6 to 4.3. I don't use elasticsearch for everything and still have a fair amount of Lucene code. You name it, I have a custom class for it.

With the new Lucene jars in place, I was finally able to upgrade elasticsearch from 0.20.0 to 0.90.1 (Lucene class conflicts being the obstacle). So far my Lucene code has produced much smaller indices, which I'm still testing. My elasticsearch-specific code has not changed (besides fixing API changes), and neither has my configuration. I do some pre-tokenization on the client side for various reasons, but elasticsearch does the bulk of the analysis. The resulting test index is one third of the original size:

size: 15.8gb (15.8gb)
docs: 8711039 (8711039)

size: 5.2gb (5.2gb)
docs: 8757039 (8757039)

I did disable timestamps (elasticsearch bug which I will fix), but everything else is the same. A two-thirds reduction scares me a bit. Has anyone seen such a dramatic reduction in index size?
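
As a quick sanity check on the figures above, here is a minimal Java sketch that works out the size ratio and the average bytes per document. The class and method names are made up for illustration, and it assumes "gb" in the stats output means 1024^3 bytes.

```java
public class IndexSizeCheck {
    /** Fraction of the old index size that the new index occupies. */
    public static double sizeRatio(double oldGb, double newGb) {
        return newGb / oldGb;
    }

    /** Average on-disk bytes per document, assuming 1gb = 1024^3 bytes. */
    public static double bytesPerDoc(double sizeGb, long docs) {
        return sizeGb * 1024 * 1024 * 1024 / docs;
    }

    public static void main(String[] args) {
        // Figures copied from the two stats listings in this post.
        double oldGb = 15.8, newGb = 5.2;
        long oldDocs = 8711039L, newDocs = 8757039L;

        double ratio = sizeRatio(oldGb, newGb);
        System.out.printf("new/old size ratio: %.3f%n", ratio);
        System.out.printf("reduction: %.1f%%%n", (1 - ratio) * 100);
        System.out.printf("bytes per doc: %.0f -> %.0f%n",
                bytesPerDoc(oldGb, oldDocs), bytesPerDoc(newGb, newDocs));
    }
}
```

The ratio comes out at roughly 0.33, which matches the "one third" estimate even though the newer index holds slightly more documents.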

Cheers,

Ivan

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Index size improvements in 0.90?

InquiringMind
Hi, Ivan!

The vast majority of code changes that I needed to make had to do with the getters. For example, the (cleaner, IMHO) tokens method was gone and the getTokens method was called instead.

The truly vexing change had to do with facets, and once I figured it out the changes were rather simple. But until then, I almost wore out Google! For example:

First a change to the import:

-import org.elasticsearch.search.facet.AbstractFacetBuilder;

+import org.elasticsearch.search.facet.FacetBuilder;

And a change to the object class returned:

-  public AbstractFacetBuilder getFacetRequest();

+  public FacetBuilder getFacetRequest();

Once those changes had rippled through the abstract base class and my three derived classes (including one that implements a true multi-field hierarchy!), everything worked fine.

And yes, the indices were one-half to one-third the size when rebuilt. I remembered something about compression being the default in 0.90.0, and I never added compression options to my 0.20.4 indices.


Brian

In reply to Ivan Brusic's post of Friday, June 14, 2013, 7:14:27 PM UTC-4.

Re: Index size improvements in 0.90?

cthoma
In reply to this post by Ivan Brusic
After your post I ran a test using the Play Framework with the play2-elasticsearch plugin, plus one remote elasticsearch node for each version.
That setup lets you switch between 0.20.5 and 0.90.1 in seconds.
I tested with a class that contains only a long string. And yes, the newer Lucene index is much smaller:
90M against 900M. The search results are the same.
My ES mapping is trivial:
{
  "indexTest": {
    "properties": {
      "name": {
        "type": "string",
        "store": "yes",
        "index": "analyzed",
        "null_value": "na"
      }
    }
  }
}

I hope I didn't miss anything, but I don't think so.



Re: Index size improvements in 0.90?

cthoma
In reply to this post by Ivan Brusic
I think the following link is interesting and explains a lot:


Re: Index size improvements in 0.90?

Adrien Grand-2
Hi,

Indeed, Lucene 4.3 has much smaller indices, especially if you have small or easily-compressible documents (cthoma's link) and if you have highly frequent terms[1].
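
To give a feel for why easily-compressible documents shrink so much, here is a minimal sketch that compresses a highly repetitive string and compares sizes. It uses `java.util.zip.Deflater` from the JDK purely as a stand-in; Lucene ships its own stored-fields compression, so the numbers this prints say nothing precise about a real index. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CompressibilityDemo {
    /** Returns the DEFLATE-compressed length of the given text, in bytes. */
    public static int compressedLength(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();

        // Drain the compressor in chunks; only the total length is needed.
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // A long, repetitive "document", similar in spirit to a single long string field.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("the quick brown fox jumps over the lazy dog ");
        }
        String doc = sb.toString();

        int raw = doc.getBytes(StandardCharsets.UTF_8).length;
        int packed = compressedLength(doc);
        System.out.printf("raw=%d bytes, compressed=%d bytes%n", raw, packed);
    }
}
```

Repetitive text like this compresses to a small fraction of its raw size, while random or already-compressed data would barely shrink at all.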


Re: Index size improvements in 0.90?

Ivan Brusic
I am on the Lucene mailing list (I rarely post, though) and subscribe to Mike's and Adrien's blog feeds, but nowhere have I seen comments about how dramatic the reduction in index size can actually be.

I have been running without stored fields and with compressed source for a while, so I assumed the new compression scheme in Lucene 4 would not offer much savings. Then again, I was not really after a smaller index (although it helps with the IO cache). The actual reason for the elasticsearch upgrade is better cache management, since the field cache exploded in size once I introduced nested documents.

A rough test with these new indices showed little difference in QPS, but the number of threads and the number of GCs went down dramatically. What I really wanted to monitor was the field cache usage, but that stat moved in 0.90 and I was not able to find it.

Great job by the Lucene and elasticsearch teams.

Cheers,

Ivan


In reply to Adrien Grand's post of Sun, Jun 16, 2013 at 4:31 AM.