Different Index sizes for same data

Different Index sizes for same data

Lavesh Gupta
Hi Everyone,

I am using the following configuration

2 Nodes, Number of Shards: 4, Number of Replicas: 0


I am currently indexing 50,000 (50K) files, about 6 GB in total, using pyelasticsearch.

For indexing I am increasing the number of threads from 1 to 8, and each run produces an index of a different size.


Num Threads    Time taken for indexing    Index size on node 1    Index size on node 2
1              4069.559 s                 3.50 GB                 3.22 GB
2              2236.544 s                 4.61 GB                 4.54 GB
4              1990.098 s                 5.45 GB                 5.31 GB
8              1965.987 s                 2.94 GB                 2.96 GB
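To make the table easier to compare, the combined index size per run and the speedup relative to the single-threaded run can be worked out as below. This is only a sketch that re-enters the figures from the post; the names `runs` and `summarize` are illustrative and not part of the original setup.

```python
# Figures copied from the table in the post:
# threads -> (indexing time in seconds, size on node 1 in GB, size on node 2 in GB)
runs = {
    1: (4069.559, 3.50, 3.22),
    2: (2236.544, 4.61, 4.54),
    4: (1990.098, 5.45, 5.31),
    8: (1965.987, 2.94, 2.96),
}


def summarize(runs):
    """Return {threads: (total_size_gb, speedup_vs_one_thread)}."""
    baseline = runs[1][0]  # indexing time of the single-threaded run
    return {
        threads: (round(node1 + node2, 2), round(baseline / seconds, 2))
        for threads, (seconds, node1, node2) in runs.items()
    }


if __name__ == "__main__":
    for threads, (total_gb, speedup) in sorted(summarize(runs).items()):
        print("%d thread(s): %.2f GB total, %.2fx the 1-thread speed"
              % (threads, total_gb, speedup))
```

The combined size ranges from 5.90 GB (8 threads) to 10.76 GB (4 threads) for the same input, so the spread is large even though both nodes stay roughly balanced within each run.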


The mapping I am using is
dtype: {
    "_source": {"enabled": False},
    "_all": {"enabled": False},
    "properties": {
        "filecontent": {"type": "string", "store": False},
        "filename": {"type": "string", "index": "not_analyzed", "store": True},
        "filepath": {"type": "string", "index": "not_analyzed", "store": True},
        "filetype": {"type": "string", "index": "not_analyzed", "store": True},
        "tokens": {"type": "string", "store": True},
        "rules": {"type": "string", "store": True}
    }
}

In the field "filecontent" I pass the text extracted from the file using Tika.
In the field "tokens" I store values that I extract from that text with my regex, and based on those values I populate the field "rules".



My question is: why does the size of the resulting index differ when I only change the number of threads sending indexing requests?

Please note: after indexing has completed, I let ES cool down so that segment merging can finish.


Please let me know what causes the discrepancy in index size.


Thanks,
Lavesh


This message contains confidential information and is intended only for the individual to whom it is addressed. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and permanently delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, late or incomplete, or could contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. If verification is required, please request a hard-copy version from the sender. Druva, www.druva.com

--
Please update your bookmarks! We have moved to https://discuss.elastic.co/
---
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/122a8818-5217-41f2-ab65-316191f1aa7e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: Different Index sizes for same data

Nikolas Everett

It's hard to tell exactly what you did, so there could be lots of reasons for the differences. One of them is that indexing isn't deterministic: merges are triggered based on the on-disk shape of the segments and on the number of other merges running, and they typically run asynchronously from your index commands.
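One way to look at the on-disk shape is to list the segments of each shard. The sketch below assumes the Elasticsearch 1.x segments API (`GET /<index>/_segments`) and its response layout (`indices`, then `shards`, then `segments` with `num_docs`, `deleted_docs` and `size_in_bytes`). The function names, host and index name are placeholders, not something from the original post.

```python
import json
from urllib.request import urlopen


def summarize_segments(response):
    """Summarize a parsed _segments response per index.

    Returns {index_name: {"segments": n, "docs": n, "deleted_docs": n, "bytes": n}}.
    Assumes the ES 1.x layout: indices -> shards -> [shard copies] -> segments.
    """
    summary = {}
    for index_name, index_info in response.get("indices", {}).items():
        totals = {"segments": 0, "docs": 0, "deleted_docs": 0, "bytes": 0}
        for shard_copies in index_info.get("shards", {}).values():
            for shard_copy in shard_copies:
                for segment in shard_copy.get("segments", {}).values():
                    totals["segments"] += 1
                    totals["docs"] += segment.get("num_docs", 0)
                    totals["deleted_docs"] += segment.get("deleted_docs", 0)
                    totals["bytes"] += segment.get("size_in_bytes", 0)
        summary[index_name] = totals
    return summary


def fetch_segments(base_url, index):
    """Fetch the segments listing from a running cluster (needs network access)."""
    url = "%s/%s/_segments" % (base_url.rstrip("/"), index)
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (placeholder host and index name, requires a live cluster):
# print(summarize_segments(fetch_segments("http://localhost:9200", "files")))
```

Comparing the segment and deleted-document counts between two runs may show whether they ended up in different merge states, which would be one possible source of the size difference.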

If you run optimize after the inserts, you should get very close numbers. In production, optimize is only a good idea if you are done indexing or updating the index; otherwise it causes trouble later on.
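A minimal sketch of the optimize step, assuming Elasticsearch 1.x, where the REST endpoint is `POST /<index>/_optimize` with an optional `max_num_segments` parameter (later versions renamed it to `_forcemerge`). The host, index name and helper names are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def optimize_url(base_url, index, max_num_segments=1):
    """Build the ES 1.x optimize URL for one index."""
    query = urlencode({"max_num_segments": max_num_segments})
    return "%s/%s/_optimize?%s" % (base_url.rstrip("/"), index, query)


def optimize(base_url, index, max_num_segments=1):
    """Send the optimize request (needs a live cluster; can take a long time)."""
    request = Request(
        optimize_url(base_url, index, max_num_segments),
        data=b"",          # empty body, sent as POST
        method="POST",
    )
    with urlopen(request) as resp:
        return resp.read().decode("utf-8")


# Example call once all indexing threads have finished
# (placeholder host and index name):
# print(optimize("http://localhost:9200", "files"))
```

Running this once after each test run, and only then measuring the index size, should make the 1, 2, 4 and 8 thread runs comparable.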

Nik

On May 19, 2015 7:04 AM, "Lavesh Gupta" <[hidden email]> wrote: