ingest performance degrades sharply along with the documents having more fields

Re: ingest performance degrades sharply along with the documents having more fields

kimchy
Administrator
Heya, I worked a bit on this, and 1.x (the upcoming 1.3) now has some significant perf improvements for this case (including Lucene-level improvements that are in ES for now but will be in the next Lucene version). Those include:

6648: https://github.com/elasticsearch/elasticsearch/pull/6648
6714: https://github.com/elasticsearch/elasticsearch/pull/6714
6707: https://github.com/elasticsearch/elasticsearch/pull/6707

It would be interesting if you could run the tests again with the 1.x branch. Also note: please use the default ES features for now, with no disabling of flushing and such.

On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:
I am trying to measure the performance of ingesting documents that have lots of fields.


The latest elasticsearch 1.2.1:
Total docs count: 10k (a small set definitely)
ES_HEAP_SIZE: 48G
settings:
{"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}

mappings:
{"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

All fields in the documents match the templates in the mappings.
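For illustration, those suffix-based dynamic templates route each field to a type by its name. A minimal sketch of that matching logic (the sample field names are made up; the patterns and types come from the mapping above):

```python
from fnmatch import fnmatch

# Dynamic templates from the mapping above: suffix pattern -> field type.
TEMPLATES = [("*_ss", "string"), ("*_dt", "date"), ("*_i", "integer")]

def mapped_type(field_name):
    """Return the type the first matching template would assign, else None."""
    for pattern, ftype in TEMPLATES:
        if fnmatch(field_name, pattern):
            return ftype
    return None

# Hypothetical field names following the benchmark's naming convention:
print(mapped_type("title_ss"))    # string
print(mapped_type("created_dt"))  # date
print(mapped_type("count_i"))     # integer
```

Every distinct field name that matches still becomes its own concrete mapped field, which is exactly what drives the cost discussed in this thread.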

Since I disabled flush & refresh, I submitted a flush command (followed by an optimize command) from the client program every 10 seconds. (I also tried a 10-minute interval and got similar results.)
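The client-side flush/optimize step might look like the following sketch (the host, index name, and helper names are assumptions, not the poster's actual code; ES 1.x exposes these operations as POST endpoints):

```python
import urllib.request

ES_HOST = "http://localhost:9200"  # assumed host/port
INDEX = "doc"                      # index name from the settings above

def endpoint(index, op):
    # Build the REST path for an index-level operation, e.g. /doc/_flush.
    return "/%s/_%s" % (index, op)

def post(path):
    req = urllib.request.Request(ES_HOST + path, data=b"", method="POST")
    return urllib.request.urlopen(req).read()

def flush_and_optimize(index):
    # Flush the translog, then force a segment merge, as in the benchmark.
    post(endpoint(index, "flush"))
    post(endpoint(index, "optimize"))

# The benchmark ran this every 10 seconds while the client ingested docs,
# roughly: while ingesting: flush_and_optimize(INDEX); sleep 10 seconds.
```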

Scenario 0 - 10k docs have 1000 different fields:
Ingestion took 12 secs.  Only 1.08G of heap memory was used (this counts used heap only).


Scenario 1 - 10k docs have 10k different fields (10x the fields of Scenario 0):
This time ingestion took 29 secs.   Only 5.74G of heap memory was used.

I am not sure why the performance degrades so sharply.

If I try to ingest docs having 100k different fields, it takes 17 mins 44 secs.  We only have 10k docs in total, and I am not sure why ES performs so badly.

Can anyone give suggestions to improve the performance?







--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/94f69102-a3ff-4aea-9513-0a07300a8a92%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: ingest performance degrades sharply along with the documents having more fields

Mahesh Venkat
Thanks Shay for updating us on the perf improvements.
Apart from using the default parameters, should we follow the guideline listed in

http://elasticsearch-users.115913.n3.nabble.com/Is-ES-es-index-store-type-memory-equivalent-to-Lucene-s-RAMDirectory-td4057417.html

Lucene supports MMapDirectory for the data-indexing phase (in a batch); can we then switch to in-memory for queries to optimize search latency?

Should we use the JVM system parameter -Des.index.store.type=memory?  Isn't this equivalent to using RAMDirectory in Lucene for in-memory search queries?

Thanks
--Mahesh


Re: ingest performance degrades sharply along with the documents having more fields

Maco Ma
In reply to this post by kimchy
Hi Kimchy,

I reran the benchmark using ES 1.3 with the default settings (just disabling _source & _all), and it shows great progress on performance. However, Solr still outperforms ES 1.3:
Results by number of different metadata fields, comparing ES 1.2.1, ES 1.2.1 with _all/codec bloom filter disabled, ES 1.3, and Solr (each entry: total time -> throughput; CPU; iowait; heap; index size; time in secs for each 1k docs):

Scenario 0: 1000 fields
- ES 1.2.1: 12 secs -> 833 docs/sec; CPU 30.24%; iowait 0.02%; heap 1.08G; index size 36Mb; per 1k docs: 3 1 1 1 1 1 0 1 2 1
- ES 1.2.1 (_all/codec bloom filter disabled): 13 secs -> 769 docs/sec; CPU 23.68%; iowait 0.01%; heap 1.31G; index size 248K; per 1k docs: 2 1 1 1 1 1 1 1 2 1
- ES 1.3: 13 secs -> 769 docs/sec; CPU 44.22%; iowait 0.01%; heap 1.38G; index size 69M; per 1k docs: 2 1 1 1 1 1 2 0 2 2
- Solr: 13 secs -> 769 docs/sec; CPU 28.85%; heap 9.39G; per 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 1: 10k fields
- ES 1.2.1: 29 secs -> 345 docs/sec; CPU 40.83%; iowait 0.02%; heap 5.74G; index size 36Mb; per 1k docs: 14 2 2 2 1 2 2 1 2 1
- ES 1.2.1 (_all/codec bloom filter disabled): 31 secs -> 322.6 docs/sec; CPU 39.29%; iowait 0.01%; heap 4.76G; index size 396K; per 1k docs: 12 1 2 1 1 1 2 1 4 2
- ES 1.3: 20 secs -> 500 docs/sec; CPU 54.74%; iowait 0.02%; heap 3.06G; index size 133M; per 1k docs: 2 2 1 2 2 3 2 2 2 1
- Solr: 12 secs -> 833 docs/sec; CPU 28.62%; heap 9.88G; per 1k docs: 1 1 1 1 2 1 1 1 1 2

Scenario 2: 100k fields
- ES 1.2.1: 17 mins 44 secs -> 9.4 docs/sec; CPU 54.73%; iowait 0.02%; heap 47.99G; index size 75Mb; per 1k docs: 97 183 196 147 109 89 87 49 66 40
- ES 1.2.1 (_all/codec bloom filter disabled): 14 mins 24 secs -> 11.6 docs/sec; CPU 52.30%; iowait 0.02%; heap not recorded; index size 1.5M; per 1k docs: 93 153 151 112 84 65 61 53 51 41
- ES 1.3: 1 min 24 secs -> 119 docs/sec; CPU 47.67%; iowait 0.12%; heap 8.66G; index size 163M; per 1k docs: 9 14 12 12 8 8 5 7 5 4
- Solr: 13 secs -> 769 docs/sec; CPU 29.43%; heap 9.84G; per 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 3: 1M fields
- ES 1.2.1: 183 mins 8 secs -> 0.9 docs/sec; CPU 40.47%; heap 47.99G; per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
- ES 1.2.1 (_all/codec bloom filter disabled): no result listed
- ES 1.3: 11 mins 9 secs -> 15 docs/sec; CPU 41.45%; iowait 0.07%; heap 36.12G; index size 163M; per 1k docs: 12 24 38 55 70 86 106 117 83 78
- Solr: 15 secs -> 666.7 docs/sec; CPU 45.10%; heap 9.64G; per 1k docs: 2 1 1 1 1 2 1 1 3 2



Best Regards
Maco


Re: ingest performance degrades sharply along with the documents having more fields

kimchy
Administrator
In reply to this post by Mahesh Venkat
Yes, this is the equivalent of using RAMDirectory. Please don't use it: mmap is optimized for random access, and if the Lucene index can fit in heap (to use a RAM dir), it can certainly fit in OS RAM, without the implications of loading it onto the heap.
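For reference, the store type is an index-level setting; the sketch below shows where it lives in 1.x-era configuration (values as I understand the 1.x docs; the default FS-based store is what's recommended here, and "memory" is the RAMDirectory-like heap store this thread advises against):

```yaml
# elasticsearch.yml (node level), or per-index settings; 1.x-era syntax.
# Leaving it unset picks a sensible filesystem-based default (e.g. mmapfs
# on 64-bit Linux). Avoid "memory" for the reasons given above.
index.store.type: mmapfs   # alternatives: niofs, simplefs, memory (avoid)
```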


Re: ingest performance degrades sharply along with the documents having more fields

kimchy
Administrator
In reply to this post by Maco Ma
Hi, thanks for running the tests! My tests were capped at 10k fields, and the improvements target that case. Anything beyond that, neither I nor anybody here at Elasticsearch (plus Lucene: Mike/Robert) recommends, and we can't really stand behind supporting it.

In Elasticsearch, there is a conscious decision to have concrete mappings for every field introduced. This allows for nice upstream features, such as autocomplete in Kibana and Sense, as well as certain index/search-level optimizations that can't be done without a concrete mapping for each field. It also incurs a cost when many fields are introduced.

The idea here is that a system that tries to put 1M different fields into Lucene is simply not going to scale. The cost overhead, and even the testability of such a system, is simply not something we can support.

Aside from the obvious overhead of just wrangling so many fields in Lucene (merge costs that keep accumulating, ...), there is also the question of what to do with them. For example, if sorting is enabled, there is a multiplied cost in loading fields for sorting (compared to using nested documents, where the cost is constant, since it's the same field).

I think there might be other factors at play in the performance numbers below, aside from the 100k and 1M different-fields scenarios. We can try to chase them, but the bottom line is the same: we can't support a system that asks for 1M different fields, as we don't believe it uses either ES or Lucene correctly at this point.

I suggest looking into nested documents (regardless of the system you decide to use) as a viable alternative to the many-fields approach. This is the only way you will be able to scale such a system, especially across multiple nodes (nested documents scale out well; many fields don't).
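One common shape for the nested-documents alternative is to fold arbitrary metadata into a single nested field of key/value pairs, so the index keeps a handful of concrete fields no matter how many metadata names exist. A minimal sketch of that transformation (field and key names here are hypothetical, not from the benchmark):

```python
def to_nested(doc):
    """Fold dynamic 'name_suffix' fields into one nested key/value list,
    so the mapping needs only the concrete fields attrs.name and attrs.value."""
    return {"attrs": [{"name": k, "value": v} for k, v in sorted(doc.items())]}

# A document that would otherwise create three distinct mapped fields:
doc = {"title_ss": "foo", "created_dt": "2014-06-13", "count_i": 3}
nested = to_nested(doc)
print(nested["attrs"][0])  # {'name': 'count_i', 'value': 3}
```

In the mapping, `attrs` would be declared with `"type": "nested"`, so each key/value pair is indexed as its own Lucene document against the same two concrete fields, which is why the per-field mapping cost stays constant.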
