ingest performance degrades sharply as documents have more fields


ingest performance degrades sharply as documents have more fields

Maco Ma
I am trying to measure the performance of ingesting documents that have lots of fields.


The latest Elasticsearch, 1.2.1:
Total docs count: 10k (definitely a small set)
ES_HEAP_SIZE: 48G
settings:
{
  "doc" : {
    "settings" : {
      "index" : {
        "uuid" : "LiWHzE5uQrinYW1wW4E3nA",
        "number_of_replicas" : "0",
        "translog" : { "disable_flush" : "true" },
        "number_of_shards" : "5",
        "refresh_interval" : "-1",
        "version" : { "created" : "1020199" }
      }
    }
  }
}

mappings:
{
  "doc" : {
    "mappings" : {
      "type" : {
        "dynamic_templates" : [
          { "t1" : { "mapping" : { "store" : false, "norms" : { "enabled" : false }, "type" : "string" }, "match" : "*_ss" } },
          { "t2" : { "mapping" : { "store" : false, "type" : "date" }, "match" : "*_dt" } },
          { "t3" : { "mapping" : { "store" : false, "type" : "integer" }, "match" : "*_i" } }
        ],
        "_source" : { "enabled" : false },
        "properties" : { }
      }
    }
  }
}

All fields in the documents match the templates in the mappings.

Since I disabled flush & refresh, I submit a flush command (followed by an optimize command) from the client program every 10 seconds. (I also tried a 10-minute interval and got similar results.)
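
In curl terms, the setup looks roughly like this (index name "doc", matching the settings above):

# disable automatic refresh and translog-triggered flushing
curl -XPUT 'localhost:9200/doc/_settings' -d '{ "index.refresh_interval" : "-1" }'
curl -XPUT 'localhost:9200/doc/_settings' -d '{ "index.translog.disable_flush" : true }'

# what the client program issues every 10 seconds
curl -XPOST 'localhost:9200/doc/_flush'
curl -XPOST 'localhost:9200/doc/_optimize'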

Scenario 0 - the 10k docs contain 1000 different fields:
Ingestion took 12 secs. Only 1.08G of heap is used (this figure counts only the used heap memory).


Scenario 1 - the 10k docs contain 10k different fields (10x the fields of scenario 0):
This time ingestion took 29 secs. Only 5.74G of heap is used.

Not sure why the performance degrades sharply.

If I try to ingest docs having 100k different fields, it takes 17 mins 44 secs. We only have 10k docs in total, so I am not sure why ES performs so badly.

Can anyone give suggestions to improve the performance?







Re: ingest performance degrades sharply as documents have more fields

Mark Walkom
It's not surprising that the time increases when you have an order of magnitude more fields.

Are you using the bulk API?

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: [hidden email]
web: www.campaignmonitor.com


Re: ingest performance degrades sharply as documents have more fields

Maco Ma
I used the curl command to do the ingestion (one command, one doc) and the flush. I also tried Solr (with soft/hard commit disabled, doing the commit from a client program) with the same data & commands, and its performance did not degrade. Lucene is used by both of them, so I am not sure why there is such a big difference in performance.
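
For reference, the bulk API Mark asked about sends many docs in one request; a minimal sketch with curl (same "doc" index and "type" type as above; the field names follow the *_ss/*_i/*_dt dynamic templates, and the values are placeholders):

# newline-delimited action/source pairs posted to the _bulk endpoint
curl -XPOST 'localhost:9200/_bulk' --data-binary '{ "index" : { "_index" : "doc", "_type" : "type", "_id" : "0" } }
{ "title_ss" : "placeholder value", "count_i" : 1 }
{ "index" : { "_index" : "doc", "_type" : "type", "_id" : "1" } }
{ "title_ss" : "another placeholder", "created_dt" : "2014-06-13T00:00:00Z" }
'

Batching a few hundred docs per request removes most of the per-document HTTP (and curl process) overhead of the one-command-per-doc approach.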

Re: ingest performance degrades sharply as documents have more fields

Cindy Hsin
In reply to this post by Maco Ma
Hi, Mark:

We are doing single-document ingestion. We did a performance comparison between Solr and Elasticsearch (ES).
ES performance degrades dramatically as we increase the number of metadata fields, whereas Solr performance remains the same.
The benchmark uses a very small data set (i.e. 10k documents; the index size is only 75MB). The machine is a high-spec machine with 48GB of memory.
You can see ES performance drop 50% even though the machine has plenty of memory, and ES consumes all of the machine's memory when the metadata fields increase to 100k.
This behavior seems abnormal since the data is really tiny.

We also tried larger data sets (i.e. 100k and 1 Mil documents); ES threw OOM (out of memory) errors for scenario 2 with the 1 Mil doc set.
We want to know whether this is a bug in ES and/or whether there is a workaround (a configuration step) we can use to eliminate the performance degradation.
Currently ES performance does not meet the customer requirement, so we want to see if there is any way we can bring ES performance to the same level as Solr.

Below are the configuration settings and benchmark results for the 10k document set.
Scenario 0 means there are 1000 different metadata fields in the system.
Scenario 1 means there are 10k different metadata fields in the system.
Scenario 2 means there are 100k different metadata fields in the system.
Scenario 3 means there are 1M different metadata fields in the system.
  • disable hard commit & soft commit + use a client to do the commit (ES & Solr) every 10 seconds
    • ES: flush and refresh are disabled
    • Solr: autoSoftCommit is disabled
  • monitor load on the system (CPU, memory, etc.) and how the ingestion speed changes over time
  • monitor the ingestion speed (is there any degradation over time?)
  • new ES config: new_ES_config.sh; new ingestion: new_ES_ingest_threads.pl
  • new Solr ingestion: new_Solr_ingest_threads.pl
  • flush interval: 10s

Number of different metadata fields (ES vs. Solr):

Scenario 0: 1000 fields
  ES:   12 secs -> 833 docs/sec; CPU: 30.24%; Heap: 1.08G; iowait: 0.02%; index size: 36M
        time (secs) for each 1k docs: 3 1 1 1 1 1 0 1 2 1
  Solr: 13 secs -> 769 docs/sec; CPU: 28.85%; Heap: 9.39G
        time (secs) for each 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 1: 10k fields
  ES:   29 secs -> 345 docs/sec; CPU: 40.83%; Heap: 5.74G; iowait: 0.02%; index size: 36M
        time (secs) for each 1k docs: 14 2 2 2 1 2 2 1 2 1
  Solr: 12 secs -> 833 docs/sec; CPU: 28.62%; Heap: 9.88G
        time (secs) for each 1k docs: 1 1 1 1 2 1 1 1 1 2

Scenario 2: 100k fields
  ES:   17 mins 44 secs -> 9.4 docs/sec; CPU: 54.73%; Heap: 47.99G; iowait: 0.02%; index size: 75M
        time (secs) for each 1k docs: 97 183 196 147 109 89 87 49 66 40
  Solr: 13 secs -> 769 docs/sec; CPU: 29.43%; Heap: 9.84G
        time (secs) for each 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 3: 1M fields
  ES:   183 mins 8 secs -> 0.9 docs/sec; CPU: 40.47%; Heap: 47.99G
        time (secs) for each 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
  Solr: 15 secs -> 666.7 docs/sec; CPU: 45.10%; Heap: 9.64G
        time (secs) for each 1k docs: 2 1 1 1 1 2 1 1 3 2


Thanks!
Cindy

Re: ingest performance degrades sharply as documents have more fields

Michael McCandless-3
Hi,

Could you post the scripts you linked to (new_ES_config.sh, new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inlined?  I can't download them from where you linked.

Optimizing every 10 seconds or 10 minutes is really not a good idea in general, but I guess if you're doing the same with ES and Solr then the comparison is at least "fair".

It's odd you see such a slowdown with ES...

Mike

Re: ingest performance degrades sharply as documents have more fields

Michael McCandless-3
In reply to this post by Cindy Hsin
I tested roughly your Scenario 2 (100K unique fields, 100 fields per document) with a straight Lucene test (attached, but not sure if the list strips attachments).  Net/net I see ~100 docs/sec with one thread ... which is very slow.

Lucene stores quite a lot for each unique indexed field name and it's really a bad idea to plan on having so many unique fields in the index: you'll spend lots of RAM and CPU.

Can you describe the wider use case here?  Maybe there's a more performant way to achieve it...




ManyLuceneFields.java (2K) Download Attachment
Re: ingest performance degrades sharply as documents have more fields

Cindy Hsin
In reply to this post by Maco Ma
The way we make Solr ingest faster (single-document ingest) is by turning off the engine's soft commit and hard commit and using a client to commit the changes every 10 seconds.
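
In curl terms, the client-side commit amounts to periodic calls like these (same endpoint and collection name as in the ingestion scripts in this thread):

curl 'http://localhost:8983/solr/collection2/update?commit=true'
curl 'http://localhost:8983/solr/collection2/update?optimize=true'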

Solr ingest speed remains at 800 docs per second, whereas ES ingest speed drops by half when we increase the fields (i.e. from 1000 to 10k).
I have asked Maco to send you the requested scripts so you can do more analysis.

If you can help to solve the first level of ES performance degradation (i.e. 1000 to 10k fields) as a starting point, that would be best.

We do have real customer scenarios that require a large number of metadata fields, which is why this is a blocking issue for the stack evaluation between Solr and Elasticsearch.

Thanks!
Cindy

Re: ingest performance degrades sharply as documents have more fields

Maco Ma
In reply to this post by Michael McCandless-3
Hi Mike,

new_ES_config.sh (defines the templates and disables refresh/flush):
curl -XPOST localhost:9200/doc -d '{
      "mappings" : {
          "type" : {
                  "_source" : { "enabled" : false },
                  "dynamic_templates" : [
                    {"t1":{
                  "match" : "*_ss",
                  "mapping":{
                        "type": "string",
                        "store":false,
                        "norms" : {"enabled" : false}
                        }
                        }},
                    {"t2":{
                  "match" : "*_dt",
                  "mapping":{
                        "type": "date",
                        "store": false
                        }
                        }},
                    {"t3":{
                  "match" : "*_i",
                  "mapping":{
                        "type": "integer",
                        "store": false
                        }
                        }}
]
              }
        }
  }'

curl -XPUT localhost:9200/doc/_settings -d '{
      "index.refresh_interval" : "-1"
}'

curl -XPUT localhost:9200/doc/_settings -d '{
      "index.translog.disable_flush" : true
}'

new_ES_ingest_threads.pl (spawns 10 threads that use curl to ingest the docs, plus one thread to flush/optimize periodically):

my $num_args = $#ARGV + 1;
if ($num_args < 1 || $num_args > 2) {
  print "\n usage: $0 [src_dir] [thread_count]\n";
  exit;
}

my $INST_HOME="/scratch/aime/elasticsearch-1.2.1";

my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
chomp($pid);
if( "$pid" eq "")
{
  print "Instance is not up\n";
  exit;
}


my $dir = $ARGV[0];
my $td_count = 10;
$td_count = $ARGV[1] if($num_args == 2);
my $lf = "ingest_$dir.log";   # log file name (assumed); $lf is needed by the open() below
open(FH, ">$lf");
print FH "source dir: $dir\nthread_count: $td_count\n";
print FH localtime()."\n";

use threads;
use threads::shared;

my $flush_intv = 10;

my $no:shared=0;
my $total = 10000;
my $intv = 1000;
my $tstr:shared = "";
my $ltime:shared = time;

sub commit {
  $SIG{'KILL'} = sub {`curl -XPOST 'http://localhost:9200/doc/_flush'`;print "forced commit done on ".localtime()."\n";threads->exit();};

  while ($no < $total )
  {
    `curl -XPOST 'http://localhost:9200/doc/_flush'`;
    `curl -XPOST 'http://localhost:9200/doc/_optimize'`;
    print "commit on ".localtime()."\n";
    sleep($flush_intv);
  }
  `curl -XPOST 'http://localhost:9200/doc/_flush'`;
  print "commit done on ".localtime()."\n";
}

sub do {
  my $c = -1;
  while(1)
  {
    {
      lock($no);
      $c=$no;
      $no++;
    }
    last if($c >= $total);
    `curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json`;
    if( ($c +1) % $intv == 0 )
    {
      lock($ltime);
      $curtime = time;
      $tstr .= ($curtime - $ltime)." ";
      $ltime = $curtime;
    }
  }
}

# start the monitor processes
my $sarId = qx(sar -A 5 100000 -o sar5sec_$dir.out > /dev/null &\necho \$!);
my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);

my $ct = threads->create(\&commit);
my $start = time;
my @ts=();
for $i (1..$td_count)
{
  my $t = threads->create(\&do);
  push(@ts, $t);
}

for my $t (@ts)
{
  $t->join();
}

$ct->kill('KILL');
my $fin = time;

qx(kill -9 $sarId\nkill -9 $jgcId);

print FH localtime()."\n";
$ct->join();
print FH qx(curl 'http://localhost:9200/doc/type/_count?q=*');
close(FH);

new_Solr_ingest_threads.pl is similar to new_ES_ingest_threads.pl but uses different curl commands. Only the differences are posted here:

sub commit {
  while ($no < $total )
  {
    `curl  'http://localhost:8983/solr/collection2/update?commit=true'`;
    `curl  'http://localhost:8983/solr/collection2/update?optimize=true'`;
    print "commit on ".localtime()."\n";
    sleep(10);
  }
  `curl  'http://localhost:8983/solr/collection2/update?commit=true'`;
  print "commit done on ".localtime()."\n";
}


sub do {
  my $c = -1;
  while(1)
  {
    {
      lock($no);
      $c=$no;
      $no++;
    }
    last if($c >= $total);
    `curl  -s 'http://localhost:8983/solr/collection2/update/json' --data-binary \@$dir/$c.json -H 'Content-type:application/json'`;
    if( ($c +1) % $intv == 0 )
    {
      lock($ltime);
      $curtime = time;
      $tstr .= ($curtime - $ltime)." ";
      $ltime = $curtime;
    }
  }
}


B&R
Maco

Re: ingest performance degrades sharply as documents have more fields

Maco Ma
In reply to this post by Michael McCandless-3
I tried your script with iwc.setRAMBufferSizeMB(40000) and a 48G heap size. The speed is around 430 docs/sec before the first flush and the final speed is 350 docs/sec. I am not sure what configuration Solr uses such that its ingestion speed can be 800 docs/sec.

Maco

Re: ingest performance degrades sharply as documents have more fields

Michael McCandless-3
On Wed, Jun 18, 2014 at 2:38 AM, Maco Ma <[hidden email]> wrote:
I tried your script with iwc.setRAMBufferSizeMB(40000) and a 48G heap size. The speed is around 430 docs/sec before the first flush and the final speed is 350 docs/sec. I am not sure what configuration Solr uses such that its ingestion speed can be 800 docs/sec.

Well, probably the difference is threads?  That simple Lucene test uses only 1 thread, but your ES/Solr test uses 10 threads.

I think the cost in ES is how the MapperService maintains mappings for all fields; I don't think there's a quick fix to reduce this cost.

But net/net you really need to take a step back and re-evaluate your approach here: even if you use Solr, indexing at 800 docs/sec using 10 threads is awful indexing performance and this is because Lucene itself has a high cost per field, at indexing time and searching time.  E.g. have you tried opening a searcher once you've built a large index with so many unique fields?  The heap usage will be very high.  Tested search performance on that searcher?  Merging cost will be very high, etc.

Lucene is just not optimized for the "zillions of unique fields" case, because you can so easily move those N fields into a single field; e.g. if this is just for simple term filtering, make a single field and then as terms insert "fieldName:fieldValue" as your tokens.
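
A minimal sketch of that approach with curl, assuming a fresh index and a made-up single field called "props" that holds the former fields as exact (not_analyzed) tokens:

# one not_analyzed field instead of many dynamic fields ("doc2" and "props" are made-up names)
curl -XPOST 'localhost:9200/doc2' -d '{
  "mappings" : {
    "type" : {
      "_source" : { "enabled" : false },
      "properties" : {
        "props" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}'

# each document carries its former fields as "fieldName:fieldValue" tokens
curl -XPOST 'localhost:9200/doc2/type/1' -d '{ "props" : [ "color_ss:red", "size_i:42", "created_dt:2014-06-13" ] }'

# term filtering then matches exact tokens
curl -XPOST 'localhost:9200/doc2/type/_search' -d '{ "query" : { "term" : { "props" : "size_i:42" } } }'

The trade-off is that every value becomes an opaque string token, so numeric or date range queries need separate handling.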

If you insist on creating so many unique fields in your use case you will be unhappy down the road with Lucene ...

Mike McCandless


Re: ingest performance degrades sharply as documents have more fields

Cindy Hsin
In reply to this post by Maco Ma
Hi, Mike:

Since both ES and Solr use Lucene, do you know why we see a big ingest performance degradation only with ES and not with Solr?

Are you suggesting that if our customers require a large number of metadata fields, even Solr won't be able to provide decent performance when ingest and search happen concurrently?

Thanks!
Cindy

Re: ingest performance degrades sharply as documents have more fields

Michael McCandless-3
On Fri, Jun 20, 2014 at 8:00 PM, Cindy Hsin <[hidden email]> wrote:
Hi, Mike:

Since both ES and Solr use Lucene, do you know why we see a big ingest performance degradation only with ES and not with Solr?

I'm not sure why: clearly something is slow with ES as you add more and more fields.  I think it has to do with how it manages its mappings.
 
Are you suggesting that if our customers require a large number of metadata fields, even Solr won't be able to provide decent performance when ingest and search happen concurrently?

Exactly.  Even if you/we fixed ES's slowness as you add tons of fields, or if you went with Solr, you're still going to see poor indexing/merging/searching performance because Lucene itself doesn't scale very well to so many fields: this use case (tons of fields) has never been a priority for Lucene developers because it's typically easy for the application to change its approach to not use so many fields.

Re: ingest performance degrades sharply as documents have more fields

joergprante@gmail.com
In reply to this post by Maco Ma
Two things to add, to make the Elasticsearch/Solr comparison more fair.

In the ES mapping, you did not disable the _all field.

If you have the _all field enabled, all tokens will be indexed twice: once for the field itself and once for _all.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html
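
For the mapping used in this thread, that is one extra line in the index creation from new_ES_config.sh; a sketch with _all switched off:

curl -XPOST 'localhost:9200/doc' -d '{
  "mappings" : {
    "type" : {
      "_all"    : { "enabled" : false },
      "_source" : { "enabled" : false },
      "dynamic_templates" : [
        { "t1" : { "match" : "*_ss", "mapping" : { "type" : "string",  "store" : false, "norms" : { "enabled" : false } } } },
        { "t2" : { "match" : "*_dt", "mapping" : { "type" : "date",    "store" : false } } },
        { "t3" : { "match" : "*_i",  "mapping" : { "type" : "integer", "store" : false } } }
      ]
    }
  }
}'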

Also, you may want to disable the ES codec bloom filter

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings

because loading the bloom filter consumes significant memory.
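
If I recall the 1.x setting correctly, it can be set per index along these lines (treat the exact setting name as an assumption and check the codec docs linked above):

# assumed setting name for ES 1.x; verify against the linked codec documentation
curl -XPUT 'localhost:9200/doc/_settings' -d '{ "index.codec.bloom.load" : false }'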

I'm not sure why you call curl from Perl, since this adds overhead. There are nice Solr/ES Perl clients that can push docs using bulk indexing.

Jörg


On Wednesday, June 18, 2014 4:50:13 AM UTC+2, Maco Ma wrote:
Hi Mike,

new_ES_config.sh(define the templates and disable the refresh/flush):
curl -XPOST localhost:9200/doc -d '{
      "mappings" : {
          "type" : {
                  "_source" : { "enabled" : false },
                  "dynamic_templates" : [
                    {"t1":{
                  "match" : "*_ss",
                  "mapping":{
                        "type": "string",
                        "store":false,
                        "norms" : {"enabled" : false}
                        }
                        }},
                    {"t2":{
                  "match" : "*_dt",
                  "mapping":{
                        "type": "date",
                        "store": false
                        }
                        }},
                    {"t3":{
                  "match" : "*_i",
                  "mapping":{
                        "type": "integer",
                        "store": false
                        }
                        }}
]
              }
        }
  }'

curl -XPUT localhost:9200/doc/_settings -d '{
      "index.refresh_interval" : "-1"
}'

curl -XPUT localhost:9200/doc/_settings -d '{
      "index.translog.disable_flush" : true
}'

new_ES_ingest_threads.pl( spawn 10 threads to use curl command to ingest the doc and one thread to flush/optimize periodically):

my $num_args = $#ARGV + 1;
if ($num_args < 1 || $num_args > 2) {
  print "\n usuage:$0 [src_dir] [thread_count]\n";
  exit;
}

my $INST_HOME="/scratch/aime/elasticsearch-1.2.1";

my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
chomp($pid);
if( "$pid" eq "")
{
  print "Instance is not up\n";
  exit;
}


my $dir = $ARGV[0];
my $td_count = 10;
$td_count = $ARGV[1] if($num_args == 2);
open(FH, ">$lf");
print FH "source dir: $dir\nthread_count: $td_count\n";
print FH localtime()."\n";

use threads;
use threads::shared;

my $flush_intv = 10;

my $no:shared=0;
my $total = 10000;
my $intv = 1000;
my $tstr:shared = "";
my $ltime:shared = time;

sub commit {
  $SIG{'KILL'} = sub {`curl -XPOST '<a href="http://localhost:9200/doc/_flush';print" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2F_flush\47%3Bprint\46sa\75D\46sntz\0751\46usg\75AFQjCNEN6mDPFp2C7AsCcbEbB2YogHuaKQ';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2F_flush\47%3Bprint\46sa\75D\46sntz\0751\46usg\75AFQjCNEN6mDPFp2C7AsCcbEbB2YogHuaKQ';return true;">http://localhost:9200/doc/_flush'`;print "forced commit done on ".localtime()."\n";threads->exit();};

  while ($no < $total )
  {
    `curl -XPOST '<a href="http://localhost:9200/doc/_flush'" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2F_flush\47\46sa\75D\46sntz\0751\46usg\75AFQjCNFGD_BWMTIvRMWH-CNN85m1VpHyVg';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2F_flush\47\46sa\75D\46sntz\0751\46usg\75AFQjCNFGD_BWMTIvRMWH-CNN85m1VpHyVg';return true;">http://localhost:9200/doc/_flush'`;
    `curl -XPOST '<a href="http://localhost:9200/doc/_optimize'" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2F_optimize\47\46sa\75D\46sntz\0751\46usg\75AFQjCNE_vacBab2GIWdNZYQTY4Q8mjdtQQ';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2F_optimize\47\46sa\75D\46sntz\0751\46usg\75AFQjCNE_vacBab2GIWdNZYQTY4Q8mjdtQQ';return true;">http://localhost:9200/doc/_optimize'`;
    print "commit on ".localtime()."\n";
    sleep($flush_intv);
  }
  `curl -XPOST '<a href="http://localhost:9200/doc/_flush'" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2F_flush\47\46sa\75D\46sntz\0751\46usg\75AFQjCNFGD_BWMTIvRMWH-CNN85m1VpHyVg';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2F_flush\47\46sa\75D\46sntz\0751\46usg\75AFQjCNFGD_BWMTIvRMWH-CNN85m1VpHyVg';return true;">http://localhost:9200/doc/_flush'`;
  print "commit done on ".localtime()."\n";
}

sub do {
  my $c = -1;
  while(1)
  {
    {
      lock($no);
      $c=$no;
      $no++;
    }
    last if($c >= $total);
    `curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json`;
    if( ($c +1) % $intv == 0 )
    {
      lock($ltime);
      $curtime = time;
      $tstr .= ($curtime - $ltime)." ";
      $ltime = $curtime;
    }
  }
}

# start the monitor processes
my $sarId = qx(sar -A 5 100000 -o sar5sec_$dir.out > /dev/null &\necho \$!);
my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);

my $ct = threads->create(\&commit);
my $start = time;
my @ts=();
for $i (1..$td_count)
{
  my $t = threads->create(\&do);
  push(@ts, $t);
}

for my $t (@ts)
{
  $t->join();
}

$ct->kill('KILL');
my $fin = time;

qx(kill -9 $sarId\nkill -9 $jgcId);

print FH localtime()."\n";
$ct->join();
print FH qx(curl '<a href="http://localhost:9200/doc/type/_count?q=*'" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2Ftype%2F_count%3Fq%3D*\47\46sa\75D\46sntz\0751\46usg\75AFQjCNH-hUDuanZL2KJwlWUHg7vUpDFIHA';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fdoc%2Ftype%2F_count%3Fq%3D*\47\46sa\75D\46sntz\0751\46usg\75AFQjCNH-hUDuanZL2KJwlWUHg7vUpDFIHA';return true;">http://localhost:9200/doc/type/_count?q=*');
close(FH);

new_Solr_ingest_threads.pl is similar to the file  new_ES_ingest_threads.pl and uses the different parameters for curl commands. Only post the differences here:

sub commit {
  while ($no < $total )
  {
    `curl  '<a href="http://localhost:8983/solr/collection2/update?commit=true'" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fcollection2%2Fupdate%3Fcommit%3Dtrue\47\46sa\75D\46sntz\0751\46usg\75AFQjCNHJQ4KpwFxnL9d5dXxD6F5FWi_5rw';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fcollection2%2Fupdate%3Fcommit%3Dtrue\47\46sa\75D\46sntz\0751\46usg\75AFQjCNHJQ4KpwFxnL9d5dXxD6F5FWi_5rw';return true;">http://localhost:8983/solr/collection2/update?commit=true'`;
    `curl  '<a href="http://localhost:8983/solr/collection2/update?optimize=true'" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fcollection2%2Fupdate%3Foptimize%3Dtrue\47\46sa\75D\46sntz\0751\46usg\75AFQjCNH6-wStE0ti2O3h0jJdVgZu75kgWw';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fcollection2%2Fupdate%3Foptimize%3Dtrue\47\46sa\75D\46sntz\0751\46usg\75AFQjCNH6-wStE0ti2O3h0jJdVgZu75kgWw';return true;">http://localhost:8983/solr/collection2/update?optimize=true'`;
    print "commit on ".localtime()."\n";
    sleep(10);
  }
  `curl  '<a href="http://localhost:8983/solr/collection2/update?commit=true'" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fcollection2%2Fupdate%3Fcommit%3Dtrue\47\46sa\75D\46sntz\0751\46usg\75AFQjCNHJQ4KpwFxnL9d5dXxD6F5FWi_5rw';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fcollection2%2Fupdate%3Fcommit%3Dtrue\47\46sa\75D\46sntz\0751\46usg\75AFQjCNHJQ4KpwFxnL9d5dXxD6F5FWi_5rw';return true;">http://localhost:8983/solr/collection2/update?commit=true'`;
  print "commit done on ".localtime()."\n";
}


sub do {
  my $c = -1;
  while(1)
  {
    {
      lock($no);
      $c=$no;
      $no++;
    }
    last if($c >= $total);
    `curl  -s '<a href="http://localhost:8983/solr/collection2/update/json" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fcollection2%2Fupdate%2Fjson\46sa\75D\46sntz\0751\46usg\75AFQjCNFvMZUBs9_vW5m9KITCmaSSrB3eXg';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fcollection2%2Fupdate%2Fjson\46sa\75D\46sntz\0751\46usg\75AFQjCNFvMZUBs9_vW5m9KITCmaSSrB3eXg';return true;">http://localhost:8983/solr/collection2/update/json' --data-binary \@$dir/$c.json -H 'Content-type:application/json'`;
    if( ($c +1) % $intv == 0 )
    {
      lock($ltime);
      $curtime = time;
      $tstr .= ($curtime - $ltime)." ";
      $ltime = $curtime;
    }
  }
}


B&R
Maco


Re: ingest performance degrades sharply along with the documents having more fileds

Cindy Hsin
In reply to this post by Maco Ma
Thanks!

I have asked Maco to re-test ES with these two parameters disabled.

One more question regarding Lucene's capability with a large number of metadata fields: what is the largest number of metadata fields Lucene supports per index?
What are the different strategies for handling a very large number of metadata fields? Do you recommend using "type" to partition different sets of metadata fields within an index?
I will also check with our team regarding their actual need for so many metadata fields.

Thanks!
Cindy


Re: ingest performance degrades sharply along with the documents having more fileds

Michael McCandless-3
Hi Cindy,

There isn't a hard limit on the number of fields Lucene supports; it's more that each field carries some heap overhead, added CPU/IO cost for merging, etc.  It's just not a well-tested usage of Lucene, and not something the developers focus on optimizing.

Partitioning by _type won't change things (it's still a single Lucene index).

How you design your schema really depends on how you want to search on them.  E.g. if these are single-token text fields that you need to filter on then you can index them all under a single field (say allFilterFields), pre-pending your original field name onto each token, and then at search time doing the same (searching for field:text as your text token within allFilterFields).
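
A rough sketch of that single-field approach (the index name doc2, the field name allFilterFields, and the sample values below are made up for illustration; the field should be not_analyzed, or use a keyword analyzer, so the prepended field name stays attached to the value):

# Hypothetical index with one combined filter field instead of thousands of sparse fields:
curl -XPOST 'localhost:9200/doc2' -d '{
  "mappings": {
    "type": {
      "_source": { "enabled": false },
      "properties": {
        "allFilterFields": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'

# Index time: prepend the original field name onto each value.
curl -XPOST 'localhost:9200/doc2/type/1' -d '{
  "allFilterFields": ["field123_ss:foo", "field987_ss:bar"]
}'

# Search time: build the same field:value token and match on it.
curl -XPOST 'localhost:9200/doc2/type/_search' -d '{
  "query": { "term": { "allFilterFields": "field123_ss:foo" } }
}'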





Re: ingest performance degrades sharply along with the documents having more fileds

Maco Ma
In reply to this post by joergprante@gmail.com
Hi Jörg,

I reran the benchmark with the _all field and the codec bloom filter disabled: the index size dropped dramatically, but the ingestion speed is still similar to before:
Columns: ES (original settings) vs. ES with _all/codec bloom filter disabled; the scenario label is the number of different metadata fields.

Scenario 0: 1000
  ES:                  12 secs -> 833 docs/sec; CPU: 30.24%; iowait: 0.02%; Heap: 1.08G; index size: 36Mb; time(secs) per 1k docs: 3 1 1 1 1 1 0 1 2 1
  ES (_all/bloom off): 13 secs -> 769 docs/sec; CPU: 23.68%; iowait: 0.01%; Heap: 1.31G; index size: 248K; time(secs) per 1k docs: 2 1 1 1 1 1 1 1 2 1

Scenario 1: 10k
  ES:                  29 secs -> 345 docs/sec; CPU: 40.83%; iowait: 0.02%; Heap: 5.74G; index size: 36Mb; time(secs) per 1k docs: 14 2 2 2 1 2 2 1 2 1
  ES (_all/bloom off): 31 secs -> 322.6 docs/sec; CPU: 39.29%; iowait: 0.01%; Heap: 47.95G; index size: 396K; time(secs) per 1k docs: 12 1 2 1 1 1 2 1 4 2

Scenario 2: 100k
  ES:                  17 mins 44 secs -> 9.4 docs/sec; CPU: 54.73%; iowait: 0.02%; Heap: 47.99G; index size: 75Mb; time(secs) per 1k docs: 97 183 196 147 109 89 87 49 66 40
  ES (_all/bloom off): 14 mins 24 secs -> 11.6 docs/sec; CPU: 52.30%; iowait: 0.02%; Heap: 47.96G; index size: 1.5M; time(secs) per 1k docs: 93 153 151 112 84 65 61 53 51 41

We ingest one document per request instead of using bulk ingestion; that matches our real-world requirements.
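
(For contrast only — this is not what was benchmarked — the same documents pushed through the bulk API would look roughly like the sketch below; the file name bulk.json and the field values are made up.)

# bulk.json: newline-delimited JSON, one action line followed by one document line per doc:
# { "index" : { "_index" : "doc", "_type" : "type", "_id" : "0" } }
# { "field0_ss" : "foo", "field1_i" : 42 }
# { "index" : { "_index" : "doc", "_type" : "type", "_id" : "1" } }
# { "field2_ss" : "bar", "field3_dt" : "2014-06-13T00:00:00" }
curl -XPOST -s 'localhost:9200/_bulk' --data-binary @bulk.json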

Scripts to disable _all and the codec bloom filter:
curl -XPOST localhost:9200/doc -d '{
      "mappings" : {
          "type" : {
                  "_source" : { "enabled" : false },
                  "_all" : { "enabled" : false },
                  "dynamic_templates" : [
                    {"t1":{
                  "match" : "*_ss",
                  "mapping":{
                        "type": "string",
                        "store":false,
                        "norms" : {"enabled" : false}
                        }
                        }},
                    {"t2":{
                  "match" : "*_dt",
                  "mapping":{
                        "type": "date",
                        "store": false
                        }
                        }},
                    {"t3":{
                  "match" : "*_i",
                  "mapping":{
                        "type": "integer",
                        "store": false
                        }
                        }}
]
              }
        }
  }'


curl -XPUT localhost:9200/doc/_settings -d '{
      "index.codec.bloom.load" :false
}'
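
To double-check that the mapping and the bloom filter setting were applied, the index metadata can be read back (standard endpoints, shown here just for verification):

curl 'localhost:9200/doc/_mapping?pretty'
curl 'localhost:9200/doc/_settings?pretty'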

Best Regards
Maco


Re: ingest performance degrades sharply along with the documents having more fileds

Cindy Hsin
In reply to this post by Maco Ma
Looks like the memory usage increased a lot with 10k fields when these two parameters are disabled.

Based on the experiments we have done, it looks like ES shows abnormal memory usage and performance degradation when the number of fields is large (i.e. 10k), whereas Solr's memory usage and performance remain stable with a large number of fields.

If we are only looking at the 10k-fields scenario, is there a way for ES to make the ingest performance better (perhaps via a bug fix)? Looking at the performance numbers, I think this abnormal memory usage & performance drop is most likely a bug in the ES layer. If a fix is not technically feasible, then we'll report back that we have checked with ES experts and confirmed there is no way for ES to address this issue. The solution Mike suggested sounds like a workaround (i.e. combine multiple fields into one field to reduce the large number of fields); I can run it by our team, but I'm not sure it will fly.

I have also asked Maco to do one more benchmark (where search and ingest run concurrently) for both ES and Solr, to check whether Solr shows any performance degradation when search and ingest happen concurrently. I think this is one point Mike mentioned, right? Even with Solr, you think we will hit some performance issues with this many fields when ingest and query run concurrently.

Thanks!
Cindy


Re: ingest performance degrades sharply along with the documents having more fileds

Maco Ma
I ran the benchmark where search and ingest run concurrently. The results are pasted here:
Columns: ES with _all/codec bloom filter disabled (ingest only) vs. the same setup with ingestion & query running concurrently; the scenario label is the number of different metadata fields.

Scenario 0: 1000
  ES (ingest only):    13 secs -> 769 docs/sec; CPU: 23.68%; iowait: 0.01%; Heap: 1.31G; index size: 248K; ingestion speed change: 2 1 1 1 1 1 1 1 2 1
  ES (ingest + query): 14 secs -> 714 docs/sec; CPU: 27.51%; iowait: 0.03%; Heap: 1.27G; index size: 304K; ingestion speed change: 3 1 1 1 1 1 1 2 2 1

Scenario 1: 10k
  ES (ingest only):    31 secs -> 322.6 docs/sec; CPU: 39.29%; iowait: 0.01%; Heap: 4.76G; index size: 396K; ingestion speed change: 12 1 2 1 1 1 2 1 4 2
  ES (ingest + query): 35 secs -> 285 docs/sec; CPU: 42.46%; iowait: 0.01%; Heap: 5.14G; index size: 336K; ingestion speed change: 13 2 1 1 2 1 1 4 1 2

I added one more thread to the existing ingestion script that issues the queries:
sub query {
  my $qstr = q(curl -s 'http://localhost:9200/doc/type/_search' -d'{"query":{"filtered":{"query":{"query_string":{"fields" : [");
  my $fstr = q(curl -s 'http://localhost:9200/doc/type/_search' -d'{"query":{"filtered":{"query":{"match_all":{}},"filter":{");
  my $fieldNum =  1000;

  while ($no < $total )
  {
    $tr= int(rand(5));
    if( $tr == 0 )
    {
      $fieldName = "field".int(rand($fieldNum))."_i";
      $fieldValue = "*1*";
    }
    elsif ($tr == 1)
    {
      $fieldName = "field".int(rand($fieldNum))."_dt";
      $fieldValue = "*2*";
    }
    else
    {
      $fieldName = "field".int(rand($fieldNum))."_ss";
      $fieldValue = "f*";
    }

    $cstr = $qstr. "$fieldName" . q("],"query":") . $fieldValue . q("}}}}}');
    print $cstr."\n";
    print `$cstr`."\n";

    $tr= int(rand(5));
    if( $tr == 0 )
    {
      $cstr = $fstr. q(range":{ "field).int(rand($fieldNum)).q(_i":{"gte":). int(rand(1000)). q(}}}}}}');
    }
    elsif ($tr == 1)
    {
      $cstr = $fstr. q(range":{ "field). int(rand($fieldNum)).q(_dt":{"from": "2010-01-).(1+int(rand(31))).q(T02:10:03"}}}}}}');
    }
    else
    {
      $cstr = $fstr. q(regexp":{"field).int(rand($fieldNum)).q(_ss":"f.*"}}}}}');
    }
    print $cstr."\n";
    print `$cstr`."\n";
  }
}


Maco


Re: ingest performance degrades sharply along with the documents having more fileds

Michael McCandless-3
In reply to this post by Cindy Hsin
Some responses below:

On Tue, Jun 24, 2014 at 7:04 PM, Cindy Hsin <[hidden email]> wrote:
Looks like the memory usage increased a lot with 10k fields when these two parameters are disabled.

Based on the experiments we have done, it looks like ES shows abnormal memory usage and performance degradation when the number of fields is large (i.e. 10k), whereas Solr's memory usage and performance remain stable with a large number of fields.

If we are only looking at the 10k-fields scenario, is there a way for ES to make the ingest performance better (perhaps via a bug fix)?

I've opened an ES issue to address the slowdown as more and more unique fields are added via dynamic templates: https://github.com/elasticsearch/elasticsearch/issues/6619
 
The solution Mike suggested sounds like a workaround (i.e. combine multiple fields into one field to reduce the large number of fields); I can run it by our team, but I'm not sure it will fly.

Well, I think both Solr and ES (once we fix the above issue) will still have high cost if you index so many fields, since they both are based on Lucene.

One simple but effective approach, whether you use Solr or ES, is to use nested documents, where the parent document holds any "common" fields across all of your documents, and then each child document has two fields, key and value.  key holds the original field name you wanted to index, and value holds the original field value, so you have as many child documents as you had field+values to index for your original document.  This approach has worked well in other applications that needed so many fields...

It essentially turns the wide range of field names into a wide range of field values instead, which Lucene handles very well.  It results in more, smaller documents, but this scales out well as you add nodes.
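
A rough sketch of that nested key/value layout (the index name kvdocs, the field names title/meta/key/value, and the sample values are made up for illustration):

# Parent doc keeps the common fields; the sparse metadata becomes nested
# key/value child docs instead of thousands of distinct top-level fields.
curl -XPOST 'localhost:9200/kvdocs' -d '{
  "mappings": {
    "type": {
      "properties": {
        "title": { "type": "string" },
        "meta": {
          "type": "nested",
          "properties": {
            "key":   { "type": "string", "index": "not_analyzed" },
            "value": { "type": "string" }
          }
        }
      }
    }
  }
}'

curl -XPOST 'localhost:9200/kvdocs/type/1' -d '{
  "title": "sample doc",
  "meta": [
    { "key": "field123_ss", "value": "foo" },
    { "key": "field987_ss", "value": "bar" }
  ]
}'

# Match key and value together inside the same nested doc:
curl -XPOST 'localhost:9200/kvdocs/type/_search' -d '{
  "query": {
    "nested": {
      "path": "meta",
      "query": {
        "bool": {
          "must": [
            { "term":  { "meta.key":   "field123_ss" } },
            { "match": { "meta.value": "foo" } }
          ]
        }
      }
    }
  }
}'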


Re: ingest performance degrades sharply along with the documents having more fileds

Maco Ma
In reply to this post by Maco Ma
Added the Solr benchmark as well:

Columns: ES with _all/codec bloom filter disabled, ES with ingestion & query running concurrently, Solr, and Solr with ingestion & query running concurrently; the scenario label is the number of different metadata fields.

Scenario 0: 1000
  ES:                    13 secs -> 769 docs/sec; CPU: 23.68%; iowait: 0.01%; Heap: 1.31G; index size: 248K; ingestion speed change: 2 1 1 1 1 1 1 1 2 1
  ES (ingest + query):   14 secs -> 714 docs/sec; CPU: 27.51%; iowait: 0.03%; Heap: 1.27G; index size: 304K; ingestion speed change: 3 1 1 1 1 1 1 2 2 1
  Solr:                  13 secs -> 769 docs/sec; CPU: 28.85%; Heap: 9.39G; time(secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2
  Solr (ingest + query): 14 secs -> 714 docs/sec; CPU: 37.02%; Heap: 10G; ingestion speed change: 2 2 1 1 1 1 2 2 1 1

Scenario 1: 10k
  ES:                    31 secs -> 322.6 docs/sec; CPU: 39.29%; iowait: 0.01%; Heap: 4.76G; index size: 396K; ingestion speed change: 12 1 2 1 1 1 2 1 4 2
  ES (ingest + query):   35 secs -> 285 docs/sec; CPU: 42.46%; iowait: 0.01%; Heap: 5.14G; index size: 336K; ingestion speed change: 13 2 1 1 2 1 1 4 1 2
  Solr:                  12 secs -> 833 docs/sec; CPU: 28.62%; Heap: 9.88G; time(secs) per 1k docs: 1 1 1 1 2 1 1 1 1 2
  Solr (ingest + query): 16 secs -> 625 docs/sec; CPU: 34.07%; Heap: 10G; ingestion speed change: 2 2 1 1 1 1 2 2 2 2


Here are several sample queries for Solr:
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=field282_ss:f*'
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=field989_dt:\[2012-3-06T01%3A15%3A51Z%20TO%20NOW\]'
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=field363_i:\[0%20TO%20177\]'

filters:
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=*&fq=field118_i:\[0%20TO%2029\]'
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=*&fq=field91_dt:\[2012-1-06T01%3A15%3A51Z%20TO%20NOW\]'
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=*&fq=field879_ss:f*'

Maco
