Looking for a best practice to get all data according to some filters


Ron Sher
Hi,

I was wondering about best practices for getting all data that matches some filters.
The options as I see them are:
  • Use a very big size that will return all accounts, i.e. use some value like 1M to make sure I get everything back (even if I only need a few tens or hundreds of documents). This is the quickest way, development-wise.
  • Use paging, with size and from. This requires looping over the results, and performance gets worse as we advance to later pages. We also need to use preference if we want consistent results across pages, and it's not clear what the recommended size for each page is.
  • Use scan/scroll. This gives consistent paging but also has several drawbacks: with search_type=scan the results can't be sorted; scan/scroll is (maybe) less performant than paging (the documentation says it's not for realtime use); and again it's not clear which size is recommended.
So you see: many options, and it's not clear which path to take.

What do you think?

Thanks,
Ron
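
A rough sketch of why from/size paging slows down on later pages: for one request, each shard has to produce up to from + size sorted hits, and the coordinating node merges the hits from all shards before discarding everything outside the requested page. The function name and the 5-shard example below are illustrative assumptions, and the numbers are upper bounds (a shard with fewer matches returns fewer hits).

```python
def deep_paging_cost(num_shards: int, from_: int, size: int) -> dict:
    """Upper bound on sorted entries handled for a single from/size request."""
    per_shard = from_ + size                  # each shard builds its top from+size hits
    at_coordinator = num_shards * per_shard   # the coordinating node merges all of them
    discarded = at_coordinator - size         # only `size` hits are actually returned
    return {
        "per_shard": per_shard,
        "at_coordinator": at_coordinator,
        "discarded": discarded,
    }


if __name__ == "__main__":
    # Hypothetical index with 5 shards, pages of 100 hits.
    for page in (1, 100, 10_000):
        from_ = (page - 1) * 100
        print(page, deep_paging_cost(num_shards=5, from_=from_, size=100))
```

The same arithmetic applies to a single request with a very large size: the work grows with from + size, not with the number of documents that actually match.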

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/764a37c5-1fec-48c4-9c66-7835d8141713%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: Looking for a best practice to get all data according to some filters

dadoonet
Scan/scroll is the best option to extract a huge amount of data.
Never use size:10000000 or from:10000000. 

It's not realtime because you basically scroll over a given set of segments, and changes that arrive in new segments won't be taken into account during the scroll.
Which is good, because you won't get inconsistent results.

About size, I would try and test. It depends on the size of your docs, I believe.
Try with 10000 and see how it goes when you increase it. You may discover that getting 10*10000 docs takes the same time as 1*100000. :)

Best

David
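
The scroll pattern described above can be sketched as a small loop: open the scroll once, then keep asking for the next batch until an empty batch comes back. This is only a sketch under stated assumptions: `scroll_all`, `start` and `next_batch` are hypothetical names, not the API of any Elasticsearch client, and `FakeScroll` is an in-memory stand-in that freezes its view when the scroll starts, which mimics the "changes in new segments are not seen" behaviour.

```python
from typing import Callable, Iterator, List, Tuple

Batch = Tuple[List[dict], str]  # (hits, scroll id to use for the next request)


def scroll_all(start: Callable[[], str],
               next_batch: Callable[[str], Batch]) -> Iterator[dict]:
    """Yield every hit of a scroll, one batch at a time."""
    scroll_id = start()                      # initial request: returns only a scroll id
    while True:
        hits, scroll_id = next_batch(scroll_id)
        if not hits:                         # an empty batch marks the end of the scroll
            return
        yield from hits


class FakeScroll:
    """In-memory stand-in for a scroll endpoint (assumption, for illustration only)."""

    def __init__(self, docs: List[dict], batch_size: int) -> None:
        self.docs = docs
        self.batch_size = batch_size
        self.requests = 0
        self._snapshot: List[dict] = []

    def start(self) -> str:
        self._snapshot = list(self.docs)     # freeze the view, like a fixed set of segments
        return "0"

    def next_batch(self, scroll_id: str) -> Batch:
        self.requests += 1
        begin = int(scroll_id)
        end = begin + self.batch_size
        return self._snapshot[begin:end], str(end)


if __name__ == "__main__":
    fake = FakeScroll([{"_id": str(i)} for i in range(25)], batch_size=10)
    print(sum(1 for _ in scroll_all(fake.start, fake.next_batch)), fake.requests)
```

Swapping `FakeScroll` for real calls and timing the loop with a few different batch sizes is one way to run the "try and test" experiment suggested above.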

Re: Looking for a best practice to get all data according to some filters

Ron Sher
So you're saying there's no impact on Elasticsearch if I issue a large size?
If that's the case, then why shouldn't I just use a size of 1M if I want to make sure I get everything?

Re: Looking for a best practice to get all data according to some filters

dadoonet
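No, I did not say that. Or I did not mean that. Sorry if it was unclear.
I said: don't use large sizes:

Never use size:10000000 or from:10000000.

You should read this: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan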

-- 
David Pilato | Technical Advocate | Elasticsearch.com



Re: Looking for a best practice to get all data according to some filters

Ron Sher
Just tested this.
Fetching all of my documents matching some criteria (4926 in the result) took:
13.951s when using a size of 1M
43.6s when using scan/scroll (with a size of 100)

Looks like I should be using the approach that is not recommended.
Can I make the scroll faster?

Thanks,
Ron
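
One factor worth checking in a comparison like this is the number of round trips. A minimal sketch, assuming each scroll request returns at most `batch_size` hits in total and that one extra empty batch is needed to detect the end (the function name is hypothetical):

```python
import math


def scroll_round_trips(total_hits: int, batch_size: int) -> int:
    """Scroll requests needed to read total_hits, plus one final empty batch."""
    return math.ceil(total_hits / batch_size) + 1


if __name__ == "__main__":
    # 4926 is the result count reported in the test above.
    for batch_size in (100, 1000, 10000):
        print(batch_size, scroll_round_trips(4926, batch_size))
```

Per the 1.x scroll documentation, the size in scan mode is applied per shard, so the real number of batches on a multi-shard index may be lower than this estimate. Either way, a larger batch size means fewer requests, which is the experiment suggested earlier in the thread.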

Re: Looking for a best practice to get all data according to some filters

Dani Castro
Hi,
I am facing the same situation:
We would like to get all the ids of the documents matching certain criteria. In the worst case (which is the one I am describing here), around 200K documents match the criteria, and in our first tests this is really slow (around 15 seconds). However, if we run the same query just to count the documents, ES replies in just 10-15ms, which is amazing.
I suspect that the problem is in the transport layer and the latency generated by transferring a big JSON result.

Would you recommend, in a situation like this, using another transport layer like Thrift, or a custom solution?

Thanks in advance
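
Since only the ids are needed here, one thing that may be worth trying before changing the transport is to stop returning the document source, which shrinks the JSON payload. The sketch below builds such a request body and pulls the ids out of a response shaped like a standard search response. The helper names are hypothetical, and it assumes a version that supports `"_source": false` in the request body (source filtering, available from 1.0 as far as I know).

```python
from typing import Any, Dict, List


def ids_only_body(query: Dict[str, Any], size: int) -> Dict[str, Any]:
    """Search body that asks for hit metadata only, without _source."""
    return {"query": query, "_source": False, "size": size}


def extract_ids(response: Dict[str, Any]) -> List[str]:
    """Collect the _id of every hit in a search (or scroll) response."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


if __name__ == "__main__":
    print(ids_only_body({"match_all": {}}, size=1000))
    # Hand-written sample response, trimmed to the fields used here.
    sample = {"hits": {"total": 2, "hits": [{"_id": "a"}, {"_id": "b"}]}}
    print(extract_ids(sample))
```

Whether this closes the gap to the count query depends on how much of the 15 seconds is payload transfer rather than fetching the hits, so it is worth measuring.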

Re: Looking for a best practice to get all data according to some filters

Ron Sher
In reply to this post by dadoonet
Again, why not use a very large size? What are the implications of using a very large size?
Regarding performance: a single request with a very large size seems to perform better than scan/scroll (with a size of 100, using 32 shards).

On Wednesday, December 10, 2014 10:53:50 PM UTC+2, David Pilato wrote:
No I did not say that. Or I did not mean that. Sorry if it was unclear.
I said: don’t use large sizes:

Never use size:10000000 or from:10000000. 

You should read this: <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2Fsearch-request-scroll.html%23scroll-scan\46sa\75D\46sntz\0751\46usg\75AFQjCNE-Hr1D5-J9jvGPFEaLKs-HC7Je8g';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2Fsearch-request-scroll.html%23scroll-scan\46sa\75D\46sntz\0751\46usg\75AFQjCNE-Hr1D5-J9jvGPFEaLKs-HC7Je8g';return true;">http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan

-- 
David Pilato | Technical Advocate | <a href="http://Elasticsearch.com" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2FElasticsearch.com\46sa\75D\46sntz\0751\46usg\75AFQjCNGt3jnDFhlOggktrjOnwqFMpvxxzQ';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2FElasticsearch.com\46sa\75D\46sntz\0751\46usg\75AFQjCNGt3jnDFhlOggktrjOnwqFMpvxxzQ';return true;">Elasticsearch.com
<a href="https://twitter.com/dadoonet" style="color:rgb(17,85,204)" target="_blank" onmousedown="this.href='https://www.google.com/url?q\75https%3A%2F%2Ftwitter.com%2Fdadoonet\46sa\75D\46sntz\0751\46usg\75AFQjCNE-DMC3YEu3X_lhRIhUzuSZGsaSqA';return true;" onclick="this.href='https://www.google.com/url?q\75https%3A%2F%2Ftwitter.com%2Fdadoonet\46sa\75D\46sntz\0751\46usg\75AFQjCNE-DMC3YEu3X_lhRIhUzuSZGsaSqA';return true;">@dadoonet | <a href="https://twitter.com/elasticsearchfr" style="color:rgb(17,85,204)" target="_blank" onmousedown="this.href='https://www.google.com/url?q\75https%3A%2F%2Ftwitter.com%2Felasticsearchfr\46sa\75D\46sntz\0751\46usg\75AFQjCNGfXdQ98RWFMJXdiqpKnZb5GMg0zA';return true;" onclick="this.href='https://www.google.com/url?q\75https%3A%2F%2Ftwitter.com%2Felasticsearchfr\46sa\75D\46sntz\0751\46usg\75AFQjCNGfXdQ98RWFMJXdiqpKnZb5GMg0zA';return true;">@elasticsearchfr | <a href="https://twitter.com/scrutmydocs" target="_blank" onmousedown="this.href='https://www.google.com/url?q\75https%3A%2F%2Ftwitter.com%2Fscrutmydocs\46sa\75D\46sntz\0751\46usg\75AFQjCNGQHZ4bKdE7mbdrGbZXOxhmD7c8Fw';return true;" onclick="this.href='https://www.google.com/url?q\75https%3A%2F%2Ftwitter.com%2Fscrutmydocs\46sa\75D\46sntz\0751\46usg\75AFQjCNGQHZ4bKdE7mbdrGbZXOxhmD7c8Fw';return true;">@scrutmydocs



On 10 Dec 2014, at 21:16, Ron Sher <ron....@...> wrote:

So you're saying there's no impact on elasticsearch if I issue a large size? 
If that's the case then why shouldn't I just call size of 1M if I want to make sure I get everything?

On Wednesday, December 10, 2014 8:22:47 PM UTC+2, David Pilato wrote:
Scan/scroll is the best option to extract a huge amount of data.
Never use size:10000000 or from:10000000. 

It's not realtime because you basically scroll over a given set of segments and all new changes that will come in new segments won't be taken into account during the scroll.
Which is good because you won't get inconsistent results.

About size, I'd try and test. It depends on your docs' size, I believe.
Try with 10000 and see how it goes when you increase it. You may discover that getting 10*10000 docs is the same as 1*100000. :)

Best

David


Re: Looking for a best practice to get all data according to some filters

dadoonet
The implication is the memory that has to be allocated on each shard.
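To put a rough number on that, here is a back-of-the-envelope sketch. It assumes that each shard sets aside room for `from + size` candidate hits and that the coordinating node may then have to merge one such list per shard; that is my reading of the discussion, not a measured figure. The function name is made up for the example, and 32 shards is simply the shard count mentioned in this thread.

```python
def candidate_entries(shards: int, offset: int, size: int) -> dict:
    """Rough count of hit slots reserved for one search request.

    Assumption: every shard reserves offset + size slots, and the
    coordinating node merges at most one list of that length per shard.
    """
    per_shard = offset + size
    return {"per_shard": per_shard, "coordinator_max": shards * per_shard}


# A query matching about 5,000 docs, asked with an oversized size vs. a fitted one.
oversized = candidate_entries(shards=32, offset=0, size=1_000_000)
fitted = candidate_entries(shards=32, offset=0, size=5_000)

print(oversized)  # {'per_shard': 1000000, 'coordinator_max': 32000000}
print(fitted)     # {'per_shard': 5000, 'coordinator_max': 160000}
```

Under these assumptions the oversized request reserves 200 times more slots per shard for the same handful of results.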


David

On 14 Dec 2014, at 05:46, Ron Sher <[hidden email]> wrote:

Again, why not use a very large count size? What are the implications of using a very large count?
Regarding performance - it seems doing 1 request with a very large count performs better than using scan scroll (with count of 100 using 32 shards)

On Wednesday, December 10, 2014 10:53:50 PM UTC+2, David Pilato wrote:
No I did not say that. Or I did not mean that. Sorry if it was unclear.
I said: don’t use large sizes:

Never use size:10000000 or from:10000000. 

You should read this: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan

-- 
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs




Re: Looking for a best practice to get all data according to some filters

Nikolas Everett

Search consumes O(offset + size) memory and O((offset + size) * ln(offset + size)) CPU. Scan/scroll has higher overhead but is O(size) the whole time. I don't know the break-even point.
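As a rough illustration of that difference, the sketch below adds up how many entries get ranked when a result set is walked with from/size pages versus fixed-size scroll batches. It deliberately ignores the logarithmic factor and scroll's fixed per-request overhead, and the function names are invented for the example.

```python
def paged_total(total_hits: int, page_size: int) -> int:
    """Entries ranked across all from/size requests (offset + size each)."""
    pages = -(-total_hits // page_size)  # ceiling division
    return sum(page * page_size + page_size for page in range(pages))


def scrolled_total(total_hits: int, batch_size: int) -> int:
    """Entries handled across all scroll batches (batch_size each)."""
    batches = -(-total_hits // batch_size)  # ceiling division
    return batches * batch_size


print(paged_total(100_000, 1_000))     # 5050000
print(scrolled_total(100_000, 1_000))  # 100000
```

With 100 pages of 1,000 hits, offset-based paging ranks about 50 times more entries in total, and the gap widens as the number of pages grows.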

The other thing is that scroll provides a consistent snapshot. That means it consumes resources, so you shouldn't expose it to end users, but it won't miss results or return repeats the way paging with an increasing offset can.

You can certainly do large fetches with a big size, but it's less stable in general.

Finally, scan/scroll has always been pretty quick for me. I usually use a batch size in the thousands.
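For anyone who has not written a scroll loop before, the shape of it is roughly as follows. This is a minimal sketch against an in-memory stand-in, not the Elasticsearch client API: `FakeScrollBackend`, `open_scroll`, `next_batch` and `scroll_all` are all hypothetical names. It only shows the two properties discussed here: batches of a fixed size, and a snapshot that ignores writes arriving after the scroll was opened.

```python
from typing import Callable, Dict, Iterator, List


class FakeScrollBackend:
    """In-memory stand-in for a search cluster (hypothetical, for illustration)."""

    def __init__(self, docs: List[dict]) -> None:
        self.docs = docs
        self._cursors: Dict[int, dict] = {}

    def open_scroll(self, matches: Callable[[dict], bool], batch_size: int) -> int:
        # Freeze the matching docs now; later writes are invisible to this scroll.
        snapshot = [d for d in self.docs if matches(d)]
        scroll_id = len(self._cursors) + 1
        self._cursors[scroll_id] = {"hits": snapshot, "pos": 0, "size": batch_size}
        return scroll_id

    def next_batch(self, scroll_id: int) -> List[dict]:
        cur = self._cursors[scroll_id]
        batch = cur["hits"][cur["pos"]:cur["pos"] + cur["size"]]
        cur["pos"] += len(batch)
        return batch


def scroll_all(backend: FakeScrollBackend, matches: Callable[[dict], bool],
               batch_size: int = 1000) -> Iterator[dict]:
    """Yield every matching doc, one fixed-size batch at a time."""
    scroll_id = backend.open_scroll(matches, batch_size)
    while True:
        batch = backend.next_batch(scroll_id)
        if not batch:  # an empty batch means the scroll is exhausted
            return
        yield from batch


backend = FakeScrollBackend([{"id": i, "active": i % 2 == 0} for i in range(5000)])
results = []
for doc in scroll_all(backend, lambda d: d["active"], batch_size=1000):
    if not results:
        # A write that arrives mid-scroll is not part of the snapshot.
        backend.docs.append({"id": 5000, "active": True})
    results.append(doc)

print(len(results))  # 2500
```

Against a real cluster the loop has the same shape: open the scroll, keep asking for the next batch with the returned scroll id, and stop when a batch comes back empty.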

Nik

On Dec 14, 2014 4:13 AM, "David Pilato" <[hidden email]> wrote:
The implication is the memory that has to be allocated on each shard.


David


Re: Looking for a best practice to get all data according to some filters

Jonathan Foy
In reply to this post by Ron Sher
Just to reword what others have said, ES will allocate memory for [size] scores as I understand it (per shard?) regardless of the final result count. If you're getting back 4986 results from a query, it'd be faster to use "size": 4986 than "size": 1000000.

What I've done in similar situations is to issue a count first with the same filter (which is very fast), then use the result of that in the size field. It worked much better/faster than using a default large size.
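A minimal sketch of that count-then-fetch pattern, written against a tiny in-memory stand-in rather than a real client: `FakeIndex`, its `count` and `search` methods, and `fetch_all` are hypothetical names used only to show the control flow.

```python
from typing import Callable, List, Optional

Doc = dict
Filter = Callable[[Doc], bool]


class FakeIndex:
    """Tiny in-memory stand-in for an index (hypothetical, not a client API)."""

    def __init__(self, docs: List[Doc]) -> None:
        self.docs = docs
        self.last_size: Optional[int] = None  # size of the most recent search

    def count(self, matches: Filter) -> int:
        return sum(1 for d in self.docs if matches(d))

    def search(self, matches: Filter, size: int) -> List[Doc]:
        self.last_size = size
        return [d for d in self.docs if matches(d)][:size]


def fetch_all(index: FakeIndex, matches: Filter) -> List[Doc]:
    """Count first, then request exactly that many hits."""
    total = index.count(matches)
    if total == 0:
        return []
    return index.search(matches, size=total)


index = FakeIndex([{"id": i, "tag": "a" if i < 4986 else "b"} for i in range(10_000)])
hits = fetch_all(index, lambda d: d["tag"] == "a")
print(len(hits), index.last_size)  # 4986 4986
```

One caveat with this approach: documents indexed between the count and the search are not covered by the count, so a few hits can be missed. If that matters, scroll's consistent snapshot is the safer choice.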
