Hi,
I am trying to pull 192 million records out of an Elasticsearch index. I am only interested in the "_source" part, which I want to feed into another Spark program, so my idea is to export all 192 million records with only "_source".
I tried to fetch the data in chunks as follows:
curl -XGET 'http://localhost:80/asset1/_search?scroll=2m&filter_path=hits.hits._source' -d '{"from" : 1, "size" : 25000}' > hist_data_1-25000.txt;
curl -XGET 'http://localhost:80/asset1/_search?scroll=2m&filter_path=hits.hits._source' -d '{"from" : 25001, "size" : 25000}' > hist_data_25001-50000.txt;
curl -XGET 'http://localhost:80/asset1/_search?scroll=2m&filter_path=hits.hits._source' -d '{"from" : 50001, "size" : 25000}' > hist_data_50001-75000.txt;
curl -XGET 'http://localhost:80/asset1/_search?scroll=2m&filter_path=hits.hits._source' -d '{"from" : 75001, "size" : 25000}' > hist_data_75001-100000.txt;
I can't filter on the date column because the date value is not present in all the documents, so I had to depend on pagination by row offset. That is why I chose the approach above.
I am not sure whether I am doing something wrong here. Can someone help me with this?
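For comparison, here is a minimal sketch of how the scroll API is usually driven, as far as I understand it: the first request opens the scroll and each later request passes back the "_scroll_id" from the previous response, with no "from" offset at all. Note that filter_path=hits.hits._source in my commands strips "_scroll_id" from the response, so the sketch leaves the filter out and picks "_source" on the client side. The function names (http_post, scroll_sources, export), the page size of 5000 and the output file name are hypothetical, and the sketch assumes an Elasticsearch version whose /_search/scroll endpoint accepts a JSON body.

```python
import json
import urllib.request

ES_URL = "http://localhost:80"  # host and port taken from the curl commands above


def http_post(path, body):
    """POST a JSON body to Elasticsearch and return the decoded JSON reply."""
    req = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def scroll_sources(index, size=5000, keep_alive="2m", post=http_post):
    """Yield the _source of every document in `index`, one page at a time."""
    # The first request opens the scroll context; sorting by _doc avoids scoring work.
    page = post(f"/{index}/_search?scroll={keep_alive}",
                {"size": size, "sort": ["_doc"]})
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit["_source"]
        # Later requests send only the scroll id returned by the previous page.
        page = post("/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": page["_scroll_id"]})


def export(index, out_path, **kwargs):
    """Write one JSON document per line (hypothetical helper) and return the count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for source in scroll_sources(index, **kwargs):
            out.write(json.dumps(source) + "\n")
            count += 1
    return count
```

Calling export("asset1", "hist_data.jsonl") would then write everything to a single line-delimited JSON file, which Spark can load with spark.read.json. The `post` parameter only exists so the paging loop can be tried against canned responses without a cluster.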
Thanks,
Praveen Kumar B