lost data due to out of memory error

dottom
I encountered lost data as a result of a java.lang.OutOfMemoryError
condition. I am using:

- 0.17.4
- 4g min/max heap size
- mlockall
- single node, 1 shard per index, 0 replicas
- indexed 300m documents across 6 indexes

config:
> index:
>   number_of_shards: 1
>   number_of_replicas: 0
>   refresh_interval: 20s
>
> bootstrap:
>   mlockall: true

I had been indexing about 3000 messages per second continuously for
over 80 hours. At around the 300m-document mark, I ran this query, and
it hung waiting for a response:

  curl -XGET http://localhost:9200/_search -d '{"query": {"query_string": {"query": "someword"}}}'

Here someword was contained in all 300m documents. Prior to this,
other queries had returned results fine. After the hang, queries to
the health endpoint and any other query did not return a result (they
just hung). The only evidence of a problem, other than the hanging
queries, was this in the log file:

> failed engine java.lang.OutOfMemoryError: Java heap space

A normal "kill PID" command did not initiate a graceful shudown and I
had to perform a "kill -9".
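
For reference, this is the shutdown sequence I would try next time before
resorting to kill -9 (a sketch; the endpoint is my understanding of the
0.17-era nodes shutdown API):

  # Ask the local node to shut down via the API first:
  curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
  # If the API does not respond, send SIGTERM and give the JVM time to exit:
  kill PID && sleep 30
  # kill -9 only as a last resort, since it risks leaving shards in an unclean state.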

"lsof" showed 26k files open (vs. 1300 without ES running), and ulimit
is set at 100000.
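
The checks themselves were roughly as follows (the pgrep pattern is an
assumption about how the process appears on this box):

  lsof | wc -l                                  # system-wide open-file count (~26k with ES, ~1300 without)
  lsof -p "$(pgrep -f elasticsearch)" | wc -l   # files held by the ES process itself
  ulimit -n                                     # per-process open-files limit (100000 here)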

Upon restart, with the min/max heap size raised to 8g, all of the files
in the ./index and ./translog directories had been deleted. The cluster
won't start up and emits a continuous, unending stream of errors like
this:

> [2011-08-11 01:07:58,378][WARN ][indices.cluster          ] [Unuscione, Angelo] [index0006][0] failed to start shard
> org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [index0006][0] shard allocated for local recovery (post api), should exists, but doesn't
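
A shard-level health check along these lines (assuming the health API's
level parameter is available in 0.17) should show which shards are stuck:

  curl -XGET 'http://localhost:9200/_cluster/health?level=shards&pretty=true'
  # per-shard view; [index0006][0] should show up here as unassigned/failed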

Is there a way to predict when we might encounter out-of-memory
errors? I can enforce better queries in the user interface if there
are certain queries we should not be executing.
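
What I have in mind is polling heap usage and blocking expensive queries
when it is near the limit, roughly like this (the endpoint and field names
are my assumptions about the 0.17 stats API):

  curl -s 'http://localhost:9200/_cluster/nodes/stats?pretty=true'
  # watch jvm.mem.heap_used_in_bytes against the configured heap; if it sits
  # near the maximum, a query_string query matching all 300m docs is likely to OOM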

Is there a better way to start up ES if I know it previously had an
unclean shutdown?

Eventually I expect each node to hold at least 10 billion documents
across 100 to 300 indexes. The documents are small, ranging from 200
to 800 bytes, plus an additional ~200 bytes for the fields.
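
As a back-of-envelope estimate from those averages (raw source only,
before any Lucene index overhead):

  ~500 bytes/doc source + ~200 bytes/doc fields ≈ 700 bytes/doc
  10,000,000,000 docs × 700 bytes ≈ 7 TB of raw data per node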

Re: lost data due to out of memory error

kimchy
Administrator
I think you might have hit this one: https://github.com/elasticsearch/elasticsearch/issues/1227, which I fixed yesterday. Did shards fail before you shut down the node?

Re: lost data due to out of memory error

dottom
The logs don't show anything before the OOM error; upon restart there was a "failed to start shard".

That issue looks very similar, so it's likely the cause. I will test some OOM conditions with the new version and gist any problems. The three main scenarios for which I want to ensure reliable recovery are (rough plan sketched below):

- OOM
- too many open files
- file system full

In theory, proper health monitoring would mean we don't encounter these scenarios. I know you mentioned previously that both Lucene and ES go to great lengths to recover from various failure states.
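
My rough plan for injecting those failures on a test node (env vars, paths,
and sizes are assumptions about this setup):

  # OOM: run a node with a deliberately tiny heap and fire broad query_string queries
  ES_MIN_MEM=256m ES_MAX_MEM=256m bin/elasticsearch -f
  # too many open files: lower the limit in the launching shell before starting the node
  ulimit -n 256
  # file system full: point path.data at a small loopback-mounted filesystem
  dd if=/dev/zero of=/tmp/es-small.img bs=1M count=256
  mkfs.ext3 -F /tmp/es-small.img
  sudo mkdir -p /mnt/es-data && sudo mount -o loop /tmp/es-small.img /mnt/es-data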


Re: lost data due to out of memory error

kimchy
Administrator
Yes, even with a no-replicas index (single node or several), you should not lose data in those failure cases.

Re: lost data due to out of memory error

Ben Coe
Is this problem fixed in the 0.17.5 release? If so, I'll upgrade to it
ASAP ;)

Thanks for the hard work, Shay,

-- Ben.

Re: lost data due to out of memory error

kimchy
Administrator
Yes, in 0.17.5 there is a fix for a case where data could be lost (in the single-node / no-replicas case). I've run several long-running, failure-driven tests simulating those failures and could not get to a state where data was lost.
