Persistency

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Persistency

aemadrid
I'm very interested in ES and I'm wondering how you can do persistency
and how it impacts performance. Can you elaborate a little bit more on
that?
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Persistency

kimchy
Administrator
Sure. Basically, persistency is done in a "write behind manner" and it called gateway in elasticsearch. The cluster meta data and including indices that you choose to are written to a long term persistent storage asynchronously in the background.

The cluster meta data include all the indices (and their settings) that have been created in the cluster (not including the indices content). This is written to the gateway every time something changes (like creating a new index). Each index in turn can also be persistent to a gateway. The index content itself is persistent to the gateway, along with its transaction log.

In terms of high availability, if you create an index with 1 replica per shard, then you have "real time" high availability, which means that if a node fails, there is a replica for it. Of course, you can always create more than one replica per shard. The replication to the replicas is done in parallel. The nice thing about elasticsearch is that reads go to one of the replica shards, which means that you can scale reads/search with more replicas.

Each shard replication group has a primary shard, which is responsible for persisting the index and the transaction log into the long term gateway storage. This is done in a scheduled manner (you can control it), and there is even an API to force it (the gateway snapshot API).

The snapshotting of the index into the persistent storage is done in the background, so there is no impact on performance of actual operations done against the index.

The gateway module itself (both the cluster one, and the index one) are completely pluggable. The current implementation is a file system based one. There are more to come including cloud based ones to persist to Amazon S3 for example.

This solution is my preferred solution for handling long term persistency of of a cluster since it means that node storage is completely temporal. This in turn means that you can store the index in memory for example, get the performance benefits that comes with it, without scarifying long term persistency.

Some docs links:


-shay.banon

On Sat, Feb 13, 2010 at 11:03 PM, aemadrid <[hidden email]> wrote:
I'm very interested in ES and I'm wondering how you can do persistency
and how it impacts performance. Can you elaborate a little bit more on
that?

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Persistency

aemadrid
Thanks so much for the explanation. It makes sense to me although I'm getting a little bit lost on the formulas for shards/replicas. Something tells me that there would be a couple of scenarios when it comes to HA. Some people )like me) are after HA in the smallest sense possible: 2 servers to keep everything running in case one fails and everything to disk in case both fail. Some other people I bet are after the 3+  servers (10+, 100+). Could you write/video how you would go about setting something like those scenarios 2 up? One of the selling points for me is how servers find each other (only localhost or local network too?) and remaster each other on failures. It seems that setting up persistence is a little more involved.

Thanks in advance,


Adrian Madrid
My eBiz, Senior Developer
3082 W. Maple Loop Dr
Lehi, UT 84043
801-341-3824



On Sat, Feb 13, 2010 at 14:57, Shay Banon <[hidden email]> wrote:
Sure. Basically, persistency is done in a "write behind manner" and it called gateway in elasticsearch. The cluster meta data and including indices that you choose to are written to a long term persistent storage asynchronously in the background.

The cluster meta data include all the indices (and their settings) that have been created in the cluster (not including the indices content). This is written to the gateway every time something changes (like creating a new index). Each index in turn can also be persistent to a gateway. The index content itself is persistent to the gateway, along with its transaction log.

In terms of high availability, if you create an index with 1 replica per shard, then you have "real time" high availability, which means that if a node fails, there is a replica for it. Of course, you can always create more than one replica per shard. The replication to the replicas is done in parallel. The nice thing about elasticsearch is that reads go to one of the replica shards, which means that you can scale reads/search with more replicas.

Each shard replication group has a primary shard, which is responsible for persisting the index and the transaction log into the long term gateway storage. This is done in a scheduled manner (you can control it), and there is even an API to force it (the gateway snapshot API).

The snapshotting of the index into the persistent storage is done in the background, so there is no impact on performance of actual operations done against the index.

The gateway module itself (both the cluster one, and the index one) are completely pluggable. The current implementation is a file system based one. There are more to come including cloud based ones to persist to Amazon S3 for example.

This solution is my preferred solution for handling long term persistency of of a cluster since it means that node storage is completely temporal. This in turn means that you can store the index in memory for example, get the performance benefits that comes with it, without scarifying long term persistency.

Some docs links:


-shay.banon

On Sat, Feb 13, 2010 at 11:03 PM, aemadrid <[hidden email]> wrote:
I'm very interested in ES and I'm wondering how you can do persistency
and how it impacts performance. Can you elaborate a little bit more on
that?


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Persistency

kimchy
Administrator
The discovery module uses jgroups, which in turn can use multicast/unicast. I will document it later today/tomorrow more properly. It works across machines.

In case of simple, two servers, I suggest you deploy either 2 shards, each with 1 replica, or 4 shards, each with one replica. In this case, you will have either 2 shard on each machine , or 4 shards on each machine. The reason for the extra shards is the ability to grow in the future and for added concurrency.

If things are a bit more difficult, just leave everything as is (it defaults to 5 shards, each with one replica), and change the gateway configuration in the elasticsearch.yml fie to:

gateway:
    type : fs
    fs : 
          location : /shared/fs/location

-shay.banon

On Sun, Feb 14, 2010 at 5:55 AM, Adrian Madrid <[hidden email]> wrote:
Thanks so much for the explanation. It makes sense to me although I'm getting a little bit lost on the formulas for shards/replicas. Something tells me that there would be a couple of scenarios when it comes to HA. Some people )like me) are after HA in the smallest sense possible: 2 servers to keep everything running in case one fails and everything to disk in case both fail. Some other people I bet are after the 3+  servers (10+, 100+). Could you write/video how you would go about setting something like those scenarios 2 up? One of the selling points for me is how servers find each other (only localhost or local network too?) and remaster each other on failures. It seems that setting up persistence is a little more involved.

Thanks in advance,


Adrian Madrid
My eBiz, Senior Developer
3082 W. Maple Loop Dr
Lehi, UT 84043
801-341-3824




On Sat, Feb 13, 2010 at 14:57, Shay Banon <[hidden email]> wrote:
Sure. Basically, persistency is done in a "write behind manner" and it called gateway in elasticsearch. The cluster meta data and including indices that you choose to are written to a long term persistent storage asynchronously in the background.

The cluster meta data include all the indices (and their settings) that have been created in the cluster (not including the indices content). This is written to the gateway every time something changes (like creating a new index). Each index in turn can also be persistent to a gateway. The index content itself is persistent to the gateway, along with its transaction log.

In terms of high availability, if you create an index with 1 replica per shard, then you have "real time" high availability, which means that if a node fails, there is a replica for it. Of course, you can always create more than one replica per shard. The replication to the replicas is done in parallel. The nice thing about elasticsearch is that reads go to one of the replica shards, which means that you can scale reads/search with more replicas.

Each shard replication group has a primary shard, which is responsible for persisting the index and the transaction log into the long term gateway storage. This is done in a scheduled manner (you can control it), and there is even an API to force it (the gateway snapshot API).

The snapshotting of the index into the persistent storage is done in the background, so there is no impact on performance of actual operations done against the index.

The gateway module itself (both the cluster one, and the index one) are completely pluggable. The current implementation is a file system based one. There are more to come including cloud based ones to persist to Amazon S3 for example.

This solution is my preferred solution for handling long term persistency of of a cluster since it means that node storage is completely temporal. This in turn means that you can store the index in memory for example, get the performance benefits that comes with it, without scarifying long term persistency.

Some docs links:


-shay.banon

On Sat, Feb 13, 2010 at 11:03 PM, aemadrid <[hidden email]> wrote:
I'm very interested in ES and I'm wondering how you can do persistency
and how it impacts performance. Can you elaborate a little bit more on
that?



Loading...