How persistence works in ElasticSearch


Berkay Mollamustafaoglu

I could use some help verifying my understanding of how persistence works. Here is my understanding:

Regardless of whether the index is stored in memory or on the file system, it is considered temporary and is removed when the node is stopped; hence, if all the nodes in the cluster stop, the indices would be lost.

As such, a write-behind gateway needs to be used for persistence. The gateway keeps a transaction log and (periodically?) snapshots the indices. If all the nodes in the cluster were stopped and restarted, the index snapshots and transaction logs stored by the gateway would be used to recreate the node indices.
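For example, I would expect enabling this to look roughly like the sketch below (the "gateway.type" / "gateway.fs.location" keys and the embedded-node API are my guesses, not verified):

    // Sketch only: start a node backed by a shared-filesystem gateway so
    // indices survive a full-cluster restart. Setting keys are assumptions.
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.node.Node;
    import org.elasticsearch.node.NodeBuilder;

    public class FsGatewaySketch {
        public static void main(String[] args) {
            Node node = NodeBuilder.nodeBuilder()
                    .settings(ImmutableSettings.settingsBuilder()
                            .put("gateway.type", "fs")                 // write-behind fs gateway
                            .put("gateway.fs.location", "/mnt/es-gw")) // hypothetical shared path
                    .node();
            // Index documents as usual; the gateway asynchronously mirrors the
            // Lucene index plus a transaction log of the deltas, and replays
            // both to rebuild each shard after a full restart.
            node.close();
        }
    }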

Is this right? 



Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype


On Fri, Mar 26, 2010 at 1:10 PM, Tim Robertson <[hidden email]> wrote:
Thanks Shay,

So... reading between the lines, does it then use protobufs (or something similar) for RPC, instead of serializing and deserializing JSON?

Cheers
Tim



On Fri, Mar 26, 2010 at 3:58 PM, Shay Banon <[hidden email]> wrote:
Hi,

  You won't enjoy locality between elasticsearch and hadoop in any case, since the two use different distribution models. Locality would only make sense for the indexing part, and I think you probably won't really need it (it should be fast enough).

  What language are you going to write your jobs in? If Java, then make use of the native Java client (obtained from a "non data" Server that you start) rather than HTTP. More here: http://www.elasticsearch.com/docs/elasticsearch/java_api/client/#Server_Client
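  Roughly, the idea looks like the sketch below (written against the NodeBuilder-style API; the docs linked above use the earlier Server/ServerBuilder classes, but the shape is the same):

    // Rough sketch: start an embedded "non data" node and use its native client.
    import org.elasticsearch.client.Client;
    import org.elasticsearch.node.Node;
    import org.elasticsearch.node.NodeBuilder;

    public class NonDataClientSketch {
        public static void main(String[] args) {
            // data(false): the node joins the cluster and routes requests over
            // the internal transport, but holds no shards itself.
            Node node = NodeBuilder.nodeBuilder().data(false).node();
            Client client = node.client();

            client.prepareIndex("twitter", "tweet", "1") // hypothetical index/type/id
                    .setSource("{\"user\":\"tim\"}")
                    .execute().actionGet();

            node.close();
        }
    }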

-shay.banon


On Fri, Mar 26, 2010 at 5:35 PM, Tim Robertson <[hidden email]> wrote:
Hey,

Is anyone building their indexes using Hadoop?  If so, are they deploying ES across the same cluster as Hadoop and trying to reduce network noise by making use of data locality, or keeping the clusters separate and just calling over HTTP from MapReduce when building the indexes?  I am about to set up on EC2, and planned to keep the search and processing machines separate.

Cheers,
Tim



Re: How persistence works in ElasticSearch

kimchy (Administrator)
Yes, that's basically how it works. Regarding the transaction log: the gateway is responsible for mirroring the current shard Lucene index plus the delta transaction log. When a commit occurs (either through an API call or automatically by elasticsearch), the transaction log is flushed.

The benefit of this architecture is that the indexable state of the cluster can be written in an async manner, and the actual storage of the index is irrelevant for long-term persistency. This means you can still store the index in memory (or just parts of it, with the upcoming cacheable FS storage) and not lose it on failure.
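Concretely, both points can be exercised from the Java API. A sketch (the index name and the "index.store.type" value are illustrative assumptions, not verified against this release):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.node.Node;
    import org.elasticsearch.node.NodeBuilder;

    public class FlushSketch {
        public static void main(String[] args) {
            Node node = NodeBuilder.nodeBuilder().data(false).node();
            Client client = node.client();

            // An index whose shards live in memory; the gateway still holds
            // the durable copy, so a full-cluster restart can rebuild it.
            client.admin().indices().prepareCreate("volatile-idx") // hypothetical name
                    .setSettings(ImmutableSettings.settingsBuilder()
                            .put("index.store.type", "memory"))
                    .execute().actionGet();

            // Explicitly trigger the commit described above: Lucene segments
            // are snapshotted to the gateway and the shard transaction log is
            // cleared.
            client.admin().indices().prepareFlush("volatile-idx")
                    .execute().actionGet();

            node.close();
        }
    }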

-shay.banon
