Adding S3 gateway on a local-gateway machine


mrflip
We have a machine set up currently with a local gateway (full config
at https://gist.github.com/f003c19dd0ce53c654cb )

gateway:
  type:                         local
index:
  gateway:
    snapshot_interval:          -1
    snapshot_on_close:          false

We're considering moving the cluster to use the S3 gateway. It's a 16-
machine cluster; when all is done it will hold about 11 indexes, 176
shards x 2 (replicas = 1), each of about 5-15GB actual-on-disk usage.
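For reference, here is a rough sketch of what I understand the s3 gateway config to look like; the bucket name and credential values are placeholders, not our actual settings:

```yaml
# hypothetical s3 gateway settings -- bucket and keys are placeholders
gateway:
  type:                         s3
  s3:
    bucket:                     my-es-gateway-bucket
cloud:
  aws:
    access_key:                 AWS_ACCESS_KEY
    secret_key:                 AWS_SECRET_KEY
```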

Can we switch the cluster over to use the s3 gateway without losing
files? I know I'll have to trigger a snapshot using, e.g.,
  curl -XPOST 'http://localhost:9200/_gateway/snapshot'
My concern is that once I update the config, I'll have to restart each
data node; will it try to initiate recovery from the (empty) s3
gateway, or can I make it adopt the local files already present and
then push them to S3 after going green?
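For what it's worth, "going green" above would be checked with the standard cluster health endpoint; the sketch below is our planned rolling-restart check, not something confirmed to be safe for this migration:

```shell
# wait until the cluster reports green before restarting the next node
curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s'
```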

Also, are there any non-obvious performance implications for pushing
that much data through s3? Will new nodes recover from their peers or
pull from s3?

thanks,
flip

Re: Adding S3 gateway on a local-gateway machine

kimchy
Administrator
Hi,

   There is no way to switch from the local gateway to the s3 gateway without reindexing the data.

   Regarding the overhead of s3, there are basically two concerns. The first is the initial recovery on full cluster startup. If you set the gateway.recover_after_xxx settings, then shards will be allocated to the nodes whose local data has the most in common with what is stored in s3, so the recovery times should be minimal.
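As a sketch, the recover_after family of settings looks roughly like this for a 16-node cluster; the values below are illustrative, not recommendations:

```yaml
# illustrative values for a 16-node cluster -- tune to taste
gateway:
  recover_after_nodes:          14
  recover_after_time:           5m
  expected_nodes:               16
```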

   The second problem with s3 is more concerning: the need to push the data to s3 in the first place. This requires network bandwidth, which is very scarce on ec2 ;), and will compete with indexing / searching network operations.

-shay.banon

On Fri, Dec 24, 2010 at 1:12 AM, mrflip <[hidden email]> wrote: