Networking issues + dumb ES clients = FAIL

Ævar Arnfjörð Bjarmason
This week I had a severe issue with ElasticSearch in a production
environment related to the particulars of the setup I had. I thought
I'd share it in case anyone else had a similar setup, and especially
if someone had found a good way to solve the problem.

The way I have ElasticSearch set up is:

 * I have a 3 node cluster being queried by ~200 machines
 * Each of those machines is a dumb webserver
 * Each dumb webserver pre-loads a list of available ElasticSearch
   nodes on startup before forking.
 * Each forked child lives for a fairly short time (i.e. only a few
   hundred requests), and since it's doing mixed traffic it's likely
   that it'll only do 1-2 ElasticSearch queries.

The failure that I had was that the switch in front of one of the
three ES nodes went down, so each newly forked server child had a 1/3
chance of contacting that machine for its first request and running
into its HTTP connection timeout before moving on to the next node.

Thus effectively 1 in 3 requests to ElasticSearch had to wait for
$HTTP_TIMEOUT seconds before marking that node as bad and proceeding
to the next one, and since each child lives for so few requests, the
client library's built-in safety valve of not retrying queries against
known-bad nodes effectively did nothing.
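
Back-of-the-envelope, the cost of this failure mode looks like the
following sketch (in Python; the timeout value is illustrative, the
real number is whatever $HTTP_TIMEOUT is configured to):

```python
# Expected extra latency on a fresh child's first search when 1 of 3
# nodes is unreachable: the child picks its first node from the
# preloaded list, so 1 time in 3 it burns the full connection timeout.
nodes = 3
bad_nodes = 1
http_timeout = 30.0            # illustrative $HTTP_TIMEOUT, in seconds

p_first_pick_bad = bad_nodes / nodes
expected_extra_wait = p_first_pick_bad * http_timeout
print(round(expected_extra_wait, 1))   # average seconds of added wait
```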

I'm currently pondering a few solutions to this which each have their
own pros and cons.

 1. Patch the client library to share the state of what nodes are
    good/bad between all processes on the system, e.g. using shared
    memory, some dumb file storage etc. This would be relatively easy
    and I could feed the changes back to the client library.

 2. Stick a load balancer in front of the ES boxes. I'd done this
    previously and it caused some problems due to the LB only
    understanding "connection refused" as a failure mode. I.e. it
    didn't understand that it should try again if a node was starting
    up and replying with "go away, I'm initializing".

    That could be solved by a smarter LB, or having the clients retry
    N times on the LB in case of failure, hoping that they'll get an
    OK node on the next request.

 3. Stick a long-living intermediary between the dumb machines and the
    ES servers, i.e. have search requests served by an API that uses
    the ES client library.
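
For option 2, the client-side half could look something like this
sketch (Python rather than the Perl client; `do_search` is a
hypothetical stand-in for the actual request sent via the LB):

```python
# Retry a search against the load balancer a few times, so that a
# connection-level failure on one backend gets another chance to land
# on a good node on a subsequent attempt.
def with_retries(do_search, attempts=3):
    last_err = None
    for _ in range(attempts):
        try:
            return do_search()
        except ConnectionError as err:   # only retry connection failures
            last_err = err
    raise last_err
```

The trade-off is that retries multiply load on the LB during an
outage, so `attempts` should stay small.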

Re: Networking issues + dumb ES clients = FAIL

Paul Loy
Is option 4 to use a 'real' ES client that knows about cluster state? You didn't say what your webservers are running, but if it's Java you can join the ES cluster in client mode (http://www.elasticsearch.org/guide/reference/java-api/client.html) and then you get all of ElasticSearch's clustering for free, including adding/removing nodes as they come up / go bad.

On Sat, Nov 26, 2011 at 4:20 PM, Ævar Arnfjörð Bjarmason <[hidden email]> wrote:

--
---------------------------------------------
Paul Loy
[hidden email]
http://uk.linkedin.com/in/paulloy

Re: Networking issues + dumb ES clients = FAIL

Ævar Arnfjörð Bjarmason
On Sun, Nov 27, 2011 at 01:30, Paul Loy <[hidden email]> wrote:
> is option 4 to use a 'real' ES client that knows about cluster state? You
> didn't say what your webservers were running. But if it's Java, you can join
> the ES cluster in client mode
> (http://www.elasticsearch.org/guide/reference/java-api/client.html) and then
> you get all of elastic search's clustering for free including
> adding/removing nodes when they come up / go bad.

The webservers are running Perl and using Clinton Gormley's
ElasticSearch bindings (https://metacpan.org/module/ElasticSearch).

Due to the "we're forking dumb children all the time" architecture
it's not a viable option to have the webservers join the ES cluster as
clients.

Besides, it seems like overkill to have all the webservers in the ES
cluster receiving status updates all the time just to handle the case
of a node occasionally going bad.

E.g. one simple way to implement #1 would be to have a
/tmp/elastic-search-state directory containing these files:

  servers:        a newline-separated list of servers we have
  failed-servers: servers that we know have failed
  num-requests:   how many requests have we made?

When a given child process would make a request it would do:

    echo -n . >>/tmp/elastic-search-state/num-requests

So you could get the total number of requests we've made so far on
this box with:

    du -b /tmp/elastic-search-state/num-requests

To get a list of OK servers you'd do (a server listed in both files
shows up with a count of 2 and gets filtered out):

    cat /tmp/elastic-search-state/{servers,failed-servers} | sort | uniq -c |
        grep "^ *1 " | awk '{print $2}'

And if any child process exceeded the max_requests (as retrieved with
"du -b num-requests") it would clobber the "servers" file and
"failed-servers" if needed.

You could check with stat(2) whether you needed to re-read the files,
and you wouldn't have to deal with locking the files for update since
two children redundantly both getting the server list would be just
fine.
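
The scheme above could be sketched like this (in Python for brevity,
though the real client is Perl; the file names follow the post, and
the state directory is passed in rather than hard-coded to /tmp):

```python
# Shared node state via dumb files: one appended byte per request
# means the counter file's size *is* the request count, and readers
# that see slightly stale data are fine, so no locking is needed.
import os

def count_request(state_dir):
    with open(os.path.join(state_dir, "num-requests"), "ab") as fh:
        fh.write(b".")

def num_requests(state_dir):
    # equivalent of "du -b num-requests"
    return os.path.getsize(os.path.join(state_dir, "num-requests"))

def _read_set(state_dir, name):
    try:
        with open(os.path.join(state_dir, name)) as fh:
            return {line.strip() for line in fh if line.strip()}
    except FileNotFoundError:
        return set()

def ok_servers(state_dir):
    # equivalent of the cat|sort|uniq pipeline: servers minus failures
    return (_read_set(state_dir, "servers")
            - _read_set(state_dir, "failed-servers"))
```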

Re: Networking issues + dumb ES clients = FAIL

Clinton Gormley-2
In reply to this post by Paul Loy
On Sat, 2011-11-26 at 16:30 -0800, Paul Loy wrote:
> is option 4 to use a 'real' ES client that knows about cluster state?
> You didn't say what your webservers were running. But if it's Java,
> you can join the ES cluster in client mode
> (http://www.elasticsearch.org/guide/reference/java-api/client.html)
> and then you get all of elastic search's clustering for free including
> adding/removing nodes when they come up / go bad.

Hi Paul

I don't know much about how the Java client works, but from what I've
seen on the list, I was under the impression that it can take a second
or two to join the cluster. This is fine for long lived processes, but
would introduce too much latency to clients that only do 2-3 requests
before exiting.

The Perl API, by default, while it doesn't join the cluster, does try to
sniff the live nodes in the cluster by sending a nodes request to each
of a list of preconfigured nodes until it gets a successful response.

In this case, where the bad switch was causing 1 node to hang until it
timed out, this still wouldn't have been very useful.

Although, if Ævar were to use one of the available async backends
(AnyEvent::HTTP or AnyEvent::Curl) then these sniff requests would be
sent in parallel, and the first successful response would be used.
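
That parallel sniff could be sketched like this (Python threads
standing in for the AnyEvent backends; `probe` is a hypothetical
callable that does the HTTP nodes request and raises on failure):

```python
# Probe every configured node at once; the first successful reply
# wins, so one hung node no longer serializes the whole sniff.
from concurrent.futures import ThreadPoolExecutor, as_completed

def sniff_first_live(nodes, probe, timeout=5.0):
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {pool.submit(probe, node): node for node in nodes}
        for fut in as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                return futures[fut]      # first node that answered OK
    raise RuntimeError("no live nodes")
```

One caveat of the thread-pool version: on exit the pool still waits
for probes that are already running, which a true async client would
simply abandon.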

>
>  1. Patch the client library to share the state of what nodes are
>     good/bad between all processes on the system, e.g. using shared
>     memory, some dumb file storage etc. This would be relatively easy
>     and I could feed the changes back to the client library.
>
One easy possibility would be to store this data in memcached, which is
frequently already being used in web apps, is fast and won't lock.
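
A sketch of that memcached variant; a tiny dict-backed stand-in
replaces the real memcached client here so the logic runs without a
server, but the get/set-with-expiry shape is the same:

```python
# Mark a node bad with a short TTL; every process on the box then
# filters it out until the TTL lapses and the node gets tried again.
import time

class DictMemcached:
    """Stand-in for a memcached client: get/set with expiry seconds."""
    def __init__(self):
        self._data = {}
    def set(self, key, value, expire=0):
        deadline = time.monotonic() + expire if expire else None
        self._data[key] = (value, deadline)
    def get(self, key):
        hit = self._data.get(key)
        if hit is None:
            return None
        value, deadline = hit
        if deadline is not None and time.monotonic() > deadline:
            del self._data[key]
            return None
        return value

def mark_bad(mc, node, ttl=60):
    mc.set("es-bad:" + node, "1", expire=ttl)

def live_nodes(mc, nodes):
    return [n for n in nodes if mc.get("es-bad:" + n) is None]
```

The TTL doubles as the retry policy: a bad node is automatically
eligible again once the entry expires, with no cleanup step needed.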

clint


Re: Networking issues + dumb ES clients = FAIL

kimchy
Why not just try to move to long-lived "client" processes? An architecture where client processes are spawned for 2-3 requests seems problematic not just connection-wise (ElasticSearch or otherwise: db, memcached), but also resource-wise.

On Mon, Nov 28, 2011 at 11:45 AM, Clinton Gormley <[hidden email]> wrote:


Re: Networking issues + dumb ES clients = FAIL

Ævar Arnfjörð Bjarmason
On Mon, Nov 28, 2011 at 12:31, Shay Banon <[hidden email]> wrote:
> Why not just try and move to long lived "client" processes? Seems like an
> architecture where client processes are spawned for 2-3 requests is
> problematic not just connection wise (elasticsearch or other - db,
> memcached), but also resource wise.

The processes live for more than 2-3 requests in total, they live to
do around 150 requests on average. But out of those maybe only 2-3 are
search requests.

The reason it's set up like this is that fork(2) is cheap, and
depending on your design incrementally cleaning up memory or
incrementally doing garbage collection might be a lot more expensive
than just killing the process and having its memory freed by the OS,
and then replacing it with a new process forked from your
ready-to-be-used parent process.

MySQL deals with this use case well because it maintains a connection
pool, so if you reap a child and another fresh process connects MySQL
doesn't incur any noticeable overhead in setting up the connection,
unlike some other databases.

Re: Networking issues + dumb ES clients = FAIL

kimchy
Not sure I get it. Are you saying that you create a pool of connections in the parent process, and then pass a connection to the child process (sharing FD?) from the pool, and when that child process is dead, return the connection to the pool? If so, possibly you can do it with the http connection as well?

Still don't buy the fork vs. long-running processes reasoning, but this isn't the platform to have that discussion.

On Mon, Nov 28, 2011 at 7:34 PM, Ævar Arnfjörð Bjarmason <[hidden email]> wrote: