What is your best practice to access a cluster by a Java client?


What is your best practice to access a cluster by a Java client?

joergprante@gmail.com
Hi,

I'd like to learn how you use the Java client API for Elasticsearch and what your experiences are so far.

My scenario is a web app (*.war or similar) running on an app server (e.g. Glassfish, JBoss, etc.) that acts as a front-end (for security, query translation) to Elasticsearch. The cluster can be remote (but need not be).

I need robust access, that is, each query or indexing request must be reliably answered by a success or failure event, and of course I need fast response times.

There are at least three variants:

a) a Node in client mode

b) a TransportClient 

c) the REST API (HTTP port 9200)

Let's discuss some of the pros (+) & cons (-) from my naive view as an app developer (a minimal construction sketch for each variant follows the list):

a)  + zero-conf, out-of-the-box cluster discovery
    + automatic failover
    + fluent API interface
    - overhead of internal node joining the cluster(?)
    - missing network interface setup
    - unreadable binary protocol over port 9300

b) + almost zero-conf, configurable network interface setup
    + automatic failover
    + fluent API interface
    - slight overhead of TransportClient node when joining the cluster (finding the currently reachable nodes)
    - additional "sniff" mode for automatic node (re)discovery
    - unreadable binary protocol over port 9300

c) + readable protocol over port 9200 (HTTP)
    - no zero-conf cluster discovery, failover only with external load balancer
    - overhead of JSON/Java serialization and de-serialization
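
To make the variants concrete, here is a minimal sketch of how each client is obtained, assuming the ES 0.19.x Java API; the cluster name and host are placeholders, not values from my setup:

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class ClientVariants {
    public static void main(String[] args) {
        // a) Node in client mode: joins the cluster via discovery,
        //    holds no data, speaks the binary protocol on port 9300.
        Node node = nodeBuilder().clusterName("mycluster").client(true).node();
        Client nodeClient = node.client();

        // b) TransportClient: never joins the cluster; connects to the
        //    listed nodes over port 9300, "sniff" discovers the rest.
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "mycluster")
                .put("client.transport.sniff", true)
                .build();
        TransportClient transportClient = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));

        // c) REST is plain HTTP against port 9200 and needs no ES jar at all,
        //    e.g. curl 'http://es-host:9200/_search'.

        transportClient.close();
        node.close();
    }
}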

Right now, I have decided to go with b).

My assumption was that I could manage a TransportClient singleton as a long-lived object. I struggled with connections apparently being dropped (after a certain length of inactivity?), so subsequent client operations gave "no node available" - with no API method to refresh the connection. It's a challenge to understand how keep-alive connections can be configured with TransportClient - after the default period of 5000ms, the communication seems to time out. Closing and re-opening a TransportClient in a web app environment looks like an expensive operation, because extra threads run in the background to watch the connection, but unfortunately that is what I do - with each query, I open a new TransportClient object. This works reliably but adds to the overall turn-around of each request/response cycle, so I am afraid it will not scale.
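
To make the intended pattern concrete, here is a minimal sketch of the singleton variant I had in mind (an illustration, not my actual webapp code; ES 0.19.x API, with placeholder host and cluster name):

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public final class EsClientHolder {
    // One TransportClient for the whole webapp; it is thread-safe,
    // so all request threads can share it.
    private static final TransportClient CLIENT = new TransportClient(
            ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "mycluster")
                    .build())
            .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));

    private EsClientHolder() {}

    public static Client get() {
        return CLIENT;
    }

    // Call once on webapp shutdown, e.g. from a ServletContextListener,
    // so the client's background threads are released on undeploy.
    public static void shutdown() {
        CLIENT.close();
    }
}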

I am aware kimchy is working hard to improve the TransportClient internals, but I am curious to learn about the optimal management of the life cycle of a TransportClient object (singleton or not) and whether sharing a single TransportClient across multiple threads is recommended.

Additionally, I use the admin client via the TransportClient to issue a cluster health check command, so in case of a "red" state, querying/indexing can be interrupted at the app layer. This adds some more overhead to each access via the Java client API but is more robust, because the web app can then report a cluster availability problem to the user.
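
The check itself is roughly this (a minimal sketch; the exception is only illustrative):

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthStatus;
import org.elasticsearch.client.Client;

public class HealthGate {
    // Interrupt querying/indexing at the app layer when the cluster is "red",
    // so the webapp can report an availability problem to the user.
    public static void ensureClusterUsable(Client client) {
        ClusterHealthResponse health = client.admin().cluster()
                .prepareHealth().execute().actionGet();
        if (health.status() == ClusterHealthStatus.RED) {
            throw new IllegalStateException("cluster health is RED");
        }
    }
}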

So what are your experiences? Are my assumptions valid? Am I missing something? Is a), b) or c) preferable for a web app front-end scenario? Do you have any advice on best practice?

Best regards,

Jörg



Re: What is your best practice to access a cluster by a Java client?

dadoonet
Hi Jörg,


I have the same design. My TransportClient is created by a Spring factory.
I moved from NodeClient to TransportClient because of a nasty bug with Google Guice which causes ES not to release all threads each time you redeploy your war; you have to restart JBoss.
TransportClient uses fewer threads, so it seems to be the better choice.
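
The Spring wiring is roughly this (a minimal sketch, not our production code; assuming Spring 3.x, with placeholder host and cluster name):

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.springframework.beans.factory.DisposableBean;
import org.springframework.beans.factory.FactoryBean;
import org.springframework.beans.factory.InitializingBean;

public class EsClientFactoryBean
        implements FactoryBean<Client>, InitializingBean, DisposableBean {

    private TransportClient client;

    public void afterPropertiesSet() {
        client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "mycluster")
                .build())
                .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));
    }

    public Client getObject() {
        return client;
    }

    public Class<?> getObjectType() {
        return Client.class;
    }

    public boolean isSingleton() {
        return true;
    }

    // Closing the client on context shutdown releases its threads;
    // forgetting this is exactly what forces the JBoss restart on redeploy.
    public void destroy() {
        client.close();
    }
}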

I also like your third option, as your webapp then doesn't depend on a specific version of ES.



David


Re: What is your best practice to access a cluster by a Java client?

Eric Jain
In reply to this post by joergprante@gmail.com
On May 25, 7:26 am, Jörg Prante <[hidden email]> wrote:
> So what are your experiences? Are my assumptions valid? Am I missing
> something? Is a), b) or c) preferable for a web app front-end scenario?
> Do you have any advice on best practice?

The TransportClient doesn't support authentication, so if the cluster
isn't behind the same firewall as the frontend, only c) seems like a
viable solution.

The Java API supports a) and b), but not c), so you won't be able to
switch from a) or b) to c) without rewriting your code.
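
For what it's worth, option c) can be as simple as this (a minimal sketch; the Basic auth here assumes an authenticating reverse proxy in front of port 9200, since ES itself has no built-in authentication, and host, index, and credentials are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

public class RestSearch {
    public static void main(String[] args) throws Exception {
        // POST a match_all query to the _search endpoint on port 9200.
        URL url = new URL("http://es-host:9200/myindex/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);

        // Credentials for the proxy; ES never sees them.
        String auth = DatatypeConverter.printBase64Binary("user:secret".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        OutputStream out = conn.getOutputStream();
        out.write("{\"query\":{\"match_all\":{}}}".getBytes("UTF-8"));
        out.close();

        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        in.close();
    }
}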

Re: What is your best practice to access a cluster by a Java client?

kimchy
Administrator
In reply to this post by joergprante@gmail.com
Heya,

Your summary is correct. Java clients (both Node and Transport) should be "singletons"; that's how they are meant to be used. 0.19.4 has some improvements to how this works, and if you see timeouts, it can make sense to increase those timeout values (that's fine). I have updated the transport client page with those settings.
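
For reference, a minimal sketch of raising those values when building the client (0.19.x-era setting names; both default to 5s, which matches the 5000ms timeout described above; host and cluster name are placeholders):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class TunedClient {
    public static TransportClient build() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "mycluster")
                // How long to wait for a ping response from a node.
                .put("client.transport.ping_timeout", "30s")
                // How often to sample/ping the connected nodes.
                .put("client.transport.nodes_sampler_interval", "30s")
                .build();
        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));
    }
}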
