Building with Hadoop - best setup?

timrobertson100
Hey,

Is anyone building their indexes using Hadoop?  If so, are they deploying ES across the same cluster as Hadoop and trying to reduce network noise by making use of data locality, or keeping the clusters separate and just calling over HTTP from MapReduce when building the indexes?  I am about to set up on EC2, and planned to keep the search and processing machines separate.

Cheers,
Tim

Re: Building with Hadoop - best setup?

kimchy
Administrator
Hi,

  You won't benefit from locality between elasticsearch and Hadoop in any case, since the two use different distribution models. Locality would only make sense for the indexing part, and I think you probably won't really need it (it should be fast enough).

  What language are you going to write your jobs in? If Java, then use the native Java client (obtained from a started "non data" Server) rather than HTTP. More here: http://www.elasticsearch.com/docs/elasticsearch/java_api/client/#Server_Client

-shay.banon


Re: Building with Hadoop - best setup?

timrobertson100
Thanks Shay,

So... reading between the lines, does it then use protobufs (or something else?) for RPC, instead of serializing and deserializing JSON?

Cheers
Tim




Re: Building with Hadoop - best setup?

kimchy
Administrator
With the Java client, the "source" (which is the indexable document) is still JSON, but everything around it uses a highly optimized stream serialization/deserialization. It is internal and does not use protobuf, but I expect it to be at least 10x faster than protobuf.
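
To make the "JSON source inside a binary stream" idea concrete, here is a minimal sketch using only the JDK. It is not elasticsearch's actual wire format or client API: the class name IndexRequestSketch, its fields, and the field order are all assumptions made up for illustration. It only shows the general shape, with the metadata written as compact binary fields and the JSON document carried along as opaque bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical request envelope; NOT elasticsearch's real format.
class IndexRequestSketch {
    final String index;
    final String type;
    final String id;
    final byte[] source; // the JSON document, kept as raw UTF-8 bytes

    IndexRequestSketch(String index, String type, String id, String jsonSource) {
        this.index = index;
        this.type = type;
        this.id = id;
        this.source = jsonSource.getBytes(StandardCharsets.UTF_8);
    }

    // Write the metadata as binary fields, then the JSON source length-prefixed.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(index);
            out.writeUTF(type);
            out.writeUTF(id);
            out.writeInt(source.length);
            out.write(source);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the fields back in the same order they were written.
    static IndexRequestSketch fromBytes(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            String index = in.readUTF();
            String type = in.readUTF();
            String id = in.readUTF();
            byte[] source = new byte[in.readInt()];
            in.readFully(source);
            return new IndexRequestSketch(index, type, id,
                    new String(source, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String sourceAsString() {
        return new String(source, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        IndexRequestSketch request = new IndexRequestSketch(
                "myindex", "mytype", "1", "{\"field\":\"value\"}");
        IndexRequestSketch copy = fromBytes(request.toBytes());
        System.out.println(copy.index + "/" + copy.type + "/" + copy.id
                + " -> " + copy.sourceAsString());
    }
}
```

Note that the JSON document is never parsed on this path; only the small envelope around it is encoded and decoded.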

-shay.banon


Re: Building with Hadoop - best setup?

kimchy
Administrator
Just one note here, though: what takes most of the time in these cases is the remote call itself, not the serialization. But when it comes to pure serialization, it's highly optimized.
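
If the remote call is the dominant cost, one common way to act on that from a MapReduce task is to buffer documents and send several per round trip. The sketch below illustrates that general technique only; it is not something proposed in this thread, and RemoteSink and BatchingIndexer are hypothetical names, not part of the elasticsearch client. Whether the client offers a call that accepts several documents at once is an assumption you would need to check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for whatever performs one remote round trip.
interface RemoteSink {
    void send(List<String> jsonDocs);
}

// Buffers documents and hands them to the sink in groups of batchSize.
class BatchingIndexer {
    private final RemoteSink sink;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int remoteCalls = 0;

    BatchingIndexer(RemoteSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    // Queue one document; flush automatically once the buffer is full.
    void add(String jsonDoc) {
        buffer.add(jsonDoc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered as a single remote call (no-op if empty).
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.send(new ArrayList<>(buffer));
        remoteCalls++;
        buffer.clear();
    }

    int remoteCalls() {
        return remoteCalls;
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        BatchingIndexer indexer =
                new BatchingIndexer(docs -> sizes.add(docs.size()), 4);
        for (int i = 0; i < 10; i++) {
            indexer.add("{\"id\":" + i + "}");
        }
        indexer.flush(); // send the final partial batch
        System.out.println(indexer.remoteCalls() + " remote calls, batch sizes " + sizes);
    }
}
```

With a batch size of four, ten documents go out in three calls instead of ten. In a Hadoop job the final flush() would typically belong in the task's cleanup step so the last partial batch is not lost.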

-shay.banon
