index creation from very large data set


index creation from very large data set

Colin Surprenant
Hi,

What are the options for creating a new index from an existing very
large data set? Do we need to linearly walk the data and insert each
document one by one?

Otherwise, given a distributed datastore with mapreduce support, would
it be possible to leverage such a framework to distribute the ES index
creation, by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it?

Thanks for your help,
Colin

Re: index creation from very large data set

kimchy
Administrator
Elasticsearch is a distributed search engine, so indexing is distributed as well as search. The more nodes you have, the better indexing throughput you will get (assuming you index from multiple threads / multiple clients).

One option, as you suggested, is to linearly walk the data you have and index it. Even with a single process you can, of course, parallelize the indexing (hand documents to a thread pool), and if you can parallelize the fetching of the data, do that as well.
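
As a rough sketch of that thread-pool approach (the pool size, index name, and type name here are arbitrary assumptions, and the client calls assume the Java API; exact method names may differ between versions):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.elasticsearch.client.Client;

public class ThreadPoolIndexer {
    // One thread walks the data source; worker threads issue the index calls.
    public static void indexAll(final Client client, Iterable<String> jsonDocs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // pool size is a guess; tune it
        for (final String doc : jsonDocs) {
            pool.submit(new Runnable() {
                public void run() {
                    // Index one JSON document with an auto-generated id.
                    client.prepareIndex("docs", "doc").setSource(doc).execute().actionGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}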

Doing a map/reduce over your data store is also certainly possible. Just fork jobs; each job fetches (and possibly massages) the data and indexes it into elasticsearch.

In general, the more parallelism you get in your indexing process, the better. You still use a single elasticsearch cluster with the index API, thanks to the fact that elasticsearch is distributed and highly concurrent (even on a single node).

Some notes to increase your indexing speed with elasticsearch:

1. If you know in advance that the documents you index do not already exist in the index, use the opType `create` with the index API (see the sketch below).
2. If you don't need near-real-time search within 1 second, increase the refresh interval (described here: http://www.elasticsearch.com/docs/elasticsearch/index_modules/engine/robin/#Refresh).
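
For point 1, a minimal sketch with the Java client might look like the following; the index and type names ("docs"/"doc"), the embedded client-node setup, and the exact builder method names are assumptions and may differ between versions:

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class CreateOnlyIndexer {
    public static void main(String[] args) {
        // Join the cluster as a client-only node (holds no data itself).
        Node node = nodeBuilder().client(true).node();
        Client client = node.client();

        // opType CREATE tells elasticsearch the id is new, so it can skip
        // the check for an existing document with the same id.
        client.prepareIndex("docs", "doc", "1")
              .setSource("{\"title\": \"hello\"}")
              .setOpType(IndexRequest.OpType.CREATE)
              .execute()
              .actionGet();

        node.close();
    }
}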

-shay.banon



Re: index creation from very large data set

mooky
In reply to this post by Colin Surprenant

Do you mean using Elastic Search to distribute the process of loading/transforming the data? Or just the distribution of the indexing (which ES does)?

-Nick


Re: index creation from very large data set

Colin Surprenant
I mean leveraging a mapreduce framework over a distributed datastore
to speed up the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data, and for which we might want to
create new indexes from this transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks, to distribute the process.

Colin


Re: index creation from very large data set

kimchy
Administrator
Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs elasticsearch comes with and call the index API as part of your jobs.
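
For illustration, a rough sketch of such a Hadoop map task, assuming line-oriented JSON input, auto-generated document ids, and an embedded client node; the index/type names and exact client calls are assumptions and may differ across versions:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

// Each map task writes the records it sees straight into elasticsearch,
// so indexing parallelism follows the number of concurrent map tasks.
public class EsIndexMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private Node node;
    private Client client;

    @Override
    protected void setup(Context context) {
        // Client-only node: joins the cluster but stores no data locally.
        node = nodeBuilder().client(true).node();
        client = node.client();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Assumes each input line is already a JSON document.
        client.prepareIndex("docs", "doc").setSource(value.toString()).execute().actionGet();
    }

    @Override
    protected void cleanup(Context context) {
        node.close();
    }
}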

-shay.banon



Re: index creation from very large data set

Colin Surprenant
Well, we do have a Hadoop prototype, but with Lucene and Katta. I am
currently looking into Riak.

Colin


Re: index creation from very large data set

kimchy
Administrator
Not really sure I understood then. Where do you store your data that you plan to run your map reduce jobs on? Hadoop for Lucene+Katta is not where you store your data...

-shay.banon



Re: index creation from very large data set

Colin Surprenant
In our Hadoop+Katta prototype the data is simply in HDFS, and we made
an index-creation job based on the Katta Hadoop indexing job samples.

What I am currently looking at is to use Riak (http://riak.basho.com/)
for data storage and its mapreduce framework (https://wiki.basho.com/display/RIAK/MapReduce)
to launch elasticsearch index-creation jobs. I am trying to evaluate what
would be the most efficient way to parallelize the index creation when
creating a new index over the complete data, and what would be the best
integration point between Riak mapreduce and elasticsearch.

Colin


Re: index creation from very large data set

kimchy
Administrator
I see, thanks for sharing. If the data is stored in Hadoop, I guess map/reduce jobs similar to the samples Katta provides can be used with elasticsearch. I am not too familiar with Riak, but the same logic should apply.

-shay.banon



Re: index creation from very large data set

John Lynch-2
In reply to this post by Colin Surprenant
Colin,

If it helps, here is an example of using Ruby/Typhoeus to concurrently
submit data to Riak and ES:

http://rigelgroupllc.com/wp/blog/tee-with-sinatra

When you run a MR job in Riak, you get the results back via HTTP and
would then need to POST those results to ES. If you were really
concerned about performance, you could write your MR functions in
Erlang and do the POST to ES from there, I believe, although I'm not
sure it would be significantly faster. The best way to speed up the
index creation is probably to just have more servers -- Riak and ES
both scale linearly. I'm not sure about ES, but Riak will also scale
down, so you have the option of adding more servers for a specific
job and then scaling your cluster back down when it is complete.
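
As a rough illustration of that HTTP path in plain Java (the host, index, and type names are assumptions, and in practice you would batch documents and reuse connections rather than open one per document):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PostToEs {
    public static void main(String[] args) throws Exception {
        // A POST to /index/type indexes one document with an auto-generated id.
        String doc = "{\"title\": \"result from a riak map/reduce job\"}";
        URL url = new URL("http://localhost:9200/docs/doc");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(doc.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}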


Regards,

John Lynch, CTO
Rigel Group, LLC
[hidden email]


Also, in the comments to the post, someone suggests using http://www.pypes.org/
which is a high performance tool to


Re: index creation from very large data set

egaumer
In reply to this post by Colin Surprenant

Seems like you need more map and less reduce. The reduce phase (in this case) seems like it would cause a bottleneck. Why reduce results coming from Riak partitions down to a single data stream? Since ES is distributed, the goal is to create as many parallel processes (i.e., mappers) as possible. If you execute a reduce phase, you're essentially reducing the result stream to a single process (in which case it needs to be re-parallelized).

Traditionally, you leverage some middleware component (think ETL-like) to help parallelize/partition the data stream. With these newer technologies, if you can leverage the map phase, that sounds like a more intelligent solution.

Even better, these new data stores should provide inherent search indexing as a function of the CRUD operations. Think about Apple's Spotlight: you don't have to think about indexing new documents as you save them to your filesystem. It all happens for you, behind the scenes; it's inherent to the file system stack.

Basho has mentioned adding "hooks" to allow this sort of semantics in Riak. Sergio and Shay have provided similar functionality with Terrastore and Elasticsearch. In my opinion, this is the *ideal* solution.

Regards,
-Eric



Re: index creation from very large data set

Colin Surprenant
In reply to this post by John Lynch-2
Ah! Sorry for the late reply. Thanks a lot for the info and pointer;
this is great stuff and exactly in line with what I am working on now.
See this thread on the Riak mailing list:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2010-April/000920.html

Thanks again,
Colin
