Disabling _source field

Disabling _source field

Sergio Bossa
Hi Shay,

A few weeks ago we talked about the possibility of disabling the "_source"
field, in order to avoid storing a verbatim copy of the indexed
document inside the Lucene index. Have you thought about that?
What about adding such a feature in 0.6.0?
I could help with the implementation as well, if you'd like.

Thanks,
Cheers,

Sergio B,

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: Disabling _source field

kimchy
Administrator
Hi Sergio,

  I added the ability to disable the source field: http://github.com/elasticsearch/elasticsearch/issues/issue/66. But I strongly believe that most of the time you would want to keep it enabled. Let me explain why:

  When searching, you usually want to display data as part of the hits. That data can easily be extracted from the source field, which is the JSON document that was indexed (instead of picking and choosing specific fields to be stored).

  Even when elasticsearch is used with systems like Terrastore, which also store the JSON document, I believe that it makes sense to store the JSON in the elasticsearch "source" field as well. The main reason is simply performance. While you do pay in index size and indexing time, you can never fetch the source faster than when you are already on the node that stores it (colocation), not to mention that the search itself is already distributed. If all the search returns is ids, then for each hit you need to go and fetch the document from Terrastore, which increases the overhead of your search requests and of your system in general.

-shay.banon
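
To make the trade-off above concrete, here is a minimal sketch in Python. It assumes the per-type mapping form `"_source": {"enabled": false}` that later Elasticsearch releases document; the exact syntax in 0.6.0 may differ. The type name `book` and both helper functions are hypothetical, and `round_trips` assumes the external store is queried once per hit, with no batch lookup.

```python
import json


def build_mapping(type_name: str, store_source: bool) -> dict:
    """Build a mapping body that toggles the _source field for one type.

    Assumption: the {"_source": {"enabled": ...}} shape from later
    Elasticsearch documentation; not confirmed for 0.6.0.
    """
    return {type_name: {"_source": {"enabled": store_source}}}


def round_trips(hits: int, source_stored: bool) -> int:
    """Count the network requests needed to render one page of results."""
    search_request = 1
    # Without _source, every hit needs its own lookup in the external store
    # (assumption: one request per document, no batch get).
    external_fetches = 0 if source_stored else hits
    return search_request + external_fetches


print(json.dumps(build_mapping("book", store_source=False)))
print(round_trips(hits=10, source_stored=True))
print(round_trips(hits=10, source_stored=False))
```

With ten hits per page, the stored-source case needs a single request, while the ids-only case needs the search plus ten lookups.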

On Wed, Mar 17, 2010 at 11:14 AM, Sergio Bossa <[hidden email]> wrote:
> [...]

Re: Disabling _source field

Lukáš Vlček
What if I want to implement search over books, or simply documents with a very large source, and display just fragments of the source on the results page for each relevant document? Then I would welcome some flexibility in telling ES how much of the content (source) should be retrieved from the index. Maybe I am mixing this up with the highlighting feature (which is probably already on the TODO list), but I still think it would be useful to have a way to tell ES how much of the source should be returned (one option could be giving some XPath expression to trim or filter the source, or the like). Does that sound like a stupid idea?

Lukas

On Wed, Mar 17, 2010 at 8:37 PM, Shay Banon <[hidden email]> wrote:
> [...]

Re: Disabling _source field

kimchy
Administrator
No, it's not at all. I think what you ask for is mostly covered by highlighting (and, when searching, you can pass an empty array of fields, in which case the source field will not be returned). With highlighting, you will be able to get the interesting fragments of what you searched for (but you would still need to store something to be able to highlight it...).

-shay.banon
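
A hedged sketch of the request described above, in Python. The empty `fields` array comes from the message itself; the `highlight` section is an assumption borrowed from later Elasticsearch releases, since highlighting was not yet available when this was written. The function name and the field name `content` are hypothetical.

```python
import json


def build_search_body(field: str, text: str, highlight: bool = False) -> dict:
    """Build a search body that asks for hits without their source."""
    body = {
        "query": {"term": {field: text}},
        "fields": [],  # empty array: hits come back without the source
    }
    if highlight:
        # Assumption: request shape from later releases, not from 0.6.0.
        body["highlight"] = {"fields": {field: {}}}
    return body


print(json.dumps(build_search_body("content", "linux", highlight=True), indent=2))
```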

On Wed, Mar 17, 2010 at 10:28 PM, Lukáš Vlček <[hidden email]> wrote:
> [...]

Re: Disabling _source field

Sergio Bossa
In reply to this post by kimchy
On Wed, Mar 17, 2010 at 8:37 PM, Shay Banon
<[hidden email]> wrote:

> I added the ability to disable the source
> field: http://github.com/elasticsearch/elasticsearch/issues/issue/66.

You rock ;)

> But, I
> strongly believe that most of the times, you would want to enable it. Let me
> explain why:
>   When searching, you usually want to display data as part of the hits. That
> data can easily be extracted from the source field which is the json
> document that was indexed (instead of picking and choosing specific fields
> to be stored).
>    Even when elasticsearch is used with systems like Terrastore, which also
> stores the json document, I believe that it makes sense to store the json in
> elasticsearch "source" field as well. The main reason is simply performance.

I know the performance argument, but I think it's more important, when
you start to get more and more data, to have separate stores for
documents and indexes: this will help keep the Lucene index as
lightweight as possible, and give you a single access point for documents
(be it Terrastore, Cassandra or whatever).
The performance penalty caused by the extra network hits will, IMHO,
be offset by the higher throughput of having two different
distributed entities independently deployed and independently working.

Thanks for the great work!
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: Disabling _source field

kimchy
Administrator
Not sure that I agree regarding the higher throughput argument but, in any case, it's there for people to use :).

-shay.banon

On Wed, Mar 17, 2010 at 11:53 PM, Sergio Bossa <[hidden email]> wrote:
> [...]

Re: Disabling _source field

egaumer
On Wed, Mar 17, 2010 at 5:58 PM, Shay Banon <[hidden email]> wrote:
Not sure that I agree regarding the higher throughput argument, but, in any case, its there for people to use it :).

In all fairness, search (typically) returns references to actual resources (look at Google). With this in mind, Sergio's argument is a valid one. One of the common pitfalls of enterprise search projects is that folks want to use the search index to house complete documents. This has many drawbacks, especially with regard to volatile data. Of course, the tight integration between terrastore and elasticsearch invalidates some of the concerns.

With that said, I really like how elasticsearch emulates a (searchable) key/value store by returning the entire document. I think it expands the possibilities for different use cases. I've actually used it as a full-fledged data store for customer information that (until recently) was housed in a large unwieldy spreadsheet.

The ability to disable this feature means users can decide what makes the most sense for their particular use case.

Regards,
-Eric



Re: Disabling _source field

Sergio Bossa
On Wed, Mar 17, 2010 at 11:16 PM, Eric Gaumer <[hidden email]> wrote:

> Of course, the tight integration
> between terrastore and elasticsearch invalidate some of the concerns.

Thanks for sharing your thoughts, Eric.
Do you mind elaborating more on that?

Thanks again,
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: Disabling _source field

kimchy
Administrator
In reply to this post by egaumer
Couldn't agree more. In terms of usability, the aim of elasticsearch is to be the best solution out of the box, and the most configurable one when needed. It's really up to the users.

As a side note, let me explain why storing the source field might make sense for certain features. Let's say I want to expose an API that reindexes an index into a new index. If elasticsearch has the source documents, then this API can be implemented easily within elasticsearch. If the source is not there, then elasticsearch can't really provide this API, and the user would need to "refetch" the data from another data store and index it. The simplicity of the first solution is something that I really like, but the user can choose. If source is not enabled, then the API will simply bail.

There are other cases where it would be nice to have the actual content of a field (and not its analyzed form) without the user having to explicitly "store" it. But most of them are solved by the user cherry-picking which fields to store.

-shay.banon
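
The reindex idea above can be sketched with in-memory stand-ins. This is not an elasticsearch API: the index is modelled as a plain dict from document id to stored source, and `reindex`, `SourceDisabledError` and the `transform` hook are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict

Document = Dict[str, object]


class SourceDisabledError(Exception):
    """Raised when an index holds no stored source to reindex from."""


def reindex(
    old_index: Dict[str, Document],
    source_enabled: bool,
    transform: Callable[[Document], Document] = lambda doc: doc,
) -> Dict[str, Document]:
    """Copy every stored source document into a new index."""
    if not source_enabled:
        # Without the original JSON there is nothing to re-analyze,
        # so the operation bails instead of guessing.
        raise SourceDisabledError("cannot reindex: _source is not stored")
    # Copy each document so the new index does not share state with the old.
    return {doc_id: transform(dict(doc)) for doc_id, doc in old_index.items()}


old = {"1": {"title": "Terrastore"}, "2": {"title": "Lucene"}}
new = reindex(old, source_enabled=True)
print(new)
```

When the source is disabled, the caller has to re-read the documents from the external store and index them again, which is the "refetch" path mentioned above.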

On Thu, Mar 18, 2010 at 12:16 AM, Eric Gaumer <[hidden email]> wrote:
> [...]

Re: Disabling _source field

Sergio Bossa
Shay, what about implementing a pluggable API to look up the document externally?

Sergio Bossa
Sent by iPhone
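
One possible shape for such a pluggable lookup, sketched in Python. Nothing here reflects an actual elasticsearch interface: `SourceLookup`, `resolve_hits` and the in-memory `external_store` are hypothetical, and a real external store (Terrastore, Cassandra, ...) would sit behind the callable.

```python
from typing import Callable, Dict, List, Optional

Document = Dict[str, object]
# Hypothetical plug-in point: any callable that maps a document id to a
# document, or to None when the external store no longer has it.
SourceLookup = Callable[[str], Optional[Document]]


def resolve_hits(hit_ids: List[str], lookup: SourceLookup) -> List[Document]:
    """Turn the ids returned by a search into full documents."""
    documents = []
    for doc_id in hit_ids:
        doc = lookup(doc_id)
        if doc is not None:  # skip ids missing from the external store
            documents.append(doc)
    return documents


# An in-memory dict stands in for the external document store.
external_store = {
    "1": {"title": "Terrastore notes"},
    "2": {"title": "Lucene notes"},
}
found = resolve_hits(["2", "3", "1"], external_store.get)
print(found)
```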

On 17 Mar 2010, at 23:35, Shay Banon <[hidden email]> wrote:
> [...]

Re: Disabling _source field

kimchy
Administrator
That is certainly possible, just open an issue for that.

-shay.banon

On Thu, Mar 18, 2010 at 12:45 AM, Sergio Bossa <[hidden email]> wrote:
> [...]

Re: Disabling _source field

egaumer
In reply to this post by Sergio Bossa
On Wed, Mar 17, 2010 at 6:22 PM, Sergio Bossa <[hidden email]> wrote:
On Wed, Mar 17, 2010 at 11:16 PM, Eric Gaumer <[hidden email]> wrote:

> Of course, the tight integration
> between terrastore and elasticsearch invalidate some of the concerns.

Thanks for sharing your thoughts, Eric.
Do you mind elaborating more on that?

In terms of enterprise search, roughly 80% of the project time is spent on document ingest. You've got to aggregate content from disparate sources like relational databases, content management systems, mail servers, file servers, file systems, web servers, web services, etc. You're typically talking about hundreds of millions of documents ranging in all sorts of formats.

Organizations spend millions of dollars on trying to leverage search to "unify" their data architecture and it's difficult, expensive, and tends to lead to fragile one-off solutions that are a nightmare to maintain. To make matters worse, they want their enterprise development teams to be able to build applications against the search platform. In doing so they want to index complete documents to avoid having to make an additional network call out to the legacy system containing the actual resource.

The problem with this scenario is that data is (typically) quite volatile. When you rely on getting complete documents straight from a search index, you end up with tight coupling of the resource. When I do a Google search for "linux" I might get back a result pointing to kernel.org. If kernel.org makes changes to the site (i.e., the resource), my result (reference) still points to the latest version. This is a core principle of REST.

When an enterprise organization insists on building applications against fully indexed documents (i.e., the source), they suffer from synchronization problems at the presentation layer. Changes on the original data source are often not reflected in the application. When they realize this (or you make them realize it) the most common response is "real time indexing". It's very difficult to achieve this even when the search engine supports it. Why? Because you're dealing with large volumes of data that span the globe in some cases and it's all held together by these fragile ingest architectures.

The end result is lots of unhappy folks, from stakeholders to managers, to engineers, to end users.

So to elaborate on my original comment, when you can tightly integrate search as a layer of the data storage "stack", you get this relatively seamless synchronization between the resource and the references in the index. When a user updates a document, the storage system ensures the index is also updated to reflect the changes. From what I've read, this is exactly the relationship between terrastore and elasticsearch.
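
As a rough illustration of that synchronization, here is a toy write-through store in Python. It is only a sketch under simplifying assumptions: both the document store and the index are in-memory dicts, "analysis" is lowercasing plus whitespace splitting, and the class name `IndexedStore` is hypothetical. It does not describe how Terrastore actually talks to elasticsearch.

```python
from typing import Dict, Set


class IndexedStore:
    """Document store that updates its search index on every write."""

    def __init__(self) -> None:
        self.documents: Dict[str, str] = {}
        self.index: Dict[str, Set[str]] = {}

    def put(self, doc_id: str, text: str) -> None:
        # Remove the postings left behind by the previous version, if any.
        old_text = self.documents.get(doc_id)
        if old_text is not None:
            for term in set(old_text.lower().split()):
                self.index[term].discard(doc_id)
        # Store the new version and index it in the same operation.
        self.documents[doc_id] = text
        for term in set(text.lower().split()):
            self.index.setdefault(term, set()).add(doc_id)

    def search(self, term: str) -> Set[str]:
        """Return the ids of documents whose current text contains term."""
        return set(self.index.get(term.lower(), set()))


store = IndexedStore()
store.put("a", "Linux kernel")
store.put("a", "kernel news")
print(store.search("linux"), store.search("kernel"))
```

Because the write path owns both steps, a search can never return a reference to a version of the document that the store has already replaced.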

I've built search architectures for Comcast, IBM, Disney, Financial Times, Dow Jones, S&P, Associated Press, Thomson/Reuters, and Citigroup, just to name a few. This type of integration addresses a huge need and that's what really interests me most about elasticsearch (the schema-free nature and the elasticity).

The only problem (and this has nothing to do with elasticsearch) is that these legacy systems aren't going away anytime soon. We'll be dealing with poorly implemented enterprise data architectures for years to come. The bright side is that new startups can be built around these new ideas and pave the way for more intelligent data architectures.

Regards,
-Eric



Re: Disabling _source field

egaumer
In reply to this post by kimchy
On Wed, Mar 17, 2010 at 6:35 PM, Shay Banon <[hidden email]> wrote:
Couldn't agree more. In terms of usability, the aim of elasticsearch is to be the best solution out of the box, and the most configurable one when needed. Its really up to the users.

As a side note, let me explain why storing the source field might make sense in certain features. Lets say I want to expose an API that allows to reindex an index into a new index. If elasticsearch has the source documents, then this API can be implemented easily within elasticsearch. If the source is not there, then elasticsearch can't really provide this API, and the user would need to "refetch" the data from another data store, and index it. The simplicity of the first solution is something that I really like, but the user can choose. If source is not enabled, then the API will simply bail.

There are other cases where it would be nice to have the actual content of a field (and not its analyzed form) without the user having to explicitly "store" it. But most of them are solved by the user cherry picking which fields to store.

I completely agree. I think there are valid use cases on both sides. Search is so pervasive that there is no way to comprehend all possible uses. I think elasticsearch is one of the most flexible solutions I've come across. Yes there are missing features but the important thing is it's built on an intelligent core. Features will eventually be implemented, it's just a matter of time and community.

I'm guessing that the term "elastic" in elasticsearch is meant to symbolize the distributed nature of the system. I think it also symbolizes the flexibility of the system in terms of configuration and overall use. I mean honestly, I've indexed all sorts of content with elasticsearch and I've never had to edit/open a configuration file. That's impressive considering I've had to design some pretty elaborate index schemas in the past, using other products.

Regards,
-Eric



Re: Disabling _source field

egaumer
In reply to this post by kimchy
On Wed, Mar 17, 2010 at 4:36 PM, Shay Banon <[hidden email]> wrote:
No its not at all. I think what you ask for is mostly covered by highlighting (and, when searching, you can pass an empty array of fields, in such a case, the source field would not be returned). With highlighting, you will be able to get interesting fragments of what you searched for (but, you would still need to store something to be able to highlight it...).

After a long debate, Lucene and Solr are officially merging. What makes this interesting for ES is that we'll see some of the Solr features (faceting, highlighting, etc.) become part of Lucene itself. This should ease the burden of getting things like highlighting into elasticsearch.

Regards,
-Eric


Re: Disabling _source field

Lukáš Vlček
Well... I am not that familiar with Solr's internals, but my fear is that some of its feature implementations will not fit directly into the ES architecture. We'll see. But anyway, it is good that the Lucene and Solr developers are joining forces.

On Thu, Mar 18, 2010 at 3:55 AM, Eric Gaumer <[hidden email]> wrote:
> [...]

Re: Disabling _source field

kimchy
Administrator
In reply to this post by egaumer
Actually, in both cases, they can be implemented by elasticsearch without Solr. Highlighting is slowly taking form as we speak, and you already have query facets in elasticsearch :)

As for the merger, I have mixed feelings about it. If they do hold to their promise, and keep a Lucene "core" and Lucene "modules" separated from Solr, then it will be good. I think I know why the merge is happening, and sadly it probably has nothing to do with pure software...

-shay.banon

On Thu, Mar 18, 2010 at 4:55 AM, Eric Gaumer <[hidden email]> wrote:
> [...]

Re: Disabling _source field

Sergio Bossa
In reply to this post by kimchy
On Wed, Mar 17, 2010 at 11:51 PM, Shay Banon
<[hidden email]> wrote:

> That is certainly possible, just open an issue for that.

Done: http://github.com/elasticsearch/elasticsearch/issues#issue/67

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: Disabling _source field

Sergio Bossa
In reply to this post by egaumer
Really great thoughts, and I couldn't agree more: they deserve a whole
(blog) post of their own. If you decide to write one, do not
hesitate to let us know ;)

Talking about Terrastore/ElasticSearch integration, the answer is yes,
it aims to provide an integrated store/search experience: it's in
its early stages, but the basic features are there.

Thanks again for sharing,
Cheers,

Sergio B.

On Thu, Mar 18, 2010 at 12:36 AM, Eric Gaumer <[hidden email]> wrote:

> [...]

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: Disabling _source field

Clinton Gormley
In reply to this post by egaumer

>
> So to elaborate on my original comment, when you can tightly integrate
> search as a layer of the data storage "stack", you get this relatively
> seamless synchronization between the resource and the references in
> the index. When a user updates a document, the storage system ensures
> the index is also updated to reflect the changes. From what I've read,
> this is exactly the relationship between terrastore and elasticsearch.

Really interesting point!

clint



Re: Disabling _source field

Sergio Bossa
On Thu, Mar 18, 2010 at 3:20 PM, Clinton Gormley
<[hidden email]> wrote:

> Really interesting point!

Yes, it is ... so you may want to provide a nice Perl API for
Terrastore as well ;) ... okay, that was shameless, please forgive me
;)

--
Sergio Bossa
http://www.linkedin.com/in/sergiob