Large (stored) fields, json source and highlighting


Tomislav Poljak
Hi,
I really like all the features that stored JSON (enabled _source) provides
in both the Java API (which I use for indexing/updating) and the REST API
(which I use for searching).

I do, however, have one request for improvement regarding documents with
large text fields and highlighting in general.

When there is a requirement to index/search documents with large text
fields (such as 'content' with several MB of text, which is not unusual),
returning the whole JSON for each result in the result set can be
impossible: if each JSON document has a few MB in 'content', returning
30-50 results to a client is not realistic in an acceptable time with
typical Internet bandwidth.

But it is usually acceptable (or even required) to return only
highlighting snippets for the 30-50 matching results and to retrieve the
whole document (JSON source) only for a single document (when requested
by exact ID).

To be able to provide highlighting snippets for a large text ('content')
field, the field needs to be stored.

Now we are in a situation where, because the text field is too big (which
makes it impossible to return the JSON source for 30-50 results to a
client), we need to store it twice in the index: once as part of the
original JSON source in _source, and a second time as a 'content' field
with "store" : "yes" for highlighting. This makes the index a lot bigger.
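
To make the duplication concrete, here is a minimal sketch in Python that only builds the JSON bodies and does not talk to a cluster. The type name `doc`, the `title` field, the query text, and the exact request shapes are assumptions for illustration; only `"store" : "yes"`, `_source`, and the `content` field come from the description above.

```python
import json

# Hypothetical mapping for a type named "doc": "content" is stored
# separately so that highlighting can read it. _source stays enabled,
# so the same text also lives inside the stored JSON source.
mapping = {
    "doc": {
        "properties": {
            "title": {"type": "string"},
            "content": {"type": "string", "store": "yes"},
        }
    }
}

# Hypothetical search body: return only the title plus highlight
# snippets from "content", so the multi-megabyte text is not sent back.
search_body = {
    "query": {"query_string": {"query": "content:java"}},
    "fields": ["title"],
    "highlight": {"fields": {"content": {}}},
    "size": 30,
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

With a layout like this the 'content' text is written once inside _source and once as a stored field, which is the index growth described above.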

Also, if there is a requirement to display highlighting for all fields
(separate highlight snippets for each field where a match occurred,
without mixing snippets from different fields, so a stored _all field
cannot be used for highlighting), then the whole document (all fields)
needs to be stored twice.

In this case it seems only logical to disable the _source field (since
all fields are stored anyway) and, when the whole document needs to be
retrieved, to use the (newly added) fields=* feature (I've read a similar
discussion thread which led to this enhancement).
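
A minimal sketch of that alternative, again only building the mapping as a Python dict. The type name `doc`, the field names, and the exact mapping shape are assumptions. The helper `unstored_fields` is hypothetical and simply lists fields that could no longer be retrieved once _source is disabled.

```python
import json

# Hypothetical mapping: _source disabled, every field stored on its own.
mapping = {
    "doc": {
        "_source": {"enabled": False},
        "properties": {
            "title": {"type": "string", "store": "yes"},
            "content": {"type": "string", "store": "yes"},
        },
    }
}

def unstored_fields(type_mapping):
    """Return names of fields that are not stored individually.

    With _source disabled, the original values of these fields could
    not be returned by a fields=* style request.
    """
    props = type_mapping.get("properties", {})
    return [name for name, cfg in props.items() if cfg.get("store") != "yes"]

print(json.dumps(mapping, indent=2))
print(unstored_fields(mapping["doc"]))
```

The trade-off is that the original JSON document is no longer kept as a single unit, so it has to be reassembled from the individual stored fields.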

Here is my question/proposal:

Would it be possible to use the JSON _source field for 'field specific'
highlighting, where the matching snippet needs to be returned separately
for each field name?

Maybe by having a term_vector for each field, but somehow 'adjusting' or
recalculating positions_offsets to point to the text snippet in _source
instead of the stored field?
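
One reason the recalculation is not a simple constant shift can be sketched in Python. This is not how Lucene term vectors work internally; it only illustrates that offsets measured against the field text stop lining up with the raw JSON as soon as the value contains escaped characters. The helper `value_offset` and the sample documents are hypothetical.

```python
import json

def value_offset(raw_source, field):
    """Index in raw_source where the string value of a top-level field begins.

    Assumes the source was written by json.dumps with default separators
    and that the key text does not also occur inside another value.
    """
    marker = json.dumps(field) + ': "'
    return raw_source.index(marker) + len(marker)

# Case 1: no escaped characters, so field offset + base offset works.
plain = {"title": "Intro", "content": "Lucene and Java search"}
raw = json.dumps(plain)
start = plain["content"].index("Java")
base = value_offset(raw, "content")
plain_slice = raw[base + start : base + start + 4]
print(plain_slice)

# Case 2: the value contains quotes, which are escaped in the raw JSON,
# so the same arithmetic lands one character too early.
escaped = {"content": 'He said "Java" twice'}
raw2 = json.dumps(escaped)
start2 = escaped["content"].index("Java")
base2 = value_offset(raw2, "content")
escaped_slice = raw2[base2 + start2 : base2 + start2 + 4]
print(escaped_slice)
```

In the second case the slice no longer matches the term, so any offset mapping into _source would have to account for JSON escaping in the value.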

I know this is not a simple requirement, but if 'field specific'
highlighting could somehow use the stored JSON instead of requiring each
individual field to be (separately) stored, that would make great
use/reuse of the stored JSON _source (and no one would ever think of
disabling it :)

Tomislav







Re: Large (stored) fields, json source and highlighting

kimchy
Administrator
Agreed. A bit tricky to implement, but possible. Also, note that this will require loading the full JSON from the index and parsing it in order to get the relevant parts from it. It won't be returned, but it will still be loaded. So it might not make sense when you have several big fields in the index and you want to get fragments for only one of them. But it does make sense when there is one big field.

Also, if this is implemented, it should also be possible to get specific fields out of the JSON as a response as well (similar to asking for specific fields in the search request, maybe call them source_fields).
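
The cost described here can be sketched in a few lines of Python. The function name `extract_source_fields` and the sample document are hypothetical, and this is not the Elasticsearch implementation; it only shows that picking one field out of a stored JSON source still means parsing the whole document.

```python
import json

def extract_source_fields(source_json, wanted):
    """Return only the requested top-level fields from a stored JSON source."""
    # The whole document is parsed here, even though only a few fields
    # are kept; the large fields are loaded but never returned.
    doc = json.loads(source_json)
    return {name: doc[name] for name in wanted if name in doc}

source = json.dumps({
    "title": "Intro",
    "content": "Java " * 5,      # stands in for a multi-megabyte field
    "notes": "short text",
})

print(extract_source_fields(source, ["title"]))
```

With several big fields in the document, all of them would be read and parsed just to return or highlight one, which is why this fits best when there is a single large field.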

Open an issue for this?

-shay.banon

(In reply to Tomislav Poljak's message of Thu, Aug 12, 2010 at 6:06 PM.)

Re: Large (stored) fields, json source and highlighting

Lukáš Vlček
If I read it correctly then I think it partly overlaps with http://github.com/elasticsearch/elasticsearch/issues/issue/308

Lukas

(In reply to Shay Banon's message of Thu, Aug 12, 2010 at 5:53 PM.)

Re: Large (stored) fields, json source and highlighting

Lukáš Vlček
One difference is that issue 308 was meant to return the whole content of _source or some of its fields (or stored fields if "_source" is not used). But the point is that the user should be able to specify the Fragmenter type (or provide a custom Fragmenter implementation).

(In reply to Lukáš Vlček's message of Thu, Aug 12, 2010 at 8:03 PM.)

Re: Large (stored) fields, json source and highlighting

kimchy
Administrator
Not sure it overlaps: the fragmenter controls how to break up the highlighted data, while this relates to how to fetch the data to highlight.

-shay.banon

(In reply to Lukáš Vlček's message of Thu, Aug 12, 2010 at 9:08 PM.)

Re: Large (stored) fields, json source and highlighting

Lukáš Vlček
Actually, that ticket has two parts. One is Fragmenter related, and the other is the possibility to say that I want to highlight some portion of the _source data. Imagine I am using only the REST API and, for example, _source is a person with name, address and bio fields; then I would like to say that I want to highlight just the bio field (and I think the NullFragmenter would be needed for this if I want to display the whole content of bio highlighted, not just fragments). The other possibility would be to define the mapping for person in such a way that bio is a stored field; then I could query for stored fields (not pulling the _source field) and say that I want to apply the NullFragmenter to this data while highlighting. But this gets back to Tomislav's situation, because it would mean that bio is probably stored twice: once as part of the source and then separately as a stored bio field.
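
The difference between fragmented and whole-field highlighting can be sketched in plain Python. This is not Lucene's Highlighter or Fragmenter API; the function names, the `<em>` tags, and the sample bio are assumptions. `highlight_whole` mimics the effect expected from a NullFragmenter (the full text comes back, with matches marked), while `highlight_fragments` returns short snippets around each match.

```python
import re

def highlight_whole(text, term, pre="<em>", post="</em>"):
    """Return the full text with every case-insensitive match wrapped in tags."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(0) + post, text)

def highlight_fragments(text, term, radius=20, pre="<em>", post="</em>"):
    """Return one snippet per match, with up to `radius` characters of context."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    snippets = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - radius):m.start()]
        right = text[m.end():m.end() + radius]
        snippets.append(left + pre + m.group(0) + post + right)
    return snippets

bio = "Ten years of Java development, later moved to search with java-based tools."
print(highlight_whole(bio, "Java"))
print(highlight_fragments(bio, "Java", radius=10))
```

Either way the text to highlight has to come from somewhere, either a stored bio field or the parsed _source, which is the storage question raised earlier in the thread.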

Lukas

(In reply to Shay Banon's message of Thu, Aug 12, 2010 at 8:41 PM.)

Re: Large (stored) fields, json source and highlighting

kimchy
Administrator
So, what you want is to be able to get just the bio field, without the full source, and without the bio field being stored? If so, then the response I gave applies here: the same logic could also be used to get fields through something like a "source_field" notion. It does mean that the full source will need to be retrieved and parsed. Not sure how highlighting comes into play here...

-shay.banon

(In reply to Lukáš Vlček's message of Thu, Aug 12, 2010 at 9:53 PM.)

Re: Large (stored) fields, json source and highlighting

Lukáš Vlček
Imagine a search app for HR: Candidate Catalog (cool name!).
The entities stored in the index look like this: person: { id, name, address, bio }
Now, I am using just the REST API. Say I search for "Java" and would like to display a list of names, and let users click an individual name to display the whole bio with Java highlighted in it (this is where highlighting comes into play!). Right now I can display the bio (just using the GET REST API with the given document ID), but not highlighted. So I was thinking that it would be cool to have this function.

Lukas

2010/8/12 Shay Banon <[hidden email]>
So, what you want is to be able to get just the bio field, without the full source, and without the bio field being stored? If so, then the response I gave, where the logic might apply also to get fields using something like "source_field" notion applies here. It does mean that the full source will need to be retrieved and parsed. Not sure how highlighting comes into play here...

-shay.banon


On Thu, Aug 12, 2010 at 9:53 PM, Lukáš Vlček <[hidden email]> wrote:
Actually, that ticket has two parts. One is Fragmenter related and the other one is possibility to tell, that I want to highlight some portion of _source data. Imagine I am using only REST API and for example if _source is a person with name, address and bio fields then I would like to tell that I want to highlight just the bio field (and I think the NullFragmenter would be needed for this if I want to display whole content of bio highlighted, not just fragments). The other possibility would be to define mapping for person in such a way that bio would be a stored field, then I could query for stored fields (not pulling the _source field) and tell the I want to apply NullFragmenter to this data while highlighting. But this gets back to the Tomislav's situation, because this would mean that bio is probably stored twice, once as a part of source and then separately as a stored bio field.

Lukas



Re: Large (stored) fields, json source and highlighting

kimchy
Administrator
Ahh, I see. So you would still need to provide a query to the GET API in order to do the highlighting, right?


Re: Large (stored) fields, json source and highlighting

Lukáš Vlček
If I want to display the whole bio highlighted, then I can either get "_source" and cut bio from it on the client side, but in this case I need to tell ES to use highlighting on it first. Or I need to specify in the mapping that bio is also stored and use a fields query (http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/), but again I need to tell ES to highlight it. And in neither case do I want only fragments; I want the WHOLE content of the field. The first approach is not possible now; the latter is possible but requires bio to be explicitly stored (and it is already stored in _source).

Hope this makes it clear. (Sorry if I confused you).

Lukas
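
As a rough illustration of the first approach described above (fetch "_source", cut out bio on the client, highlight it there), here is a minimal Python sketch. It is a client-side workaround, not something Elasticsearch does: the function name, the sample document and the <em> tags are assumptions made for the example, and matching is a plain case-insensitive whole-word regex rather than the index analyzer, so the result can differ from server-side highlighting.

```python
import re

def highlight_whole_field(source, field, terms, pre="<em>", post="</em>"):
    """Return the complete text of source[field] with each query term wrapped in tags."""
    text = source[field]
    if not terms:
        return text  # nothing to highlight
    # Whole-word, case-insensitive match on any of the terms.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: pre + m.group(1) + post, text)

# Hypothetical "_source" of one person document, as returned by a GET by ID.
source = {
    "name": "The Dude",
    "address": "Prague",
    "bio": "Writes Java daily. Loves java and Lucene.",
}
print(highlight_whole_field(source, "bio", ["dude", "java"]))
```

Note that the whole document still travels to the client here, which is exactly the cost the thread is trying to avoid for large fields.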


Re: Large (stored) fields, json source and highlighting

Lukáš Vlček
Oh, and one more note, see below:

On Thu, Aug 12, 2010 at 9:22 PM, Lukáš Vlček <[hidden email]> wrote:
If I want to display the whole bio highlighted, then I can either get "_source" and cut bio from it on the client side, but in this case I need to tell ES to use highlighting on it first. Or I need to specify in the mapping that bio is also stored and use a fields query (http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/), but again I need to tell ES to highlight it. And in neither case do I want only fragments; I want the WHOLE content of the field. The first approach is not possible now; the latter is possible but requires bio to be explicitly stored (and it is already stored in _source).

And the latter also requires specifying a Fragmenter that returns the whole body, not fragments; hence my reference to NullFragmenter, which is not implemented in the FastVectorHighlighter API (as far as I understand it) but can be found in the older Highlighting API. That is why I also opened http://github.com/elasticsearch/elasticsearch/issues/issue/307

Maybe it would be better if NullFragmenter-like functionality were contributed directly to the Lucene FastVectorHighlighter API. I was looking at the FVH API today and I think I can try to implement such a Fragmenter.
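
To make the difference concrete, here is a small Python sketch of what "no fragmentation" means compared with a size-limited fragment. It is not the Lucene implementation: the function name, the "..." markers and the window arithmetic are assumptions chosen for the illustration, and only the first occurrence of a single term is highlighted.

```python
def highlight(text, term, fragment_size=None, pre="<em>", post="</em>"):
    """Highlight the first occurrence of term in text.

    fragment_size=None behaves like a null fragmenter: the whole text is kept.
    Otherwise only a window of roughly fragment_size characters around the
    match is returned, with "..." marking the cut ends.
    """
    start = text.lower().find(term.lower())
    if start < 0:
        return None  # no match, nothing to highlight
    end = start + len(term)
    if fragment_size is None:
        lo, hi = 0, len(text)
    else:
        pad = max(fragment_size - len(term), 0) // 2
        lo, hi = max(start - pad, 0), min(end + pad, len(text))
    body = text[lo:start] + pre + text[start:end] + post + text[end:hi]
    return ("..." if lo > 0 else "") + body + ("..." if hi < len(text) else "")

name = "The Dude Abides"
print(highlight(name, "dude", fragment_size=10))  # cut around the match
print(highlight(name, "dude"))                    # whole field, nothing cut
```

With a short field such as a name, the size-limited variant cuts words in half, while the null-fragmenter variant returns the field intact with only the tags added.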


Hope this makes it clear. (Sorry if I confused you).

Lukas


Re: Large (stored) fields, json source and highlighting

kimchy
Administrator
OK, so you want to get the whole bio field highlighted. Then you would need to pass the query to the GET API as well; otherwise there is no way to highlight it (aside from the other things you need, like the option to do no fragmentation and getting the actual data).


Re: Large (stored) fields, json source and highlighting

Lukáš Vlček
Yes, I did not realize that earlier, but you are right that I will need to pass the query into the highlight section as well.
Take the following example:

I need to display all candidates that match the "dude java" query, and then I want to allow the user to click on an individual name and get the whole bio highlighted.

So how can I go about this:
First, I can get the relevant documents using a simple "query_string" query for "dude java". I can now display the names of candidates without highlights, plus highlighted fragments from the bio for each name; this is the kind of basic search interface that already works now. But if I wanted to display a highlighted name, I would get something like "..e <em>Dude</em> Abid...", which is not what I want (sure, I can work with the fragment size, but that is just a workaround and does not fit all situations). So when using that "query_string" query I would like to specify in the highlight section that person.name should be highlighted with no fragments.
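
To illustrate the difference between fragmenting and not fragmenting, here is a small Python sketch (plain string handling, not the ES or Lucene highlighter). The candidate name "The Dude Abides", the function names and the fragment size are made up for illustration.

```python
import re


def build_pattern(terms):
    """Case-insensitive whole-word pattern for the query terms."""
    return re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
                      re.IGNORECASE)


def highlight_whole(text, terms):
    """NullFragmenter-like behaviour: highlight matches, return the whole value."""
    return build_pattern(terms).sub(lambda m: "<em>" + m.group(0) + "</em>", text)


def highlight_fragment(text, terms, fragment_size):
    """Return a window of about fragment_size chars around the first match."""
    match = build_pattern(terms).search(text)
    if match is None:
        return None
    center = (match.start() + match.end()) // 2
    start = max(0, center - fragment_size // 2)
    end = min(len(text), start + fragment_size)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + highlight_whole(text[start:end], terms) + suffix


name = "The Dude Abides"  # hypothetical candidate name
terms = ["dude", "java"]

whole = highlight_whole(name, terms)
fragment = highlight_fragment(name, terms, fragment_size=11)
```

Here `whole` keeps the complete name, while `fragment` cuts the surrounding words, which is the effect I want to avoid for short fields like person.name.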

Second, when the user clicks an individual name, I want to get the whole bio highlighted.
So I need to get a specific document (by ID) and have the bio field highlighted (and the name field as well).
An example of the query that could be used:

curl -XGET http://localhost:9200/_all/_search -d '
{ "query" : { "term" : { "person-id" : "1234" } },
  "highlight" : {
    "fields" : {
      "_source" : {
        "path" : "person.bio,person.name",
        "fragmenter" : "classpath.to.NullFragmenter",
        "query" : {
          "query_string" : { "fields" : ["bio","name"], "query" : "dude java" }
        }
      }
    }
  }
}'

or I could use a fields query:

curl -XGET http://localhost:9200/_all/_search -d '
{ "query" : { "term" : { "person-id" : "1234" } },
  "fields" : ["bio","name"],
  "highlight" : {
    "fields" : {
      "bio" : {
        "query" : {
          "query_string" : { "fields" : ["bio"], "query" : "dude java" }
        }
      },
      "name" : {
        "query" : {
          "query_string" : { "fields" : ["name"], "query" : "dude java" }
        }
      }
    }
  }
}'

The latter query requires both bio and name to be stored (and this is where it gets back to Tomislav's original point, I think).
Ugh! I am complicating it way too much... but I hope the request is clear now :-)

Regards,
Lukas


2010/8/13 Shay Banon <[hidden email]>
ok, so you want to get the whole bio field highlighted, so you would need to pass the query to the get API as well, otherwise, there is no way to highlight it (aside from other things you need, like the option to do no fragmentation and getting the actual data).


On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček <[hidden email]> wrote:
Oh, and one more note, see below:

On Thu, Aug 12, 2010 at 9:22 PM, Lukáš Vlček <[hidden email]> wrote:
If I want to display the whole bio highlighted, then I can either get "_source" and cut the bio from it on the client side, but in this case I need to tell ES to apply highlighting to it first. Or I need to specify in the mapping that bio is also stored and use a fields query http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/ but again I need to tell ES to highlight it. And in neither case do I want only fragments; I want the WHOLE content of the field. The first approach is not possible now; the latter is possible but requires bio to be explicitly stored (and it is already stored in _source).

And the latter also requires specifying a Fragmenter that returns the whole body, not fragments, thus my reference to NullFragmenter, which is not implemented in the FastVectorHighlighter API (as far as I understand it); it can be found in the older Highlighting API, so I also opened http://github.com/elasticsearch/elasticsearch/issues/issue/307

Maybe it would be better if NullFragmenter-like functionality were contributed directly to the Lucene FastVectorHighlighter API. I was looking at the FVH API today and I think I can try to implement such a Fragmenter.


Hope this makes it clear. (Sorry if I confused you).

Lukas

2010/8/12 Shay Banon <[hidden email]>
Ahh, I see. So you would still need to provide a query to the GET api in order to do the highlighting, right?


On Thu, Aug 12, 2010 at 10:08 PM, Lukáš Vlček <[hidden email]> wrote:
Imagine a search app for HR: Candidate catalog (cool name!).
The entities stored in the index are as follows: person: { id, name, address, bio }
Now I am using just the REST API. Say I search for "Java" and I would like to display a list of names and allow users to click an individual name, which would display the whole bio with Java highlighted in it (this is where highlighting comes into play!). Now I can display the bio (just using the GET REST API with a given document ID), but not highlighted. So I was thinking that it would be cool to have this function.

Lukas

2010/8/12 Shay Banon <[hidden email]>

So, what you want is to be able to get just the bio field, without the full source, and without the bio field being stored? If so, then the response I gave applies here: the same logic might also be used to get fields, using something like a "source_field" notion. It does mean that the full source will need to be retrieved and parsed. Not sure how highlighting comes into play here...

-shay.banon


On Thu, Aug 12, 2010 at 9:53 PM, Lukáš Vlček <[hidden email]> wrote:
Actually, that ticket has two parts. One is Fragmenter related and the other is the possibility to say that I want to highlight some portion of the _source data. Imagine I am using only the REST API: if, for example, _source is a person with name, address and bio fields, then I would like to say that I want to highlight just the bio field (and I think the NullFragmenter would be needed for this if I want to display the whole content of bio highlighted, not just fragments). The other possibility would be to define the mapping for person in such a way that bio would be a stored field; then I could query for stored fields (not pulling the _source field) and specify that I want to apply NullFragmenter to this data while highlighting. But this gets back to Tomislav's situation, because it would mean that bio is probably stored twice, once as a part of the source and then separately as a stored bio field.

Lukas


On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon <[hidden email]> wrote:
Not sure if it overlaps: the fragmenter controls how to break up the highlighted data, while this relates to how to fetch the data to highlight.

-shay.banon


On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček <[hidden email]> wrote:
One of the differences is that issue 308 was meant to return the whole content of the _source or some of its fields (or stored fields if not using "_source"). But the point is that the user should be able to specify the Fragmenter type (or provide a custom implementation of Fragmenter).


On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček <[hidden email]> wrote:
If I read it correctly then I think it partly overlaps with http://github.com/elasticsearch/elasticsearch/issues/issue/308

Lukas


On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon <[hidden email]> wrote:
Agreed. A bit tricky to implement, but possible. Also, note that this will require loading the full json from the index and parsing it in order to get the relevant parts from it. It won't be returned, but it will still be loaded. So it might not make sense when you have several big fields in the index and you want to get fragments for only one of them. But it does make sense when there is one big field.

Also, if this is implemented, it should also be possible to get specific fields out of the json in the response (similar to asking for specific fields in the search request; maybe call them source_fields).
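
As a rough illustration of the "load and parse the full json, return only parts of it" idea, here is a Python sketch. It is not ES code; the function name, the dotted-path convention and the sample person document are assumptions made for the example.

```python
import json


def source_fields(source_json, paths):
    """Return {path: value} for each dotted path found in the raw source.

    The whole source string is still loaded and parsed; only the selected
    parts are returned. A JSON null is treated the same as a missing field.
    """
    doc = json.loads(source_json)
    picked = {}
    for path in paths:
        node = doc
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = None
                break
        if node is not None:
            picked[path] = node
    return picked


# Hypothetical stored source for one candidate.
raw_source = json.dumps({
    "person": {
        "id": "1234",
        "name": "The Dude Abides",
        "address": "Prague",
        "bio": "Writes Java for a living.",
    }
})

picked = source_fields(raw_source, ["person.bio", "person.name", "person.missing"])
```

The cost noted above is visible here: `json.loads` touches the entire source even when only one small field is requested.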

Open an issue for this?

-shay.banon


Re: Large (stored) fields, json source and highlighting

Lukáš Vlček


On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček <[hidden email]> wrote:
The latter query requires both bio and name to be stored (and this is where it gets back to Tomislav's original point, I think).
Ugh! I am complicating it way too much... but I hope the request is clear now :-)

Sure, I am complicating it too much, because in the latter query example I forgot to specify NullFragmenter :-)
 

Regards,
Lukas


Re: Large (stored) fields, json source and highlighting

kimchy
Administrator
No problem, I understand the general idea of the requirement.

-shay.banon

Re: Large (stored) fields, json source and highlighting

Tomislav Poljak
Hi,
I'm not sure I fully understand what will be implemented as a
result/conclusion of the discussion here, but I think I can define what
I would like to be implemented (from my point of view) pretty clearly:

It would be great if ES, besides returning the whole document source (in
json format) in search results, supported returning a json-type
structure with 'matching fields' and/or requested fields. Only fields
which are matched by a query or requested would be returned (from the
_source json), and this would be possible without storing each field
separately. The value in these fields would be either the whole field
value (with highlighting applied) or a highlighting snippet (for large
textual fields).
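
To show what I mean, here is a rough client-side Python sketch of the
behaviour (not an ES API). The function name, the size thresholds and the
sample document are made up for illustration, and only flat string
fields are handled.

```python
import re


def matching_fields(source, terms, requested=(), max_whole=200, fragment_size=60):
    """Return matched (highlighted) and requested fields from a parsed source.

    Small matched fields come back whole with highlighting applied; large
    ones come back as a snippet around the first match. Requested fields
    without a match come back unchanged. Non-string values are skipped.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    result = {}
    for name, value in source.items():
        if not isinstance(value, str):
            continue
        match = pattern.search(value)
        if match is None:
            if name in requested:
                result[name] = value
            continue
        if len(value) <= max_whole:
            text = value
        else:
            start = max(0, match.start() - fragment_size // 2)
            text = value[start:match.end() + fragment_size // 2]
        result[name] = pattern.sub(lambda m: "<em>" + m.group(0) + "</em>", text)
    return result


# Hypothetical document with one large field.
source = {
    "name": "The Dude Abides",
    "address": "Prague",
    "bio": ("filler " * 100) + "writes Java daily " + ("filler " * 100),
}

result = matching_fields(source, ["dude", "java"], requested=["address"])
```

Here 'name' is returned whole with highlighting, 'bio' is returned as a
short snippet, and 'address' is returned only because it was requested.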

Will something like that be possible?

Thanks,
      Tomislav


On Fri, 2010-08-13 at 12:50 +0300, Shay Banon wrote:

> No problem, I understand the general idea of the requirement.
>
>
> -shay.banon
>
> On Fri, Aug 13, 2010 at 11:21 AM, Lukáš Vlček <[hidden email]>
> wrote:
>        
>        
>        
>         On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček
>         <[hidden email]> wrote:
>                 Yes, I did not realize that earlier but you are right
>                 that I will need to pass query into the highlight
>                 section as well.
>                 Take the following example:
>                
>                
>                 I need to display all candidates that match the "dude
>                 java" query, and then I want to allow the user to click
>                 on an individual name and get the whole bio highlighted.
>
>
>                 So how can I go about this?
>                 First, I can get the relevant documents using a simple
>                 "query_string" query for "dude java". I can now
>                 display the names of candidates without highlights and
>                 highlighted fragments from the bio for each name, a
>                 basic search interface that already works now. But if
>                 I wanted to display the highlighted name I would get
>                 something like "..e <em>Dude</em> Abid..." which is
>                 not what I want (sure, I can work with the fragment size
>                 but that is just a workaround and does not fit all
>                 situations). So when using that "query_string" query I
>                 would like to specify in the highlight section that
>                 person.name should be highlighted with no
>                 fragments.
>
>
>                 Second, when the user clicks an individual name,
>                 I want to get the whole bio highlighted.
>                 So I need to get a specific document (by ID) and have
>                 the bio field highlighted (and the name field as well).
>                 An example of a query that could be used:
>                
>                 curl -XGET http://localhost:9200/_all/_search -d '
>                 { "query" : { "term" : { "person-id" : "1234" } },
>                   "highlight" : {
>                     "fields" : {
>                       "_source" : {
>                         "path" : "person.bio,person.name",
>                         "fragmenter" : "classpath.to.NullFragmenter",
>                         "query" : {
>                           "query_string" : { "fields" : ["bio","name"], "query" : "dude java" }
>                         }
>                       }
>                     }
>                   }
>                 }'
>                
>                
>                 or I could use fields query:
>                
>                
>                 curl -XGET http://localhost:9200/_all/_search -d '
>                 { "query" : { "term" : { "person-id" : "1234" } },
>                   "fields" : ["bio","name"],
>                   "highlight" : {
>                     "fields" : {
>                       "bio" : {
>                         "query" : {
>                           "query_string" : { "fields" : ["bio"], "query" : "dude java" }
>                         }
>                       },
>                       "name" : {
>                         "query" : {
>                           "query_string" : { "fields" : ["name"], "query" : "dude java" }
>                         }
>                       }
>                     }
>                   }
>                 }'
>                
>                
>                 The latter query requires both bio and name to be
>                 stored (and this is where it gets back to Tomislav's
>                 original point, I think).
>                 Ugh! I am complicating it way too much... but I hope the
>                 request is clear now :-)
>
>
>         Sure I am complicating it too much, because in the latter query
>         example I forgot to specify NullFragmenter :-)
>        
>          
>                
>                
>                 Regards,
>                 Lukas
>                
>                
>                
>                 2010/8/13 Shay Banon <[hidden email]>
>                
>                
>                         ok, so you want to get the whole bio field
>                         highlighted, so you would need to pass the
>                         query to the get API as well, otherwise, there
>                         is no way to highlight it (aside from other
>                         things you need, like the option to do no
>                         fragmentation and getting the actual data).
>                        
>                        
>                        
>                         On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček
>                         <[hidden email]> wrote:
>                                 Oh, and one more note, see below:
>                                
>                                 On Thu, Aug 12, 2010 at 9:22 PM, Lukáš
>                                 Vlček <[hidden email]> wrote:
>                                         If I want to display the whole bio highlighted then I can either get "_source" and cut bio from it on the client side, but in this case I need to tell ES to use highlighting on it first. Or I need to specify in the mapping that bio is also stored and use a fields query http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/ but again I need to tell ES to highlight it. In neither case do I want only fragments; I want the WHOLE content of the field. The first approach is not possible now; the latter is possible but requires bio to be explicitly stored (and it is already stored in _source).
>                                
>                                
>                                 And the latter also requires specifying a Fragmenter that returns the whole body, not fragments, thus my reference to NullFragmenter, which is not implemented in the FastVectorHighlighter API (as far as I understand it); it can be found in the older Highlighting API, so I also opened http://github.com/elasticsearch/elasticsearch/issues/issue/307
>
>
>                                 Maybe it would be better if the NullFragmenter-like functionality were contributed directly to the Lucene FastVectorHighlighter API. I was looking at the FVH API today and I think I can try to implement such a Fragmenter.
>                                
>                                
>                                
>                                        
>                                        
>                                         Hope this makes it clear.
>                                         (Sorry if I confused you).
>                                        
>                                        
>                                        
>                                         Lukas
>                                        
>                                         2010/8/12 Shay Banon
>                                         <[hidden email]>
>                                                 Ahh, I see. So you
>                                                 would still need to
>                                                 provide a query to the
>                                                 GET api in order to do
>                                                 the highlighting,
>                                                 right?
>                                                
>                                                
>                                                
>                                                 On Thu, Aug 12, 2010 at 10:08 PM, Lukáš Vlček <[hidden email]> wrote:
>                                                         Imagine a search app for HR: Candidate catalog (cool name!). The entities stored in the index are as follows:
>                                                         person: { id, name, address, bio }
>                                                         Now I am using just the REST API. Say I search for "Java" and I would like to display a list of names and allow users to click an individual name, which would display the whole bio with Java highlighted in it (this is where highlighting comes into play!). Now I can display the bio (just using the GET REST API with the given document ID), but not highlighted. So I was thinking that it would be cool to have this function.
>                                                        
>                                                        
>                                                         Lukas
>                                                        
>                                                         2010/8/12 Shay Banon <[hidden email]>
>
>
>                                                                 So, what you want is to be able to get just the bio field, without the full source, and without the bio field being stored? If so, then the response I gave applies here: the same logic might also be used to get fields, using something like a "source_field" notion. It does mean that the full source will need to be retrieved and parsed. Not sure how highlighting comes into play here...
>                                                                
>                                                                
>                                                                 -shay.banon
>                                                                
>                                                                
>                                                                
>                                                                 On Thu, Aug 12, 2010 at 9:53 PM, Lukáš Vlček <[hidden email]> wrote:
>                                                                         Actually, that ticket has two parts. One is Fragmenter related and the other is the possibility to say that I want to highlight some portion of the _source data. Imagine I am using only the REST API and, for example, _source is a person with name, address and bio fields; then I would like to say that I want to highlight just the bio field (and I think the NullFragmenter would be needed for this if I want to display the whole content of bio highlighted, not just fragments). The other possibility would be to define the mapping for person in such a way that bio is a stored field; then I could query for stored fields (not pulling the _source field) and say that I want to apply NullFragmenter to this data while highlighting. But this gets back to Tomislav's situation, because it would mean that bio is probably stored twice, once as part of the source and then separately as a stored bio field.
>                                                                        
>                                                                        
>                                                                         Lukas
>                                                                        
>                                                                        
>                                                                        
>                                                                         On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon <[hidden email]> wrote:
>                                                                                 Not sure if it overlaps: the fragmenter controls how to break up the highlighted data, while this relates to how to fetch the data to highlight.
>                                                                                
>                                                                                
>                                                                                 -shay.banon
>                                                                                
>                                                                                
>                                                                                
>                                                                                 On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček <[hidden email]> wrote:
>                                                                                         One of the differences is that issue 308 was meant to return the whole content of the _source or some of its fields (or stored fields if not using "_source"). But the point is that the user should be able to specify the Fragmenter type (or provide a custom implementation of Fragmenter).
>                                                                                        
>                                                                                        
>                                                                                        
>                                                                                         On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček <[hidden email]> wrote:
>                                                                                                 If I read it correctly then I think it partly overlaps with http://github.com/elasticsearch/elasticsearch/issues/issue/308
>                                                                                                
>                                                                                                
>                                                                                                 Lukas
>                                                                                                
>                                                                                                
>                                                                                                
>                                                                                                 On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon <[hidden email]> wrote:
>                                                                                                         Agreed. A bit tricky to implement, but possible. Also, note that this will require loading the full json from the index and parsing it in order to get the relevant parts from it. It won't be returned, but it will still be loaded. So, it might not make sense when you have several big fields in the index and you want to get fragments for one of them. But it does make sense when you have one big field.
>
>
>                                                                                                         Also, if this is implemented, it should also be possible to get specific fields out of the json as a response (similar to asking for specific fields in the search request, maybe call them source_fields).
>                                                                                                        
>                                                                                                        
>                                                                                                         Open an issue for this?
>                                                                                                        
>                                                                                                        
>                                                                                                         -shay.banon
>                                                                                                        
>                                                                                                        
>                                                                                                        
>                                                                                                         On Thu, Aug 12, 2010 at 6:06 PM, Tomislav Poljak <[hidden email]> wrote:


Re: Large (stored) fields, json source and highlighting

Tomislav Poljak
Hi,
actually, this requirement can be summarized even more briefly:

It would be really (really) great if ES could provide 'highlighting' and
'fields' features from the search API without the need for each field to
be stored separately (by reusing stored json _source field).

Do you think this would be possible?
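
A rough client-side sketch of the idea in Python, only to make the request concrete. It is not how ES works internally: it pulls the requested fields out of the stored json _source by dotted path and highlights them with a plain regex instead of the real analyzers and term vectors. The function names (extract_field, highlight, highlight_from_source) and the fragment_size parameter are made up for illustration.

```python
import json
import re


def extract_field(source, path):
    """Follow a dotted path such as 'person.bio' through the parsed _source."""
    value = source
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def highlight(text, terms, fragment_size=None, pre="<em>", post="</em>"):
    """Wrap whole-word, case-insensitive matches of `terms` in pre/post tags.

    With fragment_size=None the whole field is returned (no fragmentation).
    Otherwise a single window of about fragment_size characters around the
    first match is returned. Returns None when nothing matches.
    """
    if not terms:
        return None
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    matches = list(pattern.finditer(text))
    if not matches:
        return None
    start, end = 0, len(text)
    if fragment_size is not None and len(text) > fragment_size:
        start = max(0, matches[0].start() - fragment_size // 2)
        end = min(len(text), start + fragment_size)
    parts, pos = [], start
    for m in matches:
        # Skip matches that do not fit completely inside the window.
        if m.start() < start or m.end() > end:
            continue
        parts.append(text[pos:m.start()])
        parts.append(pre + m.group(0) + post)
        pos = m.end()
    parts.append(text[pos:end])
    return "".join(parts)


def highlight_from_source(source_json, paths, terms, fragment_size=None):
    """Return {path: highlighted text} for requested fields that match."""
    source = json.loads(source_json)
    result = {}
    for path in paths:
        value = extract_field(source, path)
        if isinstance(value, str):
            snippet = highlight(value, terms, fragment_size)
            if snippet is not None:
                result[path] = snippet
    return result


doc = '{"person": {"id": "1234", "name": "Dude Abid", "bio": "Writes Java every day."}}'
print(highlight_from_source(doc, ["person.name", "person.bio"], ["dude", "java"]))
# {'person.name': '<em>Dude</em> Abid', 'person.bio': 'Writes <em>Java</em> every day.'}
```

Note that the whole json still has to be loaded and parsed for every hit, even though only the highlighted parts are returned.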

Tomislav


On Fri, 2010-08-13 at 16:30 +0200, Tomislav Poljak wrote:

> Hi,
> I'm not sure I fully understand what will be implemented as a
> result of the discussion here, but I think I can define what I
> would like to be implemented (from my point of view) pretty clearly as:
>
> It would be great if ES, besides returning the whole document source (in json
> format) in search results, supported returning a json-type structure with
> 'matching fields' and/or requested fields. Only fields which are matched
> by a query or requested would be returned (from the _source json), and this
> would be possible without storing each field separately. The value in these
> fields would be either the whole field value (with highlighting applied)
> or a highlighting snippet (for large textual fields).
>
> Will something like that be possible?
>
> Thanks,
>       Tomislav
>
>
> On Fri, 2010-08-13 at 12:50 +0300, Shay Banon wrote:
> > No problem, I understand the general idea of the requirement.
> >
> >
> > -shay.banon
> >
> > On Fri, Aug 13, 2010 at 11:21 AM, Lukáš Vlček <[hidden email]>
> > wrote:
> >        
> >        
> >        
> >         On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček
> >         <[hidden email]> wrote:
> >                 Yes, I did not realize that earlier but you are right
> >                 that I will need to pass query into the highlight
> >                 section as well.
> >                 Take the following example:
> >                
> >                
> >                 I need to display all candidates that match "dude
> >                 java" query and then I want to allow user to click on
> >                 individual name and get whole bio highlighted.
> >                
> >                
> >                 So here is how I can go about this:
> >                 First, I can get relevant documents using simple
> >                 "query_string" query for "dude java". I can now
> >                 display names of candidates without highlights and
> >                 highlighted fragments from bio for each name, kind of
> >                 basic search interface that already works now. But if
> >                 I wanted to display highlighted name I would get
> >                 something like "..e <em>Dude</em> Abid..." which is
> >                 not what I want (sure, I can work with fragment size
> >                 but that is just workaround and does not fit all
> >                 situations). So when using that "query_string" query I
> >                 would like to specify in the highlight section that
> >                 the person.name should be highlighted with no
> >                 fragments.
> >                
> >                
> >                 Second, now, when the user clicks individual name,
> >                 then I want to get whole bio highlighted.
> >                 So I need to get specific document (by ID) and have
> >                 the bio field highlighted (and the name field as well)
> >                 The example of the query that could be used:
> >                
> >                 curl -XGET http://localhost:9200/_all/_search -d '
> >                 { "query" : { "term" : { "person-id" : "1234" } },
> >                   "highlight" : {
> >                     "fields" : {
> >                       "_source" : {
> >                         "path" : "person.bio,person.name",
> >                         "fragmenter" : "classpath.to.NullFragmenter",
> >                         "query" : {
> >                           "query_string" : { "fields" : ["bio","name"], "query" : "dude java" }
> >                         }
> >                       }
> >                     }
> >                   }
> >                 }'
> >                
> >                
> >                 or I could use fields query:
> >                
> >                
> >                 curl -XGET http://localhost:9200/_all/_search -d '
> >                 { "query" : { "term" : { "person-id" : "1234" } },
> >                   "fields" : ["bio","name"],
> >                   "highlight" : {
> >                     "fields" : {
> >                       "bio" : {
> >                         "query" : {
> >                           "query_string" : { "fields" : ["bio"], "query" : "dude java" }
> >                         }
> >                       },
> >                       "name" : {
> >                         "query" : {
> >                           "query_string" : { "fields" : ["name"], "query" : "dude java" }
> >                         }
> >                       }
> >                     }
> >                   }
> >                 }'
> >                
> >                
> >                 The latter query requires both bio and name to be
> >                 stored (and this is where it gets back to Tomislav's
> >                 original point, I think).
> >                 Ugh! I am complicating it way too much... but hope the
> >                 request is clear now :-)
> >        
> >        
> >         Sure I am complicating it too much, because in the latter query
> >         example I forgot to specify NullFragmenter :-)
> >        
> >          
> >                
> >                
> >                 Regards,
> >                 Lukas
> >                
> >                
> >                
> >                 2010/8/13 Shay Banon <[hidden email]>
> >                
> >                
> >                         ok, so you want to get the whole bio field
> >                         highlighted, so you would need to pass the
> >                         query to the get API as well, otherwise, there
> >                         is no way to highlight it (aside from other
> >                         things you need, like the option to do no
> >                         fragmentation and getting the actual data).
> >                        
> >                        
> >                        
> >                         On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček
> >                         <[hidden email]> wrote:
> >                                 Oh, and one more note, see below:
> >                                
> >                                 On Thu, Aug 12, 2010 at 9:22 PM, Lukáš
> >                                 Vlček <[hidden email]> wrote:
> >                                         If I want to display the whole bio highlighted, then I can
> >                                         either get "_source" and cut bio from it on the client side,
> >                                         but in this case I need to tell ES to use highlighting on it
> >                                         first. Or I need to specify in the mapping that bio is also
> >                                         stored and use the fields query
> >                                         http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/
> >                                         but again I need to tell ES to highlight it. And in neither
> >                                         case do I want only fragments; I want the WHOLE content of
> >                                         the field. The first approach is not possible now; the
> >                                         latter is possible but requires bio to be explicitly stored
> >                                         (and it is already stored in _source).
> >                                
> >                                
> >                                 And the latter also requires specifying a Fragmenter that
> >                                 returns the whole body, not fragments, thus my reference to
> >                                 NullFragmenter, which is not implemented in the
> >                                 FastVectorHighlighter API (as far as I understand it); it
> >                                 can be found in the older Highlighting API, thus I also opened
> >                                 http://github.com/elasticsearch/elasticsearch/issues/issue/307
> >                                
> >                                
> >                                 Maybe it would be better if the NullFragmenter-like
> >                                 functionality were contributed directly into the Lucene
> >                                 FastVectorHighlighter API. I was looking at the FVH API
> >                                 today and I think I can try to implement such a Fragmenter.
> >                                
> >                                
> >                                
> >                                        
> >                                        
> >                                         Hope this makes it clear.
> >                                         (Sorry if I confused you).
> >                                        
> >                                        
> >                                        
> >                                         Lukas
> >                                        
> >                                         2010/8/12 Shay Banon <[hidden email]>
> >                                                 Ahh, I see. So you would still need to provide a query
> >                                                 to the GET API in order to do the highlighting, right?
> >                                                
> >                                                
> >                                                
> >                                                 On Thu, Aug 12, 2010 at 10:08 PM, Lukáš Vlček <[hidden email]> wrote:
> >                                                         Imagine a search app for HR: Candidate catalog (cool name!).
> >                                                         The entities stored in the index are as follows:
> >                                                         person: { id, name, address, bio }
> >                                                         Now I am using just the REST API. Say I search for "Java" and I
> >                                                         would like to display a list of names and allow users to click an
> >                                                         individual name, which would display the whole bio with Java
> >                                                         highlighted in it (here is where highlighting comes into play!).
> >                                                         Now I can display the bio (just using the GET REST API with the
> >                                                         given document ID), but not highlighted.
> >                                                         So I was thinking that it would be cool to have this function.
> >                                                        
> >                                                        
> >                                                         Lukas
> >                                                        
> >                                                         2010/8/12 Shay Banon <[hidden email]>
> >                                                        
> >                                                        
> >                                                                 So, what you want is to be able to get just the bio field, without the full source, and without the bio field being stored? If so, then the response I gave applies here as well: the same logic could also be used to get fields, using something like a "source_field" notion. It does mean that the full source will need to be retrieved and parsed. Not sure how highlighting comes into play here...
> >                                                                
> >                                                                
> >                                                                 -shay.banon
> >                                                                
> >                                                                
> >                                                                
> >                                                                 On Thu, Aug 12, 2010 at 9:53 PM, Lukáš Vlček <[hidden email]> wrote:
> >                                                                         Actually, that ticket has two parts. One is Fragmenter related and the other is the possibility to say that I want to highlight some portion of the _source data. Imagine I am using only the REST API: for example, if _source is a person with name, address and bio fields, then I would like to say that I want to highlight just the bio field (and I think the NullFragmenter would be needed for this if I want to display the whole content of bio highlighted, not just fragments). The other possibility would be to define the mapping for person in such a way that bio would be a stored field; then I could query for stored fields (not pulling the _source field) and say that I want to apply NullFragmenter to this data while highlighting. But this gets back to Tomislav's situation, because this would mean that bio is probably stored twice, once as a part of the source and then separately as a stored bio field.
> >                                                                        
> >                                                                        
> >                                                                         Lukas
> >                                                                        
> >                                                                        
> >                                                                        
> >                                                                         On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon <[hidden email]> wrote:
> >                                                                                 Not sure if it overlaps: the fragmenter controls how to break up the highlighted data, while this relates to how to fetch the data to highlight.
> >                                                                                
> >                                                                                
> >                                                                                 -shay.banon
> >                                                                                
> >                                                                                
> >                                                                                
> >                                                                                 On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček <[hidden email]> wrote:
> >                                                                                         One of the differences is that issue 308 was meant to return the whole content of the _source or some of its fields (or stored fields if not using "_source"). But the point is that the user should be able to specify the Fragmenter type (or provide a custom implementation of Fragmenter).
> >                                                                                        
> >                                                                                        
> >                                                                                        
> >                                                                                         On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček <[hidden email]> wrote:
> >                                                                                                 If I read it correctly then I think it partly overlaps with http://github.com/elasticsearch/elasticsearch/issues/issue/308
> >                                                                                                
> >                                                                                                
> >                                                                                                 Lukas
> >                                                                                                
> >                                                                                                
> >                                                                                                
> >                                                                                                 On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon <[hidden email]> wrote:
> >                                                                                                         Agreed. A bit tricky to implement, but possible. Also, note that this will require loading the full json from the index and parsing it in order to get the relevant parts from it. It won't be returned, but it will still be loaded. So, it might not make sense when you have several big fields in the index and you want to get fragments for one of them. But it does make sense when there is one big field.
> >                                                                                                        
> >                                                                                                        
> >                                                                                                         Also, if this is implemented, it should also be possible to get specific fields out of the json as a response as well (similar to asking for specific fields in the search request, maybe call them source_fields).
> >                                                                                                        
> >                                                                                                        
> >                                                                                                         Open an issue for this?
> >                                                                                                        
> >                                                                                                        
> >                                                                                                         -shay.banon
> >                                                                                                        
> >                                                                                                        
> >                                                                                                        
> >                                                                                                         On Thu, Aug 12, 2010 at 6:06 PM, Tomislav Poljak <[hidden email]> wrote:
> >                                                                                                                 Hi,
> >                                                                                                                 I really like all the features stored json (enabled _source) provides in
> >                                                                                                                 both Java API (used it in indexing/updating) and REST API (used it for
> >                                                                                                                 searching).
> >                                                                                                                
> >                                                                                                                 I do however have one possible request for improvement regarding
> >                                                                                                                 documents with large textual fields and overall highlighting.
> >                                                                                                                
> >                                                                                                                 When there is a requirement to index/search documents with large textual
> >                                                                                                                 fields (like 'content' with text in Mb, which is not unusual), returning
> >                                                                                                                 a whole json for each result in result set can be impossible (if each
> >                                                                                                                 json document has a few Mb in 'content' returning 30-50 results to a
> >                                                                                                                 client doesn't sound realistic or even possible in acceptable time with
> >                                                                                                                 usual 'Internet' bandwidth).
> >                                                                                                                
> >                                                                                                                 But, usually it's acceptable (or even requirement) to display/return
> >                                                                                                                 only highlighting snippets for 30-50 results matches and retrieve whole
> >                                                                                                                 document (json source) only for a single document (when requested by
> >                                                                                                                 exact ID).
> >                                                                                                                
> >                                                                                                                 To be able to provide highlighting snippets for large textual
> >                                                                                                                 ('content') field, it needs to be stored.
> >                                                                                                                
> >                                                                                                                 Now we are in a situation where, because the textual field is too big (making it
> >                                                                                                                 impossible to return the json source for 30-50 results to a client), we need
> >                                                                                                                 to store it twice in the index (once as a part of the original json source in
> >                                                                                                                 _source and a second time as a 'content' field with "store" : "yes" for
> >                                                                                                                 highlighting). This makes the index a lot bigger.
> >                                                                                                                
> >                                                                                                                 Also, if there is a requirement to display highlighting for all fields
> >                                                                                                                 (separate highlight snippets for each field where match occurred,
> >                                                                                                                 without mixing fields snippets -> stored _all field can not be used for
> >                                                                                                                 highlighting in such case) then whole document (all fields) needs to be
> >                                                                                                                 stored twice.
> >                                                                                                                
> >                                                                                                                 In this case seems only logical to disable _source field (since all
> >                                                                                                                 fields are stored anyway) and when whole document needs to be retrieved
> >                                                                                                                 use (newly added) fields=* feature (I've read similar discussion thread
> >                                                                                                                 which led to this enhancement)
> >                                                                                                                
> >                                                                                                                 Here is my question/proposal:
> >                                                                                                                
> >                                                                                                                 would it be possible to enable use of json _source field for 'field
> >                                                                                                                 specific' highlighting, where matching snippet needs to be returned
> >                                                                                                                 separately for each field name?
> >                                                                                                                
> >                                                                                                                 Maybe to have term_vector for each field, but to somehow 'adjust' or
> >                                                                                                                 recalculate positions_offsets to point to text snippet in _source
> >                                                                                                                 instead of stored field?
> >                                                                                                                
> >                                                                                                                 I know this is not a simple requirement, but if 'field specific'
> >                                                                                                                 highlighting could somehow use stored json instead of requiring an
> >                                                                                                                 individual field to be (separately) stored, that would make a great
> >                                                                                                                 use/reuse of stored json _source (and no one would ever think of
> >                                                                                                                 disabling it :)
> >                                                                                                                
> >                                                                                                                 Tomislav

Re: Large (stored) fields, json source and highlighting

kimchy
Administrator
Yea, the discussion took a different turn from the original request. Yes, it is possible, with the mentioned downsides: the full source needs to be loaded and parsed during the "fetch" phase on the specific node. If that causes performance problems, it can be solved by adding more replicas (and you can do that dynamically in the upcoming 0.9.1). Can you open a feature request for it?
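
To make the cost mentioned above concrete, here is a minimal Python sketch of what "load the full source and parse it" implies for getting a few fields out of _source. It is not how ES implements anything; the function name extract_source_fields, the dotted-path convention and the sample document are assumptions made up for illustration.

```python
import json


def extract_source_fields(source_json, paths):
    """Return only the requested dotted paths from a stored JSON source.

    Hypothetical helper: note that the WHOLE document is parsed even when
    a single small field is requested, which is the downside discussed.
    """
    source = json.loads(source_json)  # full parse, regardless of what is asked for
    result = {}
    for path in paths:
        node = source
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None  # path does not exist in this document
                break
            node = node[key]
        if node is not None:
            result[path] = node
    return result


# Made-up sample document shaped like the person example in this thread.
doc = '{"person": {"id": "1234", "name": "Dude Abid", "bio": "Java developer"}}'
print(extract_source_fields(doc, ["person.name", "person.bio"]))
```

With one large field per document the extra parsing is probably acceptable; with several large fields, every request pays for all of them even if only one is returned.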

On Fri, Aug 13, 2010 at 5:47 PM, Tomislav Poljak <[hidden email]> wrote:
Hi,
actually this requirement can be even more summarized:

It would be really (really) great if ES could provide 'highlighting' and
'fields' features from the search API without the need for each field to
be stored separately (by reusing stored json _source field).

Do you think this would be possible?
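
As a rough illustration of the request (not an ES or Lucene API), the Python sketch below highlights a field value taken from a parsed _source document, either as the whole value (the NullFragmenter-like case discussed in this thread) or as short fragments around each match. The function name highlight, the fragment_size parameter and the sample data are assumptions; the matching is a naive case-insensitive word match, not a real analyzer or term vector lookup.

```python
import json
import re


def highlight(text, terms, fragment_size=None, pre="<em>", post="</em>"):
    """Naive highlighter over a plain string (hypothetical helper).

    fragment_size=None returns the whole text with all matches tagged;
    otherwise one fragment per match is returned, with roughly
    fragment_size characters of surrounding context.
    """
    if not terms:
        return [text] if fragment_size is None else []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    if fragment_size is None:
        # Whole field value, no fragmentation.
        return [pattern.sub(lambda m: pre + m.group(0) + post, text)]
    fragments = []
    half = fragment_size // 2
    for m in pattern.finditer(text):
        start = max(0, m.start() - half)
        end = min(len(text), m.end() + half)
        # Only the match that produced this fragment is tagged.
        fragments.append(text[start:m.start()] + pre + m.group(0) + post + text[m.end():end])
    return fragments


# Made-up document; the field is read from the parsed source, not a stored field.
source = json.loads('{"person": {"name": "Dude Abid", "bio": "Dude is a Java developer who likes java and coffee."}}')
bio = source["person"]["bio"]
print(highlight(bio, ["dude", "java"]))                    # whole bio, highlighted
print(highlight(bio, ["dude", "java"], fragment_size=10))  # short snippets
```

The point of the sketch is only that both the 'fields' and the 'highlighting' results can be derived from the one stored JSON document, at the price of parsing it for every hit.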

Tomislav


On Fri, 2010-08-13 at 16:30 +0200, Tomislav Poljak wrote:
> Hi,
> I'm not sure I fully understand what will be implemented as a
> result/conclusion of discussion here, but I think I can define what I
> would like to be implemented (from my point of view) pretty clearly as:
>
> It would be great if ES, beside returning whole document source (in json
> format) in search results, supported returning json type structure with
> 'matching fields' and/or requested fields. Only fields which are matched
> by a query or requested would be returned (from _source json) and this
> would be possible without storing each field separately. Value in these
> fields would be either a whole field value (with highlighting applied)
> or a highlighting snippet (for large textual fields).
>
> Will something like that be possible?
>
> Thanks,
>       Tomislav
>
>
> On Fri, 2010-08-13 at 12:50 +0300, Shay Banon wrote:
> > No problem, I understand the general idea of the requirement.
> >
> >
> > -shay.banon
> >
> > On Fri, Aug 13, 2010 at 11:21 AM, Lukáš Vlček <[hidden email]>
> > wrote:
> >
> >
> >
> >         On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček
> >         <[hidden email]> wrote:
> >                 Yes, I did not realize that earlier but you are right
> >                 that I will need to pass query into the highlight
> >                 section as well.
> >                 Take the following example:
> >
> >
> >                 I need to display all candidates that match "dude
> >                 java" query and then I want to allow user to click on
> >                 individual name and get whole bio highlighted.
> >
> >
> >                 So how I can go about this:
> >                 First, I can get relevant documents using simple
> >                 "query_string" query for "dude java". I can now
> >                 display names of candidates without highlights and
> >                 highlighted fragments from bio for each name, kind of
> >                 basic search interface that already works now. But if
> >                 I wanted to display highlighted name I would get
> >                 something like "..e <em>Dude</em> Abid..." which is
> >                 not what I want (sure, I can work with fragment size
> >                 but that is just workaround and does not fit all
> >                 situations). So when using that "query_string" query I
> >                 would like to specify in the highlight section that
> >                 the person.name should be highlighted with no
> >                 fragments.
> >
> >
> >                 Second, now, when the user clicks individual name,
> >                 then I want to get whole bio highlighted.
> >                 So I need to get specific document (by ID) and have
> >                 the bio field highlighted (and the name field as well)
> >                 The example of the query that could be used:
> >
> >                 curl -XGET http://localhost:9200/_all/_search -d '
> >                 { "query" : { "term" : { "person-id" : "1234" } },
> >                   "highlight" : {
> >                     "fields" : {
> >                       "_source" : {
> >                         "path" : "person.bio,person.name",
> >                         "fragmenter" : "classpath.to.NullFragmenter",
> >                         "query" : {
> >                           "query_string" : { "fields" :
> >                 ["bio","name"], "query" : "dude java" }
> >                         }
> >                       }
> >                     }
> >                   }
> >                 }'
> >
> >
> >                 or I could use fields query:
> >
> >
> >                 curl -XGET http://localhost:9200/_all/_search -d '
> >                 { "query" : { "term" : { "person-id" : "1234" } },
> >                   "fields" : ["bio","name"],
> >                   "highlight" : {
> >                     "fields" : {
> >                       "bio" : {
> >                         "query" : {
> >                           "query_string" : { "fields" : ["bio"],
> >                 "query" : "dude java" }
> >                         }
> >                       },
> >                       "name" : {
> >                         "query" : {
> >                           "query_string" : { "fields" : ["name"],
> >                 "query" : "dude java" }
> >                         }
> >                       }
> >                     }
> >                   }
> >                 }'
> >
> >
> >                 The latter query requires both bio and name to be
> >                 stored (and this is where it gets back to Tomislav's
> >                 original point I think).
> >                 Ugh! I am complicating it way too much... but hope the
> >                 request is clear now :-)
> >
> >
> >         Sure I am complicating it too much because in the latter query
> >         example I forgot to specify NullFragmenter :-)
> >
> >
> >
> >
> >                 Regards,
> >                 Lukas
> >
> >
> >
> >                 2010/8/13 Shay Banon <[hidden email]>
> >
> >
> >                         ok, so you want to get the whole bio field
> >                         highlighted, so you would need to pass the
> >                         query to the get API as well, otherwise, there
> >                         is no way to highlight it (aside from other
> >                         things you need, like the option to do no
> >                         fragmentation and getting the actual data).
> >
> >
> >
> >                         On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček
> >                         <[hidden email]> wrote:
> >                                 Oh, and one more note, see below:
> >
> >                                 On Thu, Aug 12, 2010 at 9:22 PM, Lukáš
> >                                 Vlček <[hidden email]> wrote:
> >                                         If I want to display whole bio
> >                                         highlighted then I can either
> >                                         get "_source" and cut bio from
> >                                         it on the client side but in
> >                                         this case I need to tell ES to
> >                                         use highlighting on it first.
> >                                         Or I need to specify in
> >                                         mapping that bio is also
> >                                         stored and use fields
> >                                         query http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/
> >                                         but again I need to tell ES to highlight it. In neither case do I
> >                                         want only fragments; I want the WHOLE content of the field. The
> >                                         first approach is not possible now; the latter is possible but
> >                                         requires bio to be explicitly stored (and it is already stored
> >                                         in _source).
> >
> >
> >                                 And the latter also requires
> >                                 specification of Fragmenter that
> >                                 returns whole body, not fragments,
> >                                 thus my reference to NullFragmenter,
> >                                 which is not implemented in
> >                                 FastVectorHighlighter API (as far as I
> >                                 understand it), it can be found in the
> >                                 older Highlighting API, thus I opened
> >                                 also http://github.com/elasticsearch/elasticsearch/issues/issue/307
> >
> >
> >                                 Maybe it would be better if the
> >                                 NullFragmenter-like functionality were
> >                                 contributed directly to the Lucene
> >                                 FastVectorHighlighter API. I was
> >                                 looking at the FVH API today and I
> >                                 think I can try to implement such
> >                                 a Fragmenter.
> >
> >
> >
> >
> >
> >                                         Hope this makes it clear.
> >                                         (Sorry if I confused you).
> >
> >
> >
> >                                         Lukas
> >
> >                                         2010/8/12 Shay Banon
> >                                         <[hidden email]>
> >                                                 Ahh, I see. So you
> >                                                 would still need to
> >                                                 provide a query to the
> >                                                 GET api in order to do
> >                                                 the highlighting,
> >                                                 right?
> >
> >
> >
> >                                                 On Thu, Aug 12, 2010 at 10:08 PM, Lukáš Vlček <[hidden email]> wrote:
> >                                                         Imagine a search app for HR: Candidate catalog (cool name!).
> >                                                         The entities stored in the index are as follows:
> >                                                         person: { id, name, address, bio }
> >                                                         Now I am using just the REST API. Say I search for "Java" and I
> >                                                         would like to display a list of Names and allow users to click an
> >                                                         individual name, which would display the whole bio with Java
> >                                                         highlighted in it (here the highlighting comes into play!).
> >                                                         Now I can display the bio (just using the GET REST API with the
> >                                                         given document ID), but not highlighted. So I was thinking that
> >                                                         it would be cool to have this function.
> >
> >
> >                                                         Lukas
> >
> >                                                         2010/8/12 Shay Banon <[hidden email]>
> >
> >
> >                                                                 So, what you want is to be able to get just the bio field, without the full source, and without the bio field being stored? If so, then the response I gave applies here: the same logic might also be used to get fields, using something like a "source_field" notion. It does mean that the full source will need to be retrieved and parsed. Not sure how highlighting comes into play here...
> >
> >
> >                                                                 -shay.banon
> >
> >
> >
> >                                                                 On
> >                                                                 Thu,
> >                                                                 Aug
> >                                                                 12,
> >                                                                 2010
> >                                                                 at
> >                                                                 9:53
> >                                                                 PM,
> >                                                                 Lukáš
> >                                                                 Vlček
> >                                                                 <[hidden email]> wrote:
> >                                                                         Actually, that ticket has two parts. One is Fragmenter related and the other is the possibility to say that I want to highlight some portion of the _source data. Imagine I am using only the REST API: if _source is a person with name, address and bio fields, then I would like to say that I want to highlight just the bio field (and I think the NullFragmenter would be needed for this if I want to display the whole content of bio highlighted, not just fragments). The other possibility would be to define the mapping for person in such a way that bio is a stored field; then I could query for stored fields (not pulling the _source field) and say that I want to apply NullFragmenter to this data while highlighting. But this gets back to Tomislav's situation, because it would mean that bio is probably stored twice, once as a part of the source and then separately as a stored bio field.
> >
> >
> >                                                                         Lukas
> >
> >
> >
> >                                                                         On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon <[hidden email]> wrote:
> >                                                                                 Not sure if it overlaps: the fragmenter controls how to break up the highlighted data, while this relates to how to fetch the data to highlight.
> >
> >
> >                                                                                 -shay.banon
> >
> >
> >
> >                                                                                 On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček <[hidden email]> wrote:
> >                                                                                         One of differences is that the 308 issue was meant to return whole content of the _source or some of its fields (or stored fields if not using "_source"). But the point is that the user should be able to specify Fragmenter type (or provide custom implementation of Fragmenter).
> >
> >
> >
> >                                                                                         On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček <[hidden email]> wrote:
> >                                                                                                 If I read it correctly then I think it partly overlaps with http://github.com/elasticsearch/elasticsearch/issues/issue/308
> >
> >
> >                                                                                                 Lukas
> >
> >
> >
> >                                                                                                 On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon <[hidden email]> wrote:
> >                                                                                                         Agreed. A bit tricky to implement, but possible. Also, note that this will require loading the full json from the index, and parse it in order to get the relevant parts from it. It won't be returned, but still loaded. So, it might not make sense when you have several big fields in the index, and you want to get fragments for one of them. But does make sense when having one big field.
> >
> >
> >                                                                                                         Also, if this is implemented, it should also be possible to get specific fields out of the json as a response as well (similar to asking for specific fields in the search request, maybe call them source_fields).
> >
> >
> >                                                                                                         Open an issue for this?
> >
> >
> >                                                                                                         -shay.banon
> >
> >
> >
> >                                                                                                         On Thu, Aug 12, 2010 at 6:06 PM, Tomislav Poljak <[hidden email]> wrote:
> >                                                                                                                 Hi,
> >                                                                                                                 I really like all the features stored json (enabled _source) provides in
> >                                                                                                                 both Java API (used it in indexing/updating) and REST API (used it for
> >                                                                                                                 searching).
> >
> >                                                                                                                 I do however have one possible request for improvement regarding
> >                                                                                                                 documents with large textual fields and overall highlighting.
> >
> >                                                                                                                 When there is a requirement to index/search documents with large textual
> >                                                                                                                 fields (like 'content' with text in Mb, which is not unusual), returning
> >                                                                                                                 a whole json for each result in result set can be impossible (if each
> >                                                                                                                 json document has a few Mb in 'content' returning 30-50 results to a
> >                                                                                                                 client doesn't sound realistic or even possible in acceptable time with
> >                                                                                                                 usual 'Internet' bandwidth).
> >
> >                                                                                                                 But, usually it's acceptable (or even requirement) to display/return
> >                                                                                                                 only highlighting snippets for 30-50 results matches and retrieve whole
> >                                                                                                                 document (json source) only for a single document (when requested by
> >                                                                                                                 exact ID).
> >
> >                                                                                                                 To be able to provide highlighting snippets for large textual
> >                                                                                                                 ('content') field, it needs to be stored.
> >
> >                                                                                                                 Now we are in a situation where, because the textual field is too big (which makes
> >                                                                                                                 it impossible to return the json source for 30-50 results to a client), we need
> >                                                                                                                 to store it twice in the index (once as a part of the original json source in
> >                                                                                                                 _source and a second time as a 'content' field with "store" : "yes" for
> >                                                                                                                 highlighting). This makes the index a lot bigger.
> >
> >                                                                                                                 Also, if there is a requirement to display highlighting for all fields
> >                                                                                                                 (separate highlight snippets for each field where match occurred,
> >                                                                                                                 without mixing fields snippets -> stored _all field can not be used for
> >                                                                                                                 highlighting in such case) then whole document (all fields) needs to be
> >                                                                                                                 stored twice.
> >
> >                                                                                                                 In this case seems only logical to disable _source field (since all
> >                                                                                                                 fields are stored anyway) and when whole document needs to be retrieved
> >                                                                                                                 use (newly added) fields=* feature (I've read similar discussion thread
> >                                                                                                                 which led to this enhancement)
> >
> >                                                                                                                 Here is my question/proposal:
> >
> >                                                                                                                 would it be possible to enable use of json _source field for 'field
> >                                                                                                                 specific' highlighting, where matching snippet needs to be returned
> >                                                                                                                 separately for each field name?
> >
> >                                                                                                                 Maybe to have term_vector for each field, but to somehow 'adjust' or
> >                                                                                                                 recalculate positions_offsets to point to text snippet in _source
> >                                                                                                                 instead of stored field?
> >
> >                                                                                                                 I know this is not a simple requirement, but if 'field specific'
> >                                                                                                                 highlighting could somehow use stored json instead of requiring an
> >                                                                                                                 individual field to be (separately) stored, that would make a great
> >                                                                                                                 use/reuse of stored json _source (and no one would ever think of
> >                                                                                                                 disabling it :)
> >
> >                                                                                                                 Tomislav


Re: Large (stored) fields, json source and highlighting

Tomislav Poljak
Hi,
I've opened issue 319 for it, here

 http://github.com/elasticsearch/elasticsearch/issues/issue/319


Hope this is fine,

     Tomislav

On Fri, 2010-08-13 at 18:00 +0300, Shay Banon wrote:

> Yea, the discussion took a different turn from the original request.
> Yes, it is possible (with the mentioned downsides of needing to load
> the full source and parsing it on the "fetch" phase within the
> specific node, nothing that can't be solved by adding more replicas
> though if there are performance problems, and you can do that
> dynamically in upcoming 0.9.1). Can you open a feature request for it?
>
> On Fri, Aug 13, 2010 at 5:47 PM, Tomislav Poljak <[hidden email]>
> wrote:
>         Hi,
>         actually this requirement can be even more summarized:
>        
>         It would be really (really) great if ES could provide
>         'highlighting' and 'fields' features from the search API
>         without the need for each field to be stored separately
>         (by reusing stored json _source field).
>        
>         Do you think this would be possible?
>        
>         Tomislav
>        
>        
>        
>         On Fri, 2010-08-13 at 16:30 +0200, Tomislav Poljak wrote:
>         > Hi,
>         > I'm not sure I fully understand what will be implemented as a
>         > result of the discussion here, but I think I can define pretty
>         > clearly what I would like to see implemented (from my point of
>         > view):
>         >
>         > It would be great if ES, besides returning the whole document
>         > source (in json format) in search results, supported returning
>         > a json-like structure with 'matching fields' and/or requested
>         > fields. Only fields which are matched by a query or requested
>         > would be returned (from the _source json), and this would be
>         > possible without storing each field separately. The value in
>         > these fields would be either the whole field value (with
>         > highlighting applied) or a highlighting snippet (for large
>         > textual fields).
>         >
>         > Will something like that be possible?
>         >
>         > Thanks,
>         >       Tomislav
>         >
>         >
>         > On Fri, 2010-08-13 at 12:50 +0300, Shay Banon wrote:
>         > > No problem, I understand the general idea of the requirement.
>         > >
>         > >
>         > > -shay.banon
>         > >
>         > > On Fri, Aug 13, 2010 at 11:21 AM, Lukáš Vlček <[hidden email]> wrote:
>         > >
>         > >         On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček <[hidden email]> wrote:
>         > >                 Yes, I did not realize that earlier, but you are
>         > >                 right that I will need to pass the query into the
>         > >                 highlight section as well.
>         > >                 Take the following example:
>         > >
>         > >                 I need to display all candidates that match the
>         > >                 "dude java" query, and then I want to allow the
>         > >                 user to click on an individual name and get the
>         > >                 whole bio highlighted.
>         > >
>         > >                 So here is how I can go about this:
>         > >                 First, I can get the relevant documents using a
>         > >                 simple "query_string" query for "dude java". I can
>         > >                 now display the names of candidates without
>         > >                 highlights, plus highlighted fragments from the bio
>         > >                 for each name; this is the kind of basic search
>         > >                 interface that already works now. But if I wanted
>         > >                 to display a highlighted name I would get something
>         > >                 like "..e <em>Dude</em> Abid...", which is not what
>         > >                 I want (sure, I can work with the fragment size,
>         > >                 but that is just a workaround and does not fit all
>         > >                 situations). So when using that "query_string"
>         > >                 query I would like to specify in the highlight
>         > >                 section that person.name should be highlighted
>         > >                 with no fragments.
>         > >
>         > >                 Second, when the user clicks an individual name,
>         > >                 I want to get the whole bio highlighted. So I need
>         > >                 to get the specific document (by ID) and have the
>         > >                 bio field highlighted (and the name field as well).
>         > >                 An example of the query that could be used:
>         > >
>         > >                 curl -XGET http://localhost:9200/_all/_search -d '
>         > >                 { "query" : { "term" : { "person-id" : "1234" } },
>         > >                   "highlight" : {
>         > >                     "fields" : {
>         > >                       "_source" : {
>         > >                         "path" : "person.bio,person.name",
>         > >                         "fragmenter" : "classpath.to.NullFragmenter",
>         > >                         "query" : {
>         > >                           "query_string" : { "fields" : ["bio","name"], "query" : "dude java" }
>         > >                         }
>         > >                       }
>         > >                     }
>         > >                   }
>         > >                 }'
>         > >
>         > >
>         > >                 or I could use fields query:
>         > >
>         > >
>         > >                 curl -XGET http://localhost:9200/_all/_search -d '
>         > >                 { "query" : { "term" : { "person-id" : "1234" } },
>         > >                   "fields" : ["bio","name"],
>         > >                   "highlight" : {
>         > >                     "fields" : {
>         > >                       "bio" : {
>         > >                         "query" : {
>         > >                           "query_string" : { "fields" : ["bio"], "query" : "dude java" }
>         > >                         }
>         > >                       },
>         > >                       "name" : {
>         > >                         "query" : {
>         > >                           "query_string" : { "fields" : ["name"], "query" : "dude java" }
>         > >                         }
>         > >                       }
>         > >                     }
>         > >                   }
>         > >                 }'
>         > >
>         > >
>         > >                 The latter query requires both bio and name to be
>         > >                 stored (and this is where it gets back to
>         > >                 Tomislav's original point, I think).
>         > >                 Ugh! I am complicating it way too much... but I
>         > >                 hope the request is clear now :-)
>         > >
>         > >
>         > >         Sure I am complicating it too much, because in the latter
>         > >         query example I forgot to specify the NullFragmenter :-)
>         > >
>         > >
>         > >
>         > >
>         > >                 Regards,
>         > >                 Lukas
>         > >
>         > >
>         > >
>         > >                 2010/8/13 Shay Banon <[hidden email]>
>         > >
>         > >                         ok, so you want to get the whole bio field
>         > >                         highlighted, so you would need to pass the
>         > >                         query to the get API as well; otherwise
>         > >                         there is no way to highlight it (aside from
>         > >                         the other things you need, like the option
>         > >                         to do no fragmentation and getting the
>         > >                         actual data).
>         > >
>         > >
>         > >
>         > >                         On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček <[hidden email]> wrote:
>         > >                                 Oh, and one more note, see below:
>         > >
>         > >                                 On Thu, Aug 12, 2010 at 9:22 PM, Lukáš Vlček <[hidden email]> wrote:
>         > >                                         If I want to display the whole bio
>         > >                                         highlighted, then I can either get
>         > >                                         "_source" and cut bio from it on the
>         > >                                         client side, but in this case I need
>         > >                                         to tell ES to use highlighting on it
>         > >                                         first. Or I need to specify in the
>         > >                                         mapping that bio is also stored and
>         > >                                         use a fields query
>         > >                                         (http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/),
>         > >                                         but again I need to tell ES to
>         > >                                         highlight it. And in neither case do
>         > >                                         I want only fragments; I want the
>         > >                                         WHOLE content of the field. The first
>         > >                                         approach is not possible now; the
>         > >                                         latter is possible but requires bio
>         > >                                         to be explicitly stored (and it is
>         > >                                         already stored in _source).
>         > >
>         > >
>         > >                                 And the latter also requires specifying a
>         > >                                 Fragmenter that returns the whole body, not
>         > >                                 fragments, thus my reference to
>         > >                                 NullFragmenter, which is not implemented in
>         > >                                 the FastVectorHighlighter API (as far as I
>         > >                                 understand it); it can be found in the older
>         > >                                 Highlighting API, thus I also opened
>         > >                                 http://github.com/elasticsearch/elasticsearch/issues/issue/307
>         > >
>         > >
>         > >                                 Maybe it would be better if the
>         > >                                 NullFragmenter-like functionality were
>         > >                                 contributed directly into the Lucene
>         > >                                 FastVectorHighlighter API. I was looking at
>         > >                                 the FVH API today and I think I can try to
>         > >                                 implement such a Fragmenter.
>         > >
>         > >
>         > >
>         > >
>         > >
>         > >                                         Hope this makes it clear.
>         > >                                         (Sorry if I confused you).
>         > >
>         > >
>         > >
>         > >                                         Lukas
>         > >
>         > >                                         2010/8/12 Shay Banon <[hidden email]>
>         > >                                                 Ahh, I see. So you would still
>         > >                                                 need to provide a query to the
>         > >                                                 GET api in order to do the
>         > >                                                 highlighting, right?
>         > >
>         > >
>         > >
>         > > On Thu, Aug 12, 2010 at 10:08 PM, Lukáš Vlček <[hidden email]> wrote:
>         > >
>         > > Imagine a search app for HR: Candidate catalog (cool name!).
>         > > The entities stored in the index are as follows:
>         > > person: { id, name, address, bio }
>         > > Now I am using just the REST API. Say I search for "Java" and I
>         > > would like to display a list of names and allow users to click
>         > > an individual name, which would display the whole bio with Java
>         > > highlighted in it (here is where highlighting comes into play!).
>         > > Right now I can display the bio (just using the GET REST API
>         > > with the given document ID), but not highlighted. So I was
>         > > thinking that it would be cool to have this function.
>         > >
>         > > Lukas
>         > >
>         > >
>         > > 2010/8/12 Shay Banon <[hidden email]>
>         > >
>         > > So, what you want is to be able to get just the bio field,
>         > > without the full source, and without the bio field being stored?
>         > > If so, then the response I gave applies here: the same logic
>         > > could also be used to get fields, using something like a
>         > > "source_field" notion. It does mean that the full source will
>         > > need to be retrieved and parsed. Not sure how highlighting comes
>         > > into play here...
>         > >
>         > > -shay.banon
>         > >
>         > >
>         > >
>         > >
>         > > On Thu, Aug 12, 2010 at 9:53 PM, Lukáš Vlček <[hidden email]> wrote:
>         > >
>         > > Actually, that ticket has two parts. One is Fragmenter related
>         > > and the other is the ability to say that I want to highlight
>         > > some portion of the _source data. Imagine I am using only the
>         > > REST API and, for example, _source is a person with name,
>         > > address and bio fields; then I would like to say that I want to
>         > > highlight just the bio field (and I think the NullFragmenter
>         > > would be needed for this if I want to display the whole content
>         > > of bio highlighted, not just fragments). The other possibility
>         > > would be to define the mapping for person in such a way that bio
>         > > is a stored field; then I could query for stored fields (not
>         > > pulling the _source field) and say that I want to apply
>         > > NullFragmenter to this data while highlighting. But this gets
>         > > back to Tomislav's situation, because it would mean that bio is
>         > > probably stored twice: once as a part of the source and then
>         > > separately as a stored bio field.
>         > >
>         > > Lukas
>         > >
>         > >
>         > >
>         > >
>         > > On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon <[hidden email]> wrote:
>         > >
>         > > Not sure if it overlaps: the fragmenter controls how to break up
>         > > the highlighted data, while this relates to how to fetch the
>         > > data to highlight.
>         > >
>         > > -shay.banon
>         > >
>         > >
>         > >
>         > >
>         > > On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček <[hidden email]> wrote:
>         > >
>         > > One of the differences is that the 308 issue was meant to return
>         > > the whole content of the _source or some of its fields (or
>         > > stored fields if not using "_source"). But the point is that the
>         > > user should be able to specify the Fragmenter type (or provide a
>         > > custom implementation of Fragmenter).
>         > >
>         > >
>         > >
>         > >
>         > > On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček <[hidden email]> wrote:
>         > >
>         > > If I read it correctly then I think it partly overlaps with
>         > > http://github.com/elasticsearch/elasticsearch/issues/issue/308
>         > >
>         > > Lukas
>         > >
>         > >
>         > >
>         > >
>         > > On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon <[hidden email]> wrote:
>         > >
>         > > Agreed. A bit tricky to implement, but possible. Also, note that
>         > > this will require loading the full json from the index and
>         > > parsing it in order to get the relevant parts from it. It won't
>         > > be returned, but it will still be loaded. So, it might not make
>         > > sense when you have several big fields in the index and you want
>         > > to get fragments for one of them. But it does make sense when
>         > > there is one big field.
>         > >
>         > > Also, if this is implemented, it should also be possible to get
>         > > specific fields out of the json as a response as well (similar
>         > > to asking for specific fields in the search request, maybe call
>         > > them source_fields).
>         > >
>         > > Open an issue for this?
>         > >
>         > > -shay.banon
>         > >
>         > >
>         > >
>         > >
>         > > On Thu, Aug 12, 2010 at 6:06 PM, Tomislav Poljak <[hidden email]> wrote:
>         > >
>         > > Hi,
>         > > I really like all the features stored json (enabled _source)
>         > > provides in both the Java API (used it in indexing/updating) and
>         > > the REST API (used it for searching).
>         > >
>         > > I do however have one possible request for improvement regarding
>         > > documents with large textual fields and overall highlighting.
>         > >
>         > > When there is a requirement to index/search documents with large
>         > > textual fields (like 'content' with megabytes of text, which is
>         > > not unusual), returning the whole json for each result in the
>         > > result set can be impossible (if each json document has a few MB
>         > > in 'content', returning 30-50 results to a client doesn't sound
>         > > realistic or even possible in acceptable time with usual
>         > > 'Internet' bandwidth).
>         > >
>         > > But usually it's acceptable (or even a requirement) to
>         > > display/return only highlighting snippets for the 30-50 matching
>         > > results and retrieve the whole document (json source) only for a
>         > > single document (when requested by exact ID).
>         > >
>         > > To be able to provide highlighting snippets for a large textual
>         > > ('content') field, it needs to be stored.
>         > >
>         > > Now we are in a situation where, because the textual field is
>         > > too big (which makes it impossible to return the json source for
>         > > 30-50 results to a client), we need to store it twice in the
>         > > index (once as a part of the original json source in _source and
>         > > a second time as a 'content' field with "store" : "yes" for
>         > > highlighting). This makes the index a lot bigger.
>         > >
>         > > Also, if there is a requirement to display highlighting for all
>         > > fields (separate highlight snippets for each field where a match
>         > > occurred, without mixing field snippets -> the stored _all field
>         > > can not be used for highlighting in such a case), then the whole
>         > > document (all fields) needs to be stored twice.
>         > >
>         > > In this case it seems only logical to disable the _source field
>         > > (since all fields are stored anyway) and, when the whole
>         > > document needs to be retrieved, use the (newly added) fields=*
>         > > feature (I've read a similar discussion thread which led to this
>         > > enhancement).
>         > >
>         > > Here is my question/proposal:
>         > >
>         > > would it be possible to enable use of the json _source field for
>         > > 'field specific' highlighting, where the matching snippet needs
>         > > to be returned separately for each field name?
>         > >
>         > > Maybe to have term_vector for each field, but to somehow
>         > > 'adjust' or recalculate positions_offsets to point to the text
>         > > snippet in _source instead of the stored field?
>         > >
>         > > I know this is not a simple requirement, but if 'field specific'
>         > > highlighting could somehow use the stored json instead of
>         > > requiring an individual field to be (separately) stored, that
>         > > would make a great use/reuse of the stored json _source (and no
>         > > one would ever think of disabling it :)
>         > >
>         > > Tomislav